Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, which are designed for discrimination rather than for characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers’ dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
@inproceedings{carbonneau2025analyzing,title={Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis},author={Carbonneau, Marc-André and van Niekerk, Benjamin and Seuté, Hugo and Letendre, Jean-Philippe and Kamper, Herman and Zaïdi, Julian},booktitle={ISCA Speech Synthesis Workshop},year={2025}}
LinearVC: Linear transformations of self-supervised features through the lens of voice conversion
Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, and Marc-André Carbonneau
We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: https://www.kamperh.com/linearvc/
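As a rough illustration of the core recipe, the sketch below fits a least-squares linear map between frame-aligned source and target features and truncates it to rank 100 via SVD. The feature matrices, dimension, and alignment are placeholder assumptions; LinearVC itself operates on real self-supervised speech features.

```python
import numpy as np

# Placeholder feature matrices: in the real method these would be
# frame-aligned self-supervised features from a source and target speaker.
rng = np.random.default_rng(0)
D = 768                               # hypothetical feature dimension
X = rng.normal(size=(5000, D))        # source-speaker frames
Y = rng.normal(size=(5000, D))        # target-speaker frames

# Fit a single linear map W minimizing ||XW - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Constrain the map to low rank by truncating its SVD; the abstract reports
# that a rank of just 100 already gives competitive conversion.
U, s, Vt = np.linalg.svd(W)
k = 100
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

converted = X @ W_k                   # converted features, ready for a vocoder
```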
@inproceedings{kamperlinearvc,title={LinearVC: Linear transformations of self-supervised features through the lens of voice conversion},author={Kamper, Herman and van Niekerk, Benjamin and Zaïdi, Julian and Carbonneau, Marc-André},booktitle={{Interspeech}},year={2025}}
SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting
Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. It first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss. Then we train a model to predict expression from monocular images using a novel semi-supervised scheme that relies on domain adaptation. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to novel identities.
@article{josi_2024,title={SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting},journal={arXiv},author={Josi, Arthur and Hafemann, Luiz Gustavo and Dib, Abdallah and Got, Emeline and Cruz, Rafael M. O. and Carbonneau, Marc-André},year={2025},}
Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control
Amin Fadaeinejad, Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amaury Depierre, Nikolaus F. Troje, Marcus A. Brubaker, and Marc-André Carbonneau
In 6th AI for Creative Visual Content Generation, Editing and Understanding (CVEU) - CVPR, 2025
Fulfilling a precise artistic vision while creating realistic virtual human characters requires extensive manual effort. This paper proposes a novel approach to streamline this process, generating 3D head geometry and enabling precise control over skin tone and fine-grained modification of facial details such as wrinkles. User-specified modifications are conveniently propagated across the entire asset by our models, effectively reducing the amount of manual intervention needed to achieve a specific artistic vision. This is achieved by our proposed texture-generation pipeline that leverages correlations between texture and geometry for different head shapes, ethnicities, and genders. Our method allows for accurate skin-tone control while keeping the other appearance factors unchanged. Lastly, we introduce a method for fine-grained control over the details of the generated heads, which enables artists to freely modify one texture map and have changes cohesively propagated to the other maps. Our experiments show that our method produces diverse and well-behaved geometries, thanks to our GNN-based model, and synthesizes textures that are coherent with the geometry using a CNN-based GAN. We also achieve precise and intuitive skin-tone control through a single control parameter and obtain plausible textures for both face skin and lips. Our experiments with fine-grained editing on common artists’ tasks, such as adding wrinkles or removing a beard, showcase how our method simplifies the head generation workflow by cohesively propagating changes to all texture maps.
@inproceedings{Fadaeinejad2025,author={Fadaeinejad, Amin and Dib, Abdallah and Hafemann, {Luiz Gustavo} and Got, Emeline and Anderson, Trevor and Depierre, Amaury and Troje, Nikolaus F. and Brubaker, Marcus A. and Carbonneau, Marc-André},booktitle={6th AI for Creative Visual Content Generation, Editing and Understanding (CVEU) - CVPR},title={Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control},year={2025},}
2024
Spoken-Term Discovery using Discrete Speech Units
Discovering a lexicon from unlabeled audio is a longstanding challenge for zero-resource speech processing. One approach is to search for frequently occurring patterns in speech. We revisit this idea by proposing DUSTED: Discrete Unit Spoken-TErm Discovery. Leveraging self-supervised models, we encode input audio into sequences of discrete units. Inspired by alignment algorithms from bioinformatics, we find repeated speech patterns by searching for similar sub-sequences of units. Since discretization discards speaker information, DUSTED finds better matches across speakers, improving the coverage and consistency of the discovered patterns. We demonstrate these improvements on the ZeroSpeech Challenge, achieving state-of-the-art results on the spoken-term discovery track. Finally, we analyze the duration distribution of the patterns, showing that our method finds longer word- or phrase-like terms.
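To make the matching step concrete, here is a minimal Smith-Waterman-style local alignment over discrete-unit sequences, in the spirit of the bioinformatics algorithms the abstract mentions. The unit sequences and scoring values are toy assumptions, not the paper's configuration.

```python
def local_align(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Smith-Waterman-style local alignment score between unit sequences."""
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Toy unit sequences from two utterances; a high score flags a shared term.
utt1 = [3, 3, 7, 7, 12, 5, 9]
utt2 = [1, 3, 7, 12, 12, 5, 2]
print(local_align(utt1, utt2))
```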
@inproceedings{vanniekerk2024WD,author={{van Niekerk}, Benjamin and Zaïdi, Julian and Carbonneau, Marc-André and Kamper, Herman},booktitle={{Interspeech}},title={Spoken-Term Discovery using Discrete Speech Units},year={2024},}
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
Vandad Davoodnia, Saeed Ghorbani, Marc-André Carbonneau, Alexandre Messier, and Ali Etemad
We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module refines predictions from a 2D keypoints estimator that operates on a single image by leveraging temporal and cross-view information. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D yields a performance rivaling methods that rely on 3D annotated data, while being the state-of-the-art among methods relying only on 2D supervision.
@inproceedings{Davoodnia2024,author={Davoodnia, Vandad and Ghorbani, Saeed and Carbonneau, Marc-André and Messier, Alexandre and Etemad, Ali},booktitle={ECCV},title={UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues},year={2024},}
MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading
Reconstructing an avatar from a portrait image has many applications in multimedia, but remains a challenging research problem. Extracting reflectance maps and geometry from one image is ill-posed: recovering geometry is a one-to-many mapping problem and reflectance and light are difficult to disentangle. Accurate geometry and reflectance can be captured under the controlled conditions of a light stage, but it is costly to acquire large datasets in this fashion. Moreover, training solely with this type of data leads to poor generalization with in-the-wild images. This motivates the introduction of MoSAR, a method for 3D avatar generation from monocular images. We propose a semi-supervised training scheme that improves generalization by learning from both light stage and in-the-wild datasets. This is achieved using a novel differentiable shading formulation. We show that our approach effectively disentangles the intrinsic face parameters, producing relightable avatars. As a result, MoSAR estimates a richer set of skin reflectance maps, and generates more realistic avatars than existing state-of-the-art methods.
@inproceedings{Dib2023,author={Dib, Abdallah and Hafemann, {Luiz Gustavo} and Got, Emeline and Anderson, Trevor and Fadaeinejad, Amin and Cruz, {Rafael M. O.} and Carbonneau, Marc-André},booktitle={{CVPR}},title={MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading},year={2024},}
BinaryAlign: Word Alignment as Binary Sequence Labeling
Gaëtan Lopez Latouche, Marc-André Carbonneau, and Ben Swanson
Real-world deployments of word alignment are almost certain to cover both high- and low-resource languages. However, the state-of-the-art for this task recommends a different model class depending on the availability of gold alignment training data for a particular language pair. We propose BinaryAlign, a novel word alignment technique based on binary sequence labeling that outperforms existing approaches in both scenarios, offering a unifying approach to the task. Additionally, we vary the specific choice of multilingual foundation model, perform stratified error analysis over alignment error type, and explore the performance of BinaryAlign on non-English language pairs. We make our source code publicly available.
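A toy sketch of the reformulation: each source word induces an independent binary labeling of the target sentence. The sentences, alignment pairs, and indices below are invented for illustration; the actual model is a fine-tuned multilingual encoder, not shown here.

```python
# Recasting word alignment as binary sequence labeling: for each source word,
# tag every target word as aligned (1) or not (0).
src = ["the", "black", "cat"]
tgt = ["le", "chat", "noir"]
gold = {(0, 0), (1, 2), (2, 1)}   # hypothetical (source, target) index pairs

for i, word in enumerate(src):
    labels = [1 if (i, j) in gold else 0 for j in range(len(tgt))]
    print(word, "->", labels)     # one binary labeling problem per source word
```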
@inproceedings{Lopez2024align,author={{Lopez Latouche}, Gaëtan and Carbonneau, Marc-André and Swanson, Ben},booktitle={ACL},title={BinaryAlign: Word Alignment as Binary Sequence Labeling},year={2024},}
Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection
Gaëtan Lopez Latouche, Marc-André Carbonneau, and Ben Swanson
Grammatical Error Detection (GED) methods rely heavily on human-annotated error corpora. However, these annotations are unavailable in many low-resource languages. In this paper, we investigate GED in this context. Leveraging the zero-shot cross-lingual transfer capabilities of multilingual pre-trained language models, we train a model using data from a diverse set of languages to generate synthetic errors in other languages. These synthetic error corpora are then used to train a GED model. Specifically, we propose a two-stage fine-tuning pipeline where the GED model is first fine-tuned on multilingual synthetic data from target languages followed by fine-tuning on human-annotated GED corpora from source languages. This approach outperforms current state-of-the-art annotation-free GED methods. We also analyse the errors produced by our method and other strong baselines, finding that our approach produces errors that are more diverse and more similar to human errors.
@inproceedings{Lopez2024ZCLT,author={{Lopez Latouche}, Gaëtan and Carbonneau, Marc-André and Swanson, Ben},booktitle={EMNLP},title={Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection},year={2024},}
2023
The La Forge Speech Synthesis System for Blizzard Challenge 2023
Julian Zaïdi, Corentin Duchêne, Hugo Seuté, and Marc-André Carbonneau
In 18th Blizzard Challenge Workshop - Interspeech, 2023
This paper describes the La Forge entry to the Blizzard Challenge of 2023 focusing on text-to-speech in French and homograph disambiguation. Our system is based on VAE-Tacotron and HiFi-GAN. We implement several improvements on the baseline models, such as a cycle consistency loss for better style modeling, a style reference selection method to improve overall naturalness, and an over-produce-and-select method that chooses the best synthesized candidate across multiple variations using automatic speech recognition. We also build a linguistic frontend capable of homograph disambiguation using part-of-speech tagging and simple rules. We publicly release our hand-annotated data set for French homograph disambiguation. Results from subjective listening tests show the effectiveness of our system in disambiguating homographs and generating high-quality synthetic speech.
@inproceedings{zaidi_2023,title={The La Forge Speech Synthesis System for Blizzard Challenge 2023},booktitle={18th Blizzard Challenge Workshop - Interspeech},author={Zaïdi, Julian and Duchêne, Corentin and Seuté, Hugo and Carbonneau, Marc-André},year={2023},}
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveforms, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps, and reach state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data.
@inproceedings{zhu_2022,title={EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis},author={Zhu, Ge and Wen, Yutong and Carbonneau, Marc-Andr{\'e} and Duan, Zhiyao},booktitle={{NeurIPS Workshop: Machine Learning for Audio}},year={2023},}
Rhythm Modeling for Voice Conversion
Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
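As a simplified sketch of the coarse, speaking-rate variant: estimate source and target rates, derive a speed factor, and time-stretch the signal. The naive interpolation stretch and all numbers below are stand-ins, not the paper's segmentation or high-quality stretching.

```python
import numpy as np

def time_stretch(x, speed):
    """Naive time-stretch via linear interpolation; a crude stand-in for the
    higher-quality stretching used in the actual system."""
    n_out = int(round(len(x) / speed))
    t = np.linspace(0, len(x) - 1, n_out)
    return np.interp(t, np.arange(len(x)), x)

src_rate = 42 / 12.0           # hypothetical source: 42 segments in 12 s
tgt_rate = 4.5                 # hypothetical target rate (segments per second)
speed = tgt_rate / src_rate    # > 1: the source must be sped up to match

signal = np.sin(np.linspace(0, 200 * np.pi, 16000))  # placeholder audio
matched = time_stretch(signal, speed)
```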
@article{vanniekerk2023rhythm,author={{van Niekerk}, Benjamin and Carbonneau, Marc-André and Kamper, Herman},journal={IEEE Signal Processing Letters},title={Rhythm Modeling for Voice Conversion},year={2023},volume={30},pages={1297-1301},doi={10.1109/LSP.2023.3313515}}
ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech
We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning 19 different styles. Our code and data are publicly available at https://github.com/ubisoft/ubisoft-laforge-ZeroEGGS.
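The latent-space style control the abstract describes can be pictured in a couple of lines: given style embeddings produced by the learned encoder (random vectors stand in for them here), new styles come from blending or scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
z_happy = rng.normal(size=64)   # placeholder style embedding of one exemplar clip
z_tired = rng.normal(size=64)   # placeholder style embedding of another

z_blend = 0.7 * z_happy + 0.3 * z_tired   # blend two styles in latent space
z_strong = 1.5 * z_happy                  # exaggerate a style by scaling
```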
@article{ZeroEGGS,author={Ghorbani, Saeed and Ferstl, Ylva and Holden, Daniel and Troje, Nikolaus F. and Carbonneau, Marc-André},title={ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech},journal={Computer Graphics Forum},volume={42},number={1},pages={206-216},keywords={animation, gestures, character control, motion capture},doi={https://doi.org/10.1111/cgf.14734},url={https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14734},year={2023},}
2022
Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, tasks in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, but also higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that the model discards speaker identity information from the prosody representation, and consistently generates speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
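For readers unfamiliar with FiLM, the sketch below shows the feature-wise affine modulation it performs; the dimensions and the projection are placeholders, not the model's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 32))          # hidden activations (time x channels)
prosody = rng.normal(size=16)          # hypothetical prosody embedding

W = 0.1 * rng.normal(size=(16, 64))    # projects conditioning to (gamma, beta)
gamma, beta = np.split(prosody @ W, 2) # one scale and shift per channel
out = gamma * h + beta                 # FiLM: feature-wise affine modulation
```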
@inproceedings{zaidi_2022,title={Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis},booktitle={Interspeech},author={Zaïdi, Julian and Seuté, Hugo and {van Niekerk}, Benjamin and Carbonneau, Marc-André},year={2022},}
Measuring Disentanglement: A Review of Metrics
Marc-André Carbonneau, Julian Zaïdi, Jonathan Boilard, and Ghyslain Gagnon
IEEE Transactions on Neural Networks and Learning Systems, 2022
Learning to disentangle and represent factors of variation in data is an important problem in AI. While many advances have been made to learn these representations, it is still unclear how to quantify disentanglement. While several metrics exist, little is known about their implicit assumptions, what they truly measure, and their limits. Consequently, it is difficult to interpret results when comparing different representations. In this work, we survey supervised disentanglement metrics and thoroughly analyze them. We propose a new taxonomy in which all metrics fall into one of three families: intervention-based, predictor-based and information-based. We conduct extensive experiments in which we isolate properties of disentangled representations, allowing stratified comparison along several axes. From our experimental results and analysis, we provide insights into the relations between disentangled representation properties. Finally, we share guidelines on how to measure disentanglement.
@article{Carbonneau2022,author={Carbonneau, Marc-André and Zaïdi, Julian and Boilard, Jonathan and Gagnon, Ghyslain},journal={IEEE Transactions on Neural Networks and Learning Systems},title={Measuring Disentanglement: A Review of Metrics},year={2022},doi={10.1109/TNNLS.2022.3218982},}
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content, leading to mispronunciations. As a solution, we propose soft speech units. To learn soft units, we predict a distribution over discrete speech units. By modeling uncertainty, soft units capture more content information, improving the intelligibility and naturalness of converted speech.
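A minimal sketch of the contrast between the two unit types, using a random codebook and frame as placeholders: a discrete unit snaps a frame to its nearest codebook entry, while a soft unit keeps a distribution over entries (in the paper, predicted by a trained network) and takes its expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 256))   # 100 hypothetical discrete units
frame = rng.normal(size=256)             # one self-supervised feature frame

d = np.linalg.norm(codebook - frame, axis=1)   # distance to each unit
discrete_unit = codebook[np.argmin(d)]         # hard (discrete) assignment

p = np.exp(-d - np.max(-d))
p /= p.sum()                                   # softmax: closer => more probable
soft_unit = p @ codebook                       # expectation over units: soft unit
```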
@inproceedings{van_niekerk_comparaison_2022,title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion},isbn={978-1-66540-540-9},booktitle={{ICASSP}},author={{van Niekerk}, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},year={2022},pages={6562--6566},}
Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022
We present our entry to the GENEA Challenge of 2022 on data-driven co-speech gesture generation. Our system is a neural network that generates gesture animation from an input audio file. The motion style generated by the model is extracted from an exemplar motion clip. Style is embedded in a latent space using a variational framework. This architecture allows for generating gestures in styles unseen during training. Moreover, the probabilistic nature of our variational framework enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. The GENEA challenge evaluation showed that our model produces full-body motion with highly competitive levels of human-likeness.
@inproceedings{ghorbani_exemplar-based_2022,title={Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022},isbn={978-1-4503-9390-4},url={https://dl.acm.org/doi/10.1145/3536221.3558068},doi={10.1145/3536221.3558068},booktitle={International Conference on Multimodal Interaction},publisher={ACM},author={Ghorbani, Saeed and Ferstl, Ylva and Carbonneau, Marc-André},year={2022},pages={778--783},}
2021
Energy Disaggregation using Variational Autoencoders
Antoine Langevin, Marc-André Carbonneau, Mohamed Cheriet, and Ghyslain Gagnon
Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization capability of these methods to different houses, as well as the disaggregation of multi-state appliances, are still major challenges. In this paper, we address these issues and propose an energy disaggregation approach based on the variational autoencoder framework. The probabilistic encoder makes this approach an efficient model for encoding information relevant to the reconstruction of the target appliance consumption. In particular, the proposed model accurately generates more complex load profiles, thus improving the power signal reconstruction of multi-state appliances. Moreover, its regularized latent space improves the generalization capabilities of the model across different houses. The proposed model is compared to state-of-the-art NILM approaches on the UK-DALE and REFIT datasets, and yields competitive results. The mean absolute error is reduced by 18% on average across all appliances compared to the state-of-the-art. The F1-score increases by more than 11%, showing improvements in the detection of the target appliance in the aggregate measurement.
@article{langevin_energy_2021,title={Energy Disaggregation using Variational Autoencoders},volume={254},issn={03787788},doi={10.1016/j.enbuild.2021.111623},journal={Energy \& Buildings},author={Langevin, Antoine and Carbonneau, Marc-André and Cheriet, Mohamed and Gagnon, Ghyslain},year={2021},pages={111623},}
Artist guided generation of video game production quality face textures
Christian Murphy, Sudhir Mudur, Daniel Holden, Marc-André Carbonneau, Donya Ghafourzadeh, and Andre Beauchamp
We develop a high resolution face texture generation system which uses artist provided appearance controls as the conditions for a generative network. Artists are able to control various elements in the generated textures, such as the skin, eye, lip, and hair color. This is made possible by reparameterizing our dataset to the same UV mapping, allowing us to utilize image-to-image translation networks. Although our dataset is limited in size, only 126 samples in total, our system is still able to generate realistic face textures which strongly adhere to the input appearance attribute conditions because of our training augmentation methods. Once our system has generated the face texture, it is ready to be used in a modern game production environment. Thanks to our novel SuperResolution and material property recovery methods, our generated face textures are 4K resolution and have the associated material property maps required for raytraced rendering.
@article{MURPHY2021268,title={Artist guided generation of video game production quality face textures},journal={Computers & Graphics},volume={98},pages={268-279},year={2021},issn={0097-8493},doi={https://doi.org/10.1016/j.cag.2021.06.004},url={https://www.sciencedirect.com/science/article/pii/S0097849321001199},author={Murphy, Christian and Mudur, Sudhir and Holden, Daniel and Carbonneau, Marc-André and Ghafourzadeh, Donya and Beauchamp, Andre},keywords={Face texture generation, Artist guided, SuperResolution, BRDF recovery},}
2020
Feature Learning from Spectrograms for Assessment of Personality Traits
Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand-crafted speech processing features extracted with off-the-shelf toolboxes. To achieve high accuracy, numerous features are typically extracted using complex and highly parameterized algorithms. In this paper, a new method based on feature learning and spectrogram analysis is proposed to simplify the feature extraction process while maintaining a high level of accuracy. The proposed method learns a dictionary of discriminant features from patches extracted from the spectrogram representations of training speech segments. Each speech segment is then encoded using the dictionary, and the resulting feature set is used to perform classification of personality traits. Experiments indicate that the proposed method achieves state-of-the-art results with a significant reduction in complexity when compared to the most recent reference methods. The number of features and the difficulties linked to the feature extraction process are greatly reduced, as only one type of descriptor is used, for which the 6 parameters can be tuned automatically. In contrast, the simplest reference method uses 4 types of descriptors to which 6 functionals are applied, resulting in over 20 parameters to be tuned.
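A compact sketch of this kind of pipeline: sample spectrogram patches, learn a dictionary (k-means stands in for the paper's learner), and encode a segment as a histogram of nearest atoms. The spectrogram, patch size, and dictionary size are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
spectrogram = rng.random((128, 400))   # placeholder for a real spectrogram

def sample_patches(S, size=8, n=500):
    """Extract n random square patches and flatten them into vectors."""
    ys = rng.integers(0, S.shape[0] - size, n)
    xs = rng.integers(0, S.shape[1] - size, n)
    return np.stack([S[y:y + size, x:x + size].ravel() for y, x in zip(ys, xs)])

# Learn a 64-atom dictionary from training patches.
dictionary = KMeans(n_clusters=64, n_init=3, random_state=0).fit(sample_patches(spectrogram))

# Encode a speech segment as a normalized histogram of nearest atoms.
codes = dictionary.predict(sample_patches(spectrogram, n=200))
feature = np.bincount(codes, minlength=64) / len(codes)  # input to a classifier
```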
@article{Carbonneau2020_personality,author={Carbonneau, Marc-André and Granger, Eric and Attabi, Yazid and Gagnon, Ghyslain},journal={IEEE Transactions on Affective Computing},title={Feature Learning from Spectrograms for Assessment of Personality Traits},year={2020},volume={11},number={1},pages={25-31},doi={10.1109/TAFFC.2017.2763132}}
Appearance Controlled Face Texture Generation for Video Game Characters
Christian Murphy, Sudhir Mudur, Daniel Holden, Marc-André Carbonneau, Donya Ghafourzadeh, and Andre Beauchamp
In 13th ACM SIGGRAPH Conference on Motion, Interaction and Games, 2020
Manually creating realistic, digital human heads is a difficult and time-consuming task for artists. While 3D scanners and photogrammetry allow for quick and automatic reconstruction of heads, finding an actor who fits specific character appearance descriptions can be difficult. Moreover, modern open-world video games feature several thousands of characters that cannot realistically all be cast and scanned. Therefore, researchers are investigating generative models to create heads fitting a specific character appearance description. While current methods are able to generate believable head shapes quite well, generating a corresponding high-resolution and high-quality texture which respects the character’s appearance description is not possible using current state-of-the-art methods. This work presents a method that generates synthetic face textures under the following constraints: (i) there is no reference photograph to build the texture, (ii) game artists control the generative process by providing precise appearance attributes, the face shape, and the character’s age and gender, and (iii) the texture must be of adequately high resolution and look believable when applied to the given face shape. Our method builds upon earlier deep learning approaches addressing similar problems. We propose several key additions to these methods to be able to use them in our context, specifically for artist control and small training data. Despite a limited amount of training data, just over 100 samples, our model produces realistic textures which comply with a diverse range of skin, hair, lip and iris colors specified through our intuitive description format and augmentation thereof.
@inproceedings{10.1145/3424636.3426898,author={Murphy, Christian and Mudur, Sudhir and Holden, Daniel and Carbonneau, Marc-Andr\'{e} and Ghafourzadeh, Donya and Beauchamp, Andre},title={Appearance Controlled Face Texture Generation for Video Game Characters},year={2020},isbn={9781450381710},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3424636.3426898},doi={10.1145/3424636.3426898},booktitle={13th ACM SIGGRAPH Conference on Motion, Interaction and Games},articleno={9},numpages={11},keywords={artist controlled character creation, image-to-image translation, face texture generation, fine facial features},location={Virtual Event, SC, USA},series={MIG '20},}
2019
Bag-Level Aggregation for Multiple-Instance Active Learning in Instance Classification Problems
A growing number of applications, e.g., video surveillance and medical image analysis, require training recognition systems from large amounts of weakly annotated data, while some targeted interactions with a domain expert are allowed to improve the training process. In such cases, active learning (AL) can reduce labeling costs for training a classifier by querying the expert to provide the labels of most informative instances. This paper focuses on AL methods for instance classification problems in multiple instance learning (MIL), where data are arranged into sets, called bags, which are weakly labeled. Most AL methods focus on single-instance learning problems. These methods are not suitable for MIL problems because they cannot account for the bag structure of data. In this paper, new methods for bag-level aggregation of instance informativeness are proposed for multiple instance AL (MIAL). The aggregated informativeness method identifies the most informative instances based on classifier uncertainty and queries bags incorporating the most information. The other proposed method, called cluster-based aggregative sampling, clusters data hierarchically in the instance space. The informativeness of instances is assessed by considering bag labels, inferred instance labels, and the proportion of labels that remain to be discovered in clusters. Both proposed methods significantly outperform reference methods in extensive experiments using benchmark data from several application domains. Results indicate that using an appropriate strategy to address MIAL problems yields a significant reduction in the number of queries needed to achieve the same level of performance as single-instance AL methods.
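As a small illustration of bag-level aggregation of informativeness: score each instance by classifier uncertainty and query the bag with the largest aggregate score. The probabilities and the sum-aggregation rule here are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
bags = [rng.random(n) for n in (5, 8, 3)]   # per-instance P(positive), one array per bag

def informativeness(p):
    """Uncertainty-based score: 1 at p=0.5, 0 at p=0 or p=1."""
    return 1 - 2 * np.abs(p - 0.5)

# Query the bag whose instances carry the most aggregate information.
query = max(range(len(bags)), key=lambda i: informativeness(bags[i]).sum())
print(f"query bag {query}")
```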
@article{carbonneau_bag-level_2019,author={Carbonneau, Marc-André and Granger, Eric and Gagnon, Ghyslain},journal={IEEE Transactions on Neural Networks and Learning Systems},title={Bag-Level Aggregation for Multiple-Instance Active Learning in Instance Classification Problems},year={2019},volume={30},number={5},pages={1441-1451},doi={10.1109/TNNLS.2018.2869164}}
2018
Multiple instance learning: A survey of problem characteristics and applications
Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research.
@article{Carbonneau2016Survey,title={Multiple instance learning: A survey of problem characteristics and applications},journal={Pattern Recognition},volume={77},pages={329-353},year={2018},issn={0031-3203},doi={https://doi.org/10.1016/j.patcog.2017.10.009},url={https://www.sciencedirect.com/science/article/pii/S0031320317304065},author={Carbonneau, Marc-André and Cheplygina, Veronika and Granger, Eric and Gagnon, Ghyslain},keywords={Multiple instance learning, Weakly supervised learning, Classification, Multi-instance learning, Computer vision, Computer aided diagnosis, Document classification, Drug activity prediction},}
2016
Score thresholding for accurate instance classification in multiple instance learning
Multiple instance learning (MIL) is a form of weakly supervised learning for problems in which training instances are arranged into bags, and a label is provided for whole bags but not for individual instances. Most proposed MIL algorithms focus on bag classification, but more recently, the classification of individual instances has attracted the attention of the pattern recognition community. While these two tasks are similar, there are important differences in the consequences of instance misclassification. In this paper, the scoring function learned by MIL classifiers for the bag classification task is exploited for instance classification by adjusting the decision threshold. A new criterion for the threshold adjustment is proposed and validated using 7 reference MIL algorithms on 3 real-world data sets from different application domains. Experiments show considerable improvements in accuracy over these algorithms for instance classification. In some applications, the unweighted average recall increases by as much as 18%, while the F1-score increases by 12%.
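A minimal sketch of the general idea, with synthetic scores: reuse the bag classifier's instance scores and sweep the decision threshold on validation data. F1 is used as the selection criterion here purely for illustration; the paper proposes its own criterion.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
scores = rng.random(1000)                                  # instance scores from a MIL model
labels = (scores + rng.normal(0, 0.3, size=1000)) > 0.7    # synthetic ground-truth labels

# Sweep candidate thresholds and keep the one maximizing the criterion.
best_t, best_f1 = max(
    ((t, f1_score(labels, scores > t)) for t in np.linspace(0.05, 0.95, 19)),
    key=lambda pair: pair[1],
)
print(f"threshold={best_t:.2f}, F1={best_f1:.3f}")
```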
@inproceedings{carbonneau_ipta2016,author={Carbonneau, Marc-André and Granger, Eric and Gagnon, Ghyslain},booktitle={2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA)},title={Score thresholding for accurate instance classification in multiple instance learning},year={2016},doi={10.1109/IPTA.2016.7821026}}
Robust multiple-instance learning ensembles using random subspace instance selection
Many real-world pattern recognition problems can be modeled using multiple-instance learning (MIL), where instances are grouped into bags, and each bag is assigned a label. State-of-the-art MIL methods provide a high level of performance when strong assumptions are made regarding the underlying data distributions, and the proportion of positive to negative instances in positive bags. In this paper, a new method called Random Subspace Instance Selection (RSIS) is proposed for the robust design of MIL ensembles without any prior assumptions on the data structure and the proportion of instances in bags. First, instance selection probabilities are computed based on training data clustered in random subspaces. A pool of classifiers is then generated using the training subsets created with these selection probabilities. By using RSIS, MIL ensembles are more robust to many data distributions and noise, and are not adversely affected by the proportion of positive instances in positive bags because training instances are repeatedly selected in a probabilistic manner. Moreover, RSIS also allows the identification of positive instances on an individual basis, as required in many practical applications. Results obtained with several real-world and synthetic databases show the robustness of MIL ensembles designed with the proposed RSIS method over a range of witness rates, noisy features and data distributions compared to reference methods in the literature.
@article{CARBONNEAU_RSIS,title={Robust multiple-instance learning ensembles using random subspace instance selection},journal={Pattern Recognition},volume={58},pages={83-99},year={2016},issn={0031-3203},doi={https://doi.org/10.1016/j.patcog.2016.03.035},url={https://www.sciencedirect.com/science/article/pii/S0031320316300346},author={Carbonneau, Marc-André and Granger, Eric and Raymond, Alexandre J. and Gagnon, Ghyslain},keywords={Multiple-instance learning, Random subspace methods, Classifier ensembles, Instance selection, Weakly supervised learning, Classification, MIL},}
Witness identification in multiple instance learning using random subspaces
Multiple instance learning (MIL) is a form of weakly-supervised learning where instances are organized in bags. A label is provided for bags, but not for instances. MIL literature typically focuses on the classification of bags seen as one object, or as a combination of their instances. In both cases, performance is generally measured using labels assigned to entire bags. In this paper, the MIL problem is formulated as a knowledge discovery task for which algorithms seek to discover the witnesses (i.e. identifying positive instances), using the weak supervision provided by bag labels. Some MIL methods are suitable for instance classification, but perform poorly in applications where the witness rate is low, or when the positive class distribution is multimodal. A new method that clusters data projected in random subspaces is proposed to perform witness identification in these adverse settings. The proposed method is assessed on MIL data sets from three application domains, and compared to 7 reference MIL algorithms for the witness identification task. The proposed algorithm consistently ranks among the best methods in all experiments, while all other methods perform unevenly across data sets.
@inproceedings{Carbonneau_icpr,author={Carbonneau, Marc-André and Granger, Eric and Gagnon, Ghyslain},booktitle={2016 23rd International Conference on Pattern Recognition (ICPR)},title={Witness identification in multiple instance learning using random subspaces},year={2016},pages={3639-3644},doi={10.1109/ICPR.2016.7900199}}
2015
Real-time visual play-break detection in sport events using a context descriptor
The detection of play and break segments in team sports is an essential step towards the automation of live game capture and broadcast. This paper presents a two-stage hierarchical method for play-break detection in non-edited video feeds of sport events. Unlike most existing methods, this algorithm performs action and event recognition on content, and thus does not rely on production cues of broadcast feeds. Moreover, the method does not require player tracking, can be used in real-time, and can be easily adapted to different sports. In the first stage, bag-of-words event detectors are trained to recognize key events such as line changes, face-offs and preliminary play-breaks. In the second stage, the output of the detectors along with a novel feature based on spatio-temporal interest points are used to create a context descriptor for the final decision. Experiments demonstrate the efficiency of the proposed method on real hockey game footage, achieving 90% accuracy.
@inproceedings{Carbonneau2015_hockey,title={Real-time visual play-break detection in sport events using a context descriptor},doi={10.1109/ISCAS.2015.7169270},booktitle={2015 {IEEE} {International} {Symposium} on {Circuits} and {Systems} ({ISCAS})},author={Carbonneau, Marc-André and Raymond, Alexandre J. and Granger, Eric and Gagnon, Ghyslain},month=may,year={2015},pages={2808--2811}}
2013
Detection of alarms and warning signals on a digital in-ear device
Marc-André Carbonneau, Narimène Lezzoum, Jérémie Voix, and Ghyslain Gagnon
International Journal of Industrial Ergonomics, May 2013
A majority of workers in industrial environments must wear hearing protection devices. While these hearing protectors provide increased safety in terms of auditory health, in some conditions they also have the adverse effect of preventing individuals from hearing alarm and warning signals, which seriously compromises their safety. Recent advances in the field of microelectronics allow the integration of tiny digital signal processors inside hearing protection devices. This paper develops new algorithms to automatically detect alarm signals in the digitized audio stream fed to the processor. This detection is performed in real-time with low latency to quickly inform the user of a dangerous situation. The algorithms were also optimized to require low computational resources due to the limited processing power of typical embedded electronic devices. The proposed algorithms detect periodicity of the signal amplitude in a determined frequency bandwidth. The system was simulated with a database of alarm signals from a major North-American manufacturer of industrial alarms and warning signals, mixed with typical environmental noises at signal-to-noise ratios ranging from 0 to 15 dBA. The results show an average true-positive recognition rate of 95% for pulsed alarms compliant with the ISO 7331 standard. The system can be optimized for specific alarms, which results in near-100% true-positive and 0.2% false-positive recognition rates.
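The detection principle can be sketched in a few lines: band-limit the signal, compute an amplitude envelope, and test the envelope for periodicity via autocorrelation. The filter band, rates, and the toy alarm below are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000
t = np.arange(0, 3.0, 1 / fs)                       # 3 s of toy audio
alarm = np.sin(2 * np.pi * 2000 * t) * (np.sin(2 * np.pi * 2 * t) > 0)  # 2 Hz pulses

# Band-limit around the alarm's carrier, then take a ~100 Hz amplitude envelope.
sos = butter(4, [1500, 2500], btype="bandpass", fs=fs, output="sos")
env = np.abs(sosfilt(sos, alarm)).reshape(-1, fs // 100).mean(axis=1)
env = env - env.mean()

# Autocorrelation of the envelope reveals the pulsing period.
ac = np.correlate(env, env, mode="full")[len(env) - 1:]
lag = 10 + int(np.argmax(ac[10:]))                  # ignore the zero-lag peak
print(f"estimated pulse rate: {100 / lag:.1f} Hz")  # ~2 Hz for this toy alarm
```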
@article{CARBONNEAU_ALARM,title={Detection of alarms and warning signals on a digital in-ear device},journal={International Journal of Industrial Ergonomics},volume={43},number={6},pages={503-511},year={2013},note={Noise: Assessment \& Control},issn={0169-8141},doi={https://doi.org/10.1016/j.ergon.2012.07.001},url={https://www.sciencedirect.com/science/article/pii/S0169814112000625},author={Carbonneau, Marc-André and Lezzoum, Narimène and Voix, Jérémie and Gagnon, Ghyslain},keywords={Pattern recognition, Digital signal processing, Hearing protection devices, Industrial worker safety},}
Recognition of blowing sound types for real-time implementation in mobile devices
Marc-André Carbonneau, Ghyslain Gagnon, Robert Sabourin, and Jean Dubois
In 2013 IEEE 11th International New Circuits and Systems Conference (NEWCAS), May 2013
This paper presents a system to recognize and classify sounds produced by human subjects blowing air by the mouth. The objective is to implement the system for fast recognition using low-complexity algorithms in a low-budget processor. Recognition is achieved using tailored band energy ratios, a modified frequency centroid, and a periodicity test based on spectrum autocorrelation. These lightweight feature extraction techniques are adapted to the particular task of recognizing blowing sound types. The classification is achieved by a naive Bayes classifier. The algorithm can be implemented in real-time (latency ≤ 100 ms), and experimental test results show average recognition rates over 94%.
@inproceedings{blow,author={Carbonneau, Marc-André and Gagnon, Ghyslain and Sabourin, Robert and Dubois, Jean},booktitle={2013 IEEE 11th International New Circuits and Systems Conference (NEWCAS)},title={Recognition of blowing sound types for real-time implementation in mobile devices},year={2013},pages={1-4},doi={10.1109/NEWCAS.2013.6573655}}