Marc-André Carbonneau

Welcome to my page!

This site contains information regarding my research and some personal projects. Here’s a subset of my research interests:

Machine learning
Speech processing
Disentangled representations
Multiple instance learning
Computer vision

I am a principal research scientist at Ubisoft's La Forge lab, where I have worked since 2017. I lead a group of researchers applying the latest techniques in machine learning, speech, signal processing, computer vision, graphics, and animation to video games.

Before that, as a PhD student, I was affiliated with two labs:

news

Apr 1, 2024	We are excited to share our recent work on monocular 3D face reconstruction, which will be presented at CVPR 2024. We introduce MoSAR, a new method that turns a portrait image into a realistic 3D avatar. From a single image, MoSAR estimates a detailed mesh and texture maps at 4K resolution, capturing pore-level details. This avatar can be rendered from any viewpoint and under different lighting conditions. We are also releasing a new dataset called FFHQ-UV-Intrinsics. This is the first dataset to offer rich intrinsic face attributes (diffuse, specular, ambient occlusion and translucency) at high resolution for 10K subjects. Check out the project page!
Oct 27, 2023	Our paper “EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis” has been accepted for presentation at the NeurIPS Workshop on ML for Audio. This work was done in collaboration with colleagues from the University of Rochester. In this paper, we propose a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). We also reveal a potential concern with diffusion-based audio generation models: they tend to duplicate their training data. Check out the project page!
Sep 21, 2023	Our paper “Rhythm Modeling for Voice Conversion” has been published in IEEE Signal Processing Letters. We also released it on arXiv. In this paper, we model the natural rhythm of speakers so that conversion respects the target speaker’s natural rhythm. Rather than only approximating the global speech rate, we model durations for sonorants, obstruents, and silences. Check out the demo page!
Jul 15, 2023	Ubisoft has published a blog post describing our system for gesture generation conditioned on speech. This system was presented in “ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech” and showcased on Two Minute Papers.
Jul 21, 2021	This is the recording of the presentation I gave at the 2021 Game Developers Conference on “speech synthesis applied to videogames”. Generating spoken dialog lines artificially could prove to be pivotal for the future of the gaming industry. Aside from reducing production costs, it offers opportunities for new types of in-game interactions closer to real-world experiences. The goal of the talk is to present an honest snapshot of the state of the technology and to discuss remaining challenges and possible present and future use cases. We demonstrate how current commercial speech synthesis solutions do not directly apply to the gaming context, where voices require a high level of expressivity. We discuss current solutions for controlling expressivity and how we use speech synthesis at Ubisoft.

selected publications

Spoken-Term Discovery using Discrete Speech Units

Benjamin van Niekerk, Julian Zaïdi, Marc-André Carbonneau, and Herman Kamper

In INTERSPEECH, 2024

Discovering a lexicon from unlabeled audio is a longstanding challenge for zero-resource speech processing. One approach is to search for frequently occurring patterns in speech. We revisit this idea by proposing DUSTED: Discrete Unit Spoken-TErm Discovery. Leveraging self-supervised models, we encode input audio into sequences of discrete units. Inspired by alignment algorithms from bioinformatics, we find repeated speech patterns by searching for similar sub-sequences of units. Since discretization discards speaker information, DUSTED finds better matches across speakers, improving the coverage and consistency of the discovered patterns. We demonstrate these improvements on the ZeroSpeech Challenge, achieving state-of-the-art results on the spoken-term discovery track. Finally, we analyze the duration distribution of the patterns, showing that our method finds longer word- or phrase-like terms.

@inproceedings{vanniekerk2024WD,
  author = {{van Niekerk}, Benjamin and Zaïdi, Julian and Carbonneau, Marc-André and Kamper, Herman},
  booktitle = {{INTERSPEECH}},
  title = {Spoken-Term Discovery using Discrete Speech Units},
  year = {2024},
}

MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading

Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amin Fadaeinejad, Rafael M. O. Cruz, and Marc-André Carbonneau

In CVPR, 2024

Reconstructing an avatar from a portrait image has many applications in multimedia, but remains a challenging research problem. Extracting reflectance maps and geometry from one image is ill-posed: recovering geometry is a one-to-many mapping problem and reflectance and light are difficult to disentangle. Accurate geometry and reflectance can be captured under the controlled conditions of a light stage, but it is costly to acquire large datasets in this fashion. Moreover, training solely with this type of data leads to poor generalization with in-the-wild images. This motivates the introduction of MoSAR, a method for 3D avatar generation from monocular images. We propose a semi-supervised training scheme that improves generalization by learning from both light stage and in-the-wild datasets. This is achieved using a novel differentiable shading formulation. We show that our approach effectively disentangles the intrinsic face parameters, producing relightable avatars. As a result, MoSAR estimates a richer set of skin reflectance maps, and generates more realistic avatars than existing state-of-the-art methods.

@inproceedings{Dib2023,
  author = {Dib, Abdallah and Hafemann, {Luiz Gustavo} and Got, Emeline and Anderson, Trevor and Fadaeinejad, Amin and Cruz, {Rafael M. O.} and Carbonneau, Marc-André},
  booktitle = {{CVPR}},
  title = {MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading},
  year = {2024},
}

BinaryAlign: Word Alignment as Binary Sequence Labeling

Gaëtan Lopez Latouche, Marc-André Carbonneau, and Ben Swanson

In ACL, 2024

Real-world deployments of word alignment are almost certain to cover both high- and low-resource languages. However, the state of the art for this task recommends a different model class depending on the availability of gold alignment training data for a particular language pair. We propose BinaryAlign, a novel word alignment technique based on binary sequence labeling that outperforms existing approaches in both scenarios, offering a unifying approach to the task. Additionally, we vary the specific choice of multilingual foundation model, perform stratified error analysis over alignment error type, and explore the performance of BinaryAlign on non-English language pairs. We make our source code publicly available.

@inproceedings{Lopez2024align,
  author = {{Lopez Latouche}, Gaëtan and Carbonneau, Marc-André and Swanson, Ben},
  booktitle = {ACL},
  title = {BinaryAlign: Word Alignment as Binary Sequence Labeling},
  year = {2024},
}

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Ge Zhu, Yutong Wen, Marc-André Carbonneau, and Zhiyao Duan

In NeurIPS Workshop: Machine Learning for Audio, 2023

Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate in the latent domain with cascaded phase recovery modules to reconstruct the waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, it achieves a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps and reaches state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to the training data.

@inproceedings{zhu_2022,
  title = {EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis},
  author = {Zhu, Ge and Wen, Yutong and Carbonneau, Marc-Andr{\'e} and Duan, Zhiyao},
  booktitle = {{NeurIPS Workshop: Machine Learning for Audio}},
  year = {2023},
}

Measuring Disentanglement: A Review of Metrics

Marc-André Carbonneau, Julian Zaïdi, Jonathan Boilard, and Ghyslain Gagnon

IEEE Transactions on Neural Networks and Learning Systems, 2022

Learning to disentangle and represent factors of variation in data is an important problem in AI. While many advances have been made in learning these representations, it is still unclear how to quantify disentanglement. Several metrics exist, but little is known about their implicit assumptions, what they truly measure, and their limits. Consequently, it is difficult to interpret results when comparing different representations. In this work, we survey supervised disentanglement metrics and thoroughly analyze them. We propose a new taxonomy in which all metrics fall into one of three families: intervention-based, predictor-based and information-based. We conduct extensive experiments in which we isolate properties of disentangled representations, allowing stratified comparison along several axes. From our experimental results and analysis, we provide insights on relations between disentangled representation properties. Finally, we share guidelines on how to measure disentanglement.

@article{Carbonneau2022,
  author = {Carbonneau, Marc-André and Zaïdi, Julian and Boilard, Jonathan and Gagnon, Ghyslain},
  journal = {IEEE Transactions on Neural Networks and Learning Systems},
  title = {Measuring Disentanglement: A Review of Metrics},
  year = {2022},
  doi = {10.1109/TNNLS.2022.3218982},
}

Rhythm Modeling for Voice Conversion

Benjamin van Niekerk, Marc-André Carbonneau, and Herman Kamper

IEEE Signal Processing Letters, 2023

Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic—an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.

@article{vanniekerk2023rhythm,
  author = {{van Niekerk}, Benjamin and Carbonneau, Marc-André and Kamper, Herman},
  journal = {IEEE Signal Processing Letters},
  title = {Rhythm Modeling for Voice Conversion},
  year = {2023},
  volume = {30},
  pages = {1297--1301},
  doi = {10.1109/LSP.2023.3313515},
}

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Matthew Baas, Hugo Seuté, and Herman Kamper

In ICASSP, 2022

The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content, leading to mispronunciations. As a solution, we propose soft speech units. To learn soft units, we predict a distribution over discrete speech units. By modeling uncertainty, soft units capture more content information, improving the intelligibility and naturalness of converted speech.

@inproceedings{van_niekerk_comparaison_2022,
  title = {A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion},
  isbn = {978-1-66540-540-9},
  booktitle = {{ICASSP}},
  author = {{van Niekerk}, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
  year = {2022},
  pages = {6562--6566},
}

Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, and Marc-André Carbonneau

In INTERSPEECH, 2022

@inproceedings{zaidi_2022,
  title = {Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis},
  booktitle = {INTERSPEECH},
  author = {Zaïdi, Julian and Seuté, Hugo and {van Niekerk}, Benjamin and Carbonneau, Marc-André},
  year = {2022},
}

Multiple instance learning: A survey of problem characteristics and applications

Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon

Pattern Recognition, 2018

@article{Carbonneau2016Survey,
  title = {Multiple instance learning: A survey of problem characteristics and applications},
  journal = {Pattern Recognition},
  volume = {77},
  pages = {329--353},
  year = {2018},
  issn = {0031-3203},
  doi = {10.1016/j.patcog.2017.10.009},
  url = {https://www.sciencedirect.com/science/article/pii/S0031320317304065},
  author = {Carbonneau, Marc-André and Cheplygina, Veronika and Granger, Eric and Gagnon, Ghyslain},
  keywords = {Multiple instance learning, Weakly supervised learning, Classification, Multi-instance learning, Computer vision, Computer aided diagnosis, Document classification, Drug activity prediction},
}

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F. Troje, and Marc-André Carbonneau

Computer Graphics Forum, 2023

@article{ZeroEGGS,
  author = {Ghorbani, Saeed and Ferstl, Ylva and Holden, Daniel and Troje, Nikolaus F. and Carbonneau, Marc-André},
  title = {ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech},
  journal = {Computer Graphics Forum},
  volume = {42},
  number = {1},
  pages = {206--216},
  keywords = {animation, gestures, character control, motion capture},
  doi = {10.1111/cgf.14734},
  url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14734},
  year = {2023},
}