Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, which are designed for discrimination rather than for characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers’ dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
@inproceedings{carbonneau2025analyzing,title={Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis},author={Carbonneau, Marc-André and van Niekerk, Benjamin and Seuté, Hugo and Letendre, Jean-Philippe and Kamper, Herman and Zaïdi, Julian},booktitle={ISCA Speech Synthesis Workshop},year={2025}}
LinearVC: Linear transformations of self-supervised features through the lens of voice conversion
Herman Kamper, Benjamin van Niekerk, Julian Zaïdi, and Marc-André Carbonneau
We introduce LinearVC, a simple voice conversion method that sheds light on the structure of self-supervised representations. First, we show that simple linear transformations of self-supervised features effectively convert voices. Next, we probe the geometry of the feature space by constraining the set of allowed transformations. We find that just rotating the features is sufficient for high-quality voice conversion. This suggests that content information is embedded in a low-dimensional subspace which can be linearly transformed to produce a target voice. To validate this hypothesis, we finally propose a method that explicitly factorizes content and speaker information using singular value decomposition; the resulting linear projection with a rank of just 100 gives competitive conversion results. Our work has implications for both practical voice conversion and a broader understanding of self-supervised speech representations. Samples and code: https://www.kamperh.com/linearvc/
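As a rough illustration of the core recipe, the sketch below fits a least-squares linear map between frame-aligned source and target features and truncates it to rank 100 via SVD. The feature matrices, dimension, and alignment are placeholder assumptions; LinearVC itself operates on real self-supervised speech features.

```python
import numpy as np

# Placeholder feature matrices: in the real method these would be
# frame-aligned self-supervised features from a source and target speaker.
rng = np.random.default_rng(0)
D = 768                               # hypothetical feature dimension
X = rng.normal(size=(5000, D))        # source-speaker frames
Y = rng.normal(size=(5000, D))        # target-speaker frames

# Fit a single linear map W minimizing ||XW - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Constrain the map to low rank by truncating its SVD; the abstract reports
# that a rank of just 100 already gives competitive conversion.
U, s, Vt = np.linalg.svd(W)
k = 100
W_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

converted = X @ W_k                   # converted features, ready for a vocoder
```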
@inproceedings{kamperlinearvc,title={LinearVC: Linear transformations of self-supervised features through the lens of voice conversion},author={Kamper, Herman and van Niekerk, Benjamin and Zaïdi, Julian and Carbonneau, Marc-André},booktitle={{Interspeech}},year={2025}}
SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting
Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. It first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss. Then we train a model to predict expression from monocular images using a novel semi-supervised scheme that relies on domain adaptation. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to novel identities.
@article{josi_2024,title={SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting},journal={arXiv},author={Josi, Arthur and Hafemann, Luiz Gustavo and Dib, Abdallah and Got, Emeline and Cruz, Rafael M. O. and Carbonneau, Marc-André},year={2025},}
Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control
Amin Fadaeinejad, Abdallah Dib, Luiz Gustavo Hafemann, Emeline Got, Trevor Anderson, Amaury Depierre, Nikolaus F. Troje, Marcus A. Brubaker, and Marc-André Carbonneau
In 6th AI for Creative Visual Content Generation, Editing and Understanding (CVEU) - CVPR, 2025
Fulfilling a precise artistic vision while creating realistic virtual human characters requires extensive manual effort. This paper proposes a novel approach to streamline this process, generating 3D head geometry and enabling precise control over skin tone and fine-grained modification of facial details such as wrinkles. User-specified modifications are conveniently propagated across the entire asset by our models, effectively reducing the amount of manual intervention needed to achieve a specific artistic vision. This is achieved by our proposed texture-generation pipeline that leverages correlations between texture and geometry for different head shapes, ethnicities, and genders. Our method allows for accurate skin-tone control while keeping the other appearance factors unchanged. Lastly, we introduce a method for fine-grained control over the details of the generated heads, which enables artists to freely modify one texture map and have changes cohesively propagated to the other maps. Our experiments show that our method produces diverse and well-behaved geometries, thanks to our GNN-based model, and synthesizes textures that are coherent with the geometry using a CNN-based GAN. We also achieve precise and intuitive skin-tone control through a single control parameter and obtain plausible textures for both face skin and lips. Our experiments with fine-grained editing on common artists’ tasks, such as adding wrinkles or removing a beard, showcase how our method simplifies the head generation workflow by cohesively propagating changes to all texture maps.
@inproceedings{Fadaeinejad2025,author={Fadaeinejad, Amin and Dib, Abdallah and Hafemann, {Luiz Gustavo} and Got, Emeline and Anderson, Trevor and Depierre, Amaury and Troje, Nikolaus F. and Brubaker, Marcus A. and Carbonneau, Marc-André},booktitle={6th AI for Creative Visual Content Generation, Editing and Understanding (CVEU) - CVPR},title={Geometry-Aware Texture Generation for 3D Head Modeling with Artist-driven Control},year={2025},}
2024
Spoken-Term Discovery using Discrete Speech Units
Discovering a lexicon from unlabeled audio is a longstanding challenge for zero-resource speech processing. One approach is to search for frequently occurring patterns in speech. We revisit this idea by proposing DUSTED: Discrete Unit Spoken-TErm Discovery. Leveraging self-supervised models, we encode input audio into sequences of discrete units. Inspired by alignment algorithms from bioinformatics, we find repeated speech patterns by searching for similar sub-sequences of units. Since discretization discards speaker information, DUSTED finds better matches across speakers, improving the coverage and consistency of the discovered patterns. We demonstrate these improvements on the ZeroSpeech Challenge, achieving state-of-the-art results on the spoken-term discovery track. Finally, we analyze the duration distribution of the patterns, showing that our method finds longer word- or phrase-like terms.
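To make the matching step concrete, here is a minimal Smith-Waterman-style local alignment over discrete-unit sequences, in the spirit of the bioinformatics algorithms the abstract mentions. The unit sequences and scoring values are toy assumptions, not the paper's configuration.

```python
def local_align(a, b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Smith-Waterman-style local alignment score between unit sequences."""
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# Toy unit sequences from two utterances; a high score flags a shared term.
utt1 = [3, 3, 7, 7, 12, 5, 9]
utt2 = [1, 3, 7, 12, 12, 5, 2]
print(local_align(utt1, utt2))
```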
@inproceedings{vanniekerk2024WD,author={{van Niekerk}, Benjamin and Zaïdi, Julian and Carbonneau, Marc-André and Kamper, Herman},booktitle={{Interspeech}},title={Spoken-Term Discovery using Discrete Speech Units},year={2024},}
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues
Vandad Davoodnia, Saeed Ghorbani, Marc-André Carbonneau, Alexandre Messier, and Ali Etemad
We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module refines predictions from a 2D keypoints estimator that operates on a single image by leveraging temporal and cross-view information. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D yields a performance rivaling methods that rely on 3D annotated data, while being the state-of-the-art among methods relying only on 2D supervision.
@inproceedings{Davoodnia2024,author={Davoodnia, Vandad and Ghorbani, Saeed and Carbonneau, Marc-André and Messier, Alexandre and Etemad, Ali},booktitle={ECCV},title={UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues},year={2024},}
MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading
Reconstructing an avatar from a portrait image has many applications in multimedia, but remains a challenging research problem. Extracting reflectance maps and geometry from one image is ill-posed: recovering geometry is a one-to-many mapping problem and reflectance and light are difficult to disentangle. Accurate geometry and reflectance can be captured under the controlled conditions of a light stage, but it is costly to acquire large datasets in this fashion. Moreover, training solely with this type of data leads to poor generalization with in-the-wild images. This motivates the introduction of MoSAR, a method for 3D avatar generation from monocular images. We propose a semi-supervised training scheme that improves generalization by learning from both light stage and in-the-wild datasets. This is achieved using a novel differentiable shading formulation. We show that our approach effectively disentangles the intrinsic face parameters, producing relightable avatars. As a result, MoSAR estimates a richer set of skin reflectance maps, and generates more realistic avatars than existing state-of-the-art methods.
@inproceedings{Dib2023,author={Dib, Abdallah and Hafemann, {Luiz Gustavo} and Got, Emeline and Anderson, Trevor and Fadaeinejad, Amin and Cruz, {Rafael M. O.} and Carbonneau, Marc-André},booktitle={{CVPR}},title={MoSAR: Monocular Semi-Supervised Model For Avatar Reconstruction Using Differentiable Shading},year={2024},}
BinaryAlign: Word Alignment as Binary Sequence Labeling
Gaëtan Lopez Latouche, Marc-André Carbonneau, and Ben Swanson
Real-world deployments of word alignment are almost certain to cover both high- and low-resource languages. However, the state-of-the-art for this task recommends a different model class depending on the availability of gold alignment training data for a particular language pair. We propose BinaryAlign, a novel word alignment technique based on binary sequence labeling that outperforms existing approaches in both scenarios, offering a unifying approach to the task. Additionally, we vary the specific choice of multilingual foundation model, perform stratified error analysis over alignment error type, and explore the performance of BinaryAlign on non-English language pairs. We make our source code publicly available.
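A toy sketch of the reformulation: each source word induces an independent binary labeling of the target sentence. The sentences, alignment pairs, and indices below are invented for illustration; the actual model is a fine-tuned multilingual encoder, not shown here.

```python
# Recasting word alignment as binary sequence labeling: for each source word,
# tag every target word as aligned (1) or not (0).
src = ["the", "black", "cat"]
tgt = ["le", "chat", "noir"]
gold = {(0, 0), (1, 2), (2, 1)}   # hypothetical (source, target) index pairs

for i, word in enumerate(src):
    labels = [1 if (i, j) in gold else 0 for j in range(len(tgt))]
    print(word, "->", labels)     # one binary labeling problem per source word
```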
@inproceedings{Lopez2024align,author={{Lopez Latouche}, Gaëtan and Carbonneau, Marc-André and Swanson, Ben},booktitle={ACL},title={BinaryAlign: Word Alignment as Binary Sequence Labeling},year={2024},}
Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection
Gaëtan Lopez Latouche, Marc-André Carbonneau, and Ben Swanson
Grammatical Error Detection (GED) methods rely heavily on human-annotated error corpora. However, these annotations are unavailable in many low-resource languages. In this paper, we investigate GED in this context. Leveraging the zero-shot cross-lingual transfer capabilities of multilingual pre-trained language models, we train a model using data from a diverse set of languages to generate synthetic errors in other languages. These synthetic error corpora are then used to train a GED model. Specifically, we propose a two-stage fine-tuning pipeline where the GED model is first fine-tuned on multilingual synthetic data from target languages followed by fine-tuning on human-annotated GED corpora from source languages. This approach outperforms current state-of-the-art annotation-free GED methods. We also analyse the errors produced by our method and other strong baselines, finding that our approach produces errors that are more diverse and more similar to human errors.
@inproceedings{Lopez2024ZCLT,author={{Lopez Latouche}, Gaëtan and Carbonneau, Marc-André and Swanson, Ben},booktitle={EMNLP},title={Zero-shot Cross-Lingual Transfer for Synthetic Data Generation in Grammatical Error Detection},year={2024},}
2023
The La Forge Speech Synthesis System for Blizzard Challenge 2023
Julian Zaïdi, Corentin Duchêne, Hugo Seuté, and Marc-André Carbonneau
In 18th Blizzard Challenge Workshop - Interspeech, 2023
This paper describes the La Forge entry to the Blizzard Challenge of 2023 focusing on text-to-speech in French and homograph disambiguation. Our system is based on VAE-Tacotron and HiFi-GAN. We implement several improvements on the baseline models, such as a cycle consistency loss for better style modeling, a style reference selection method to improve overall naturalness, and an over-produce-and-select method that chooses the best synthesized candidate across multiple variations using automatic speech recognition. We also build a linguistic frontend capable of homograph disambiguation using part-of-speech tagging and simple rules. We publicly release our hand-annotated data set for French homograph disambiguation. Results from subjective listening tests show the effectiveness of our system in disambiguating homographs and generating high-quality synthetic speech.
@inproceedings{zaidi_2023,title={The La Forge Speech Synthesis System for Blizzard Challenge 2023},booktitle={18th Blizzard Challenge Workshop - Interspeech},author={Zaïdi, Julian and Duchêne, Corentin and Seuté, Hugo and Carbonneau, Marc-André},year={2023},}
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveforms, which poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieve a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps, and reach state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also reveal a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data.
@inproceedings{zhu_2022,title={EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis},author={Zhu, Ge and Wen, Yutong and Carbonneau, Marc-Andr{\'e} and Duan, Zhiyao},booktitle={{NeurIPS Workshop: Machine Learning for Audio}},year={2023},}
Rhythm Modeling for Voice Conversion
Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
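As a simplified sketch of the coarse, speaking-rate variant: estimate source and target rates, derive a speed factor, and time-stretch the signal. The naive interpolation stretch and all numbers below are stand-ins, not the paper's segmentation or high-quality stretching.

```python
import numpy as np

def time_stretch(x, speed):
    """Naive time-stretch via linear interpolation; a crude stand-in for the
    higher-quality stretching used in the actual system."""
    n_out = int(round(len(x) / speed))
    t = np.linspace(0, len(x) - 1, n_out)
    return np.interp(t, np.arange(len(x)), x)

src_rate = 42 / 12.0           # hypothetical source: 42 segments in 12 s
tgt_rate = 4.5                 # hypothetical target rate (segments per second)
speed = tgt_rate / src_rate    # > 1: the source must be sped up to match

signal = np.sin(np.linspace(0, 200 * np.pi, 16000))  # placeholder audio
matched = time_stretch(signal, speed)
```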
@article{vanniekerk2023rhythm,author={{van Niekerk}, Benjamin and Carbonneau, Marc-André and Kamper, Herman},journal={IEEE Signal Processing Letters},title={Rhythm Modeling for Voice Conversion},year={2023},volume={30},pages={1297-1301},doi={10.1109/LSP.2023.3313515}}
ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech
We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning 19 different styles. Our code and data are publicly available at https://github.com/ubisoft/ubisoft-laforge-ZeroEGGS.
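The latent-space style control the abstract describes can be pictured in a couple of lines: given style embeddings produced by the learned encoder (random vectors stand in for them here), new styles come from blending or scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
z_happy = rng.normal(size=64)   # placeholder style embedding of one exemplar clip
z_tired = rng.normal(size=64)   # placeholder style embedding of another

z_blend = 0.7 * z_happy + 0.3 * z_tired   # blend two styles in latent space
z_strong = 1.5 * z_happy                  # exaggerate a style by scaling
```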
@article{ZeroEGGS,author={Ghorbani, Saeed and Ferstl, Ylva and Holden, Daniel and Troje, Nikolaus F. and Carbonneau, Marc-André},title={ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech},journal={Computer Graphics Forum},volume={42},number={1},pages={206-216},keywords={animation, gestures, character control, motion capture},doi={https://doi.org/10.1111/cgf.14734},url={https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14734},year={2023},}
2022
Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art for cross-speaker prosody transfer on any text. This is one of the most challenging, and rarely directly addressed, tasks in speech synthesis, especially for highly expressive data. Daft-Exprt uses FiLM conditioning layers to strategically inject different prosodic information in all parts of the architecture. The model explicitly encodes traditional low-level prosody features such as pitch, loudness and duration, but also higher-level prosodic information that helps generate convincing voices in highly expressive styles. Speaker identity and prosodic information are disentangled through an adversarial training strategy that enables accurate prosody transfer across speakers. Experimental results show that Daft-Exprt significantly outperforms strong baselines on inter-text cross-speaker prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that the model discards speaker identity information from the prosody representation, and consistently generates speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
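For readers unfamiliar with FiLM, the sketch below shows the feature-wise affine modulation it performs; the dimensions and the projection are placeholders, not the model's actual layers.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 32))          # hidden activations (time x channels)
prosody = rng.normal(size=16)          # hypothetical prosody embedding

W = 0.1 * rng.normal(size=(16, 64))    # projects conditioning to (gamma, beta)
gamma, beta = np.split(prosody @ W, 2) # one scale and shift per channel
out = gamma * h + beta                 # FiLM: feature-wise affine modulation
```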
@inproceedings{zaidi_2022,title={Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis},booktitle={Interspeech},author={Zaïdi, Julian and Seuté, Hugo and {van Niekerk}, Benjamin and Carbonneau, Marc-André},year={2022},}
Measuring Disentanglement: A Review of Metrics
Marc-André Carbonneau, Julian Zaïdi, Jonathan Boilard, and Ghyslain Gagnon
IEEE Transactions on Neural Networks and Learning Systems, 2022
Learning to disentangle and represent factors of variation in data is an important problem in AI. While many advances have been made to learn these representations, it is still unclear how to quantify disentanglement. While several metrics exist, little is known about their implicit assumptions, what they truly measure, and their limits. Consequently, it is difficult to interpret results when comparing different representations. In this work, we survey supervised disentanglement metrics and thoroughly analyze them. We propose a new taxonomy in which all metrics fall into one of three families: intervention-based, predictor-based and information-based. We conduct extensive experiments in which we isolate properties of disentangled representations, allowing stratified comparison along several axes. From our experimental results and analysis, we provide insights into the relations between disentangled representation properties. Finally, we share guidelines on how to measure disentanglement.
@article{Carbonneau2022,author={Carbonneau, Marc-André and Zaïdi, Julian and Boilard, Jonathan and Gagnon, Ghyslain},journal={IEEE Transactions on Neural Networks and Learning Systems},title={Measuring Disentanglement: A Review of Metrics},year={2022},doi={10.1109/TNNLS.2022.3218982},}
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content, leading to mispronunciations. As a solution, we propose soft speech units. To learn soft units, we predict a distribution over discrete speech units. By modeling uncertainty, soft units capture more content information, improving the intelligibility and naturalness of converted speech.
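A minimal sketch of the contrast between the two unit types, using a random codebook and frame as placeholders: a discrete unit snaps a frame to its nearest codebook entry, while a soft unit keeps a distribution over entries (in the paper, predicted by a trained network) and takes its expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(100, 256))   # 100 hypothetical discrete units
frame = rng.normal(size=256)             # one self-supervised feature frame

d = np.linalg.norm(codebook - frame, axis=1)   # distance to each unit
discrete_unit = codebook[np.argmin(d)]         # hard (discrete) assignment

p = np.exp(-d - np.max(-d))
p /= p.sum()                                   # softmax: closer => more probable
soft_unit = p @ codebook                       # expectation over units: soft unit
```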
@inproceedings{van_niekerk_comparaison_2022,title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion},isbn={978-1-66540-540-9},booktitle={{ICASSP}},author={{van Niekerk}, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},year={2022},pages={6562--6566},}
Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022
We present our entry to the GENEA Challenge of 2022 on data-driven co-speech gesture generation. Our system is a neural network that generates gesture animation from an input audio file. The motion style generated by the model is extracted from an exemplar motion clip. Style is embedded in a latent space using a variational framework. This architecture allows for generating gestures in styles unseen during training. Moreover, the probabilistic nature of our variational framework enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. The GENEA challenge evaluation showed that our model produces full-body motion with highly competitive levels of human-likeness.
@inproceedings{ghorbani_exemplar-based_2022,title={Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022},isbn={978-1-4503-9390-4},url={https://dl.acm.org/doi/10.1145/3536221.3558068},doi={10.1145/3536221.3558068},booktitle={International Conference on Multimodal Interaction},publisher={ACM},author={Ghorbani, Saeed and Ferstl, Ylva and Carbonneau, Marc-André},year={2022},pages={778--783},}
2021
Energy Disaggregation using Variational Autoencoders
Antoine Langevin, Marc-André Carbonneau, Mohamed Cheriet, and Ghyslain Gagnon
Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization capability of these methods to different houses, as well as the disaggregation of multi-state appliances, are still major challenges. In this paper, we address these issues and propose an energy disaggregation approach based on the variational autoencoder framework. The probabilistic encoder makes this approach an efficient model for encoding information relevant to the reconstruction of the target appliance consumption. In particular, the proposed model accurately generates more complex load profiles, thus improving the power signal reconstruction of multi-state appliances. Moreover, its regularized latent space improves the generalization capabilities of the model across different houses. The proposed model is compared to state-of-the-art NILM approaches on the UK-DALE and REFIT datasets, and yields competitive results. The mean absolute error is reduced by 18% on average across all appliances compared to the state-of-the-art. The F1-score increases by more than 11%, showing improvements in the detection of the target appliance in the aggregate measurement.
@article{langevin_energy_2021,title={Energy Disaggregation using Variational Autoencoders},volume={254},issn={03787788},doi={10.1016/j.enbuild.2021.111623},journal={Energy \& Buildings},author={Langevin, Antoine and Carbonneau, Marc-André and Cheriet, Mohamed and Gagnon, Ghyslain},year={2021},pages={111623},}
Artist guided generation of video game production quality face textures
Christian Murphy, Sudhir Mudur, Daniel Holden, Marc-André Carbonneau, Donya Ghafourzadeh, and Andre Beauchamp
We develop a high resolution face texture generation system which uses artist provided appearance controls as the conditions for a generative network. Artists are able to control various elements in the generated textures, such as the skin, eye, lip, and hair color. This is made possible by reparameterizing our dataset to the same UV mapping, allowing us to utilize image-to-image translation networks. Although our dataset is limited in size, only 126 samples in total, our system is still able to generate realistic face textures which strongly adhere to the input appearance attribute conditions because of our training augmentation methods. Once our system has generated the face texture, it is ready to be used in a modern game production environment. Thanks to our novel SuperResolution and material property recovery methods, our generated face textures are 4K resolution and have the associated material property maps required for raytraced rendering.
@article{MURPHY2021268,title={Artist guided generation of video game production quality face textures},journal={Computers & Graphics},volume={98},pages={268-279},year={2021},issn={0097-8493},doi={https://doi.org/10.1016/j.cag.2021.06.004},url={https://www.sciencedirect.com/science/article/pii/S0097849321001199},author={Murphy, Christian and Mudur, Sudhir and Holden, Daniel and Carbonneau, Marc-André and Ghafourzadeh, Donya and Beauchamp, Andre},keywords={Face texture generation, Artist guided, SuperResolution, BRDF recovery},}
2020
Feature Learning from Spectrograms for Assessment of Personality Traits
Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand-crafted speech processing features extracted with off-the-shelf toolboxes. To achieve high accuracy, numerous features are typically extracted using complex and highly parameterized algorithms. In this paper, a new method based on feature learning and spectrogram analysis is proposed to simplify the feature extraction process while maintaining a high level of accuracy. The proposed method learns a dictionary of discriminant features from patches extracted from the spectrogram representations of training speech segments. Each speech segment is then encoded using the dictionary, and the resulting feature set is used to perform classification of personality traits. Experiments indicate that the proposed method achieves state-of-the-art results with a significant reduction in complexity when compared to the most recent reference methods. The number of features and the difficulties linked to the feature extraction process are greatly reduced, as only one type of descriptor is used, for which the 6 parameters can be tuned automatically. In contrast, the simplest reference method uses 4 types of descriptors to which 6 functionals are applied, resulting in over 20 parameters to be tuned.
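A compact sketch of this kind of pipeline: sample spectrogram patches, learn a dictionary (k-means stands in for the paper's learner), and encode a segment as a histogram of nearest atoms. The spectrogram, patch size, and dictionary size are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
spectrogram = rng.random((128, 400))   # placeholder for a real spectrogram

def sample_patches(S, size=8, n=500):
    """Extract n random square patches and flatten them into vectors."""
    ys = rng.integers(0, S.shape[0] - size, n)
    xs = rng.integers(0, S.shape[1] - size, n)
    return np.stack([S[y:y + size, x:x + size].ravel() for y, x in zip(ys, xs)])

# Learn a 64-atom dictionary from training patches.
dictionary = KMeans(n_clusters=64, n_init=3, random_state=0).fit(sample_patches(spectrogram))

# Encode a speech segment as a normalized histogram of nearest atoms.
codes = dictionary.predict(sample_patches(spectrogram, n=200))
feature = np.bincount(codes, minlength=64) / len(codes)  # input to a classifier
```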
@article{Carbonneau2020_personality,author={Carbonneau, Marc-André and Granger, Eric and Attabi, Yazid and Gagnon, Ghyslain},journal={IEEE Transactions on Affective Computing},title={Feature Learning from Spectrograms for Assessment of Personality Traits},year={2020},volume={11},number={1},pages={25-31},doi={10.1109/TAFFC.2017.2763132}}
Appearance Controlled Face Texture Generation for Video Game Characters
Christian Murphy, Sudhir Mudur, Daniel Holden, Marc-André Carbonneau, Donya Ghafourzadeh, and Andre Beauchamp
In 13th ACM SIGGRAPH Conference on Motion, Interaction and Games, 2020
Manually creating realistic, digital human heads is a difficult and time-consuming task for artists. While 3D scanners and photogrammetry allow for quick and automatic reconstruction of heads, finding an actor who fits specific character appearance descriptions can be difficult. Moreover, modern open-world video games feature several thousands of characters that cannot realistically all be cast and scanned. Therefore, researchers are investigating generative models to create heads fitting a specific character appearance description. While current methods are able to generate believable head shapes quite well, generating a corresponding high-resolution and high-quality texture which respects the character’s appearance description is not possible using current state-of-the-art methods. This work presents a method that generates synthetic face textures under the following constraints: (i) there is no reference photograph to build the texture, (ii) game artists control the generative process by providing precise appearance attributes, the face shape, and the character’s age and gender, and (iii) the texture must be of adequately high resolution and look believable when applied to the given face shape. Our method builds upon earlier deep learning approaches addressing similar problems. We propose several key additions to these methods to be able to use them in our context, specifically for artist control and small training data. Despite a limited amount of training data, just over 100 samples, our model produces realistic textures which comply with a diverse range of skin, hair, lip and iris colors specified through our intuitive description format and augmentation thereof.
@inproceedings{10.1145/3424636.3426898,author={Murphy, Christian and Mudur, Sudhir and Holden, Daniel and Carbonneau, Marc-Andr\'{e} and Ghafourzadeh, Donya and Beauchamp, Andre},title={Appearance Controlled Face Texture Generation for Video Game Characters},year={2020},isbn={9781450381710},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3424636.3426898},doi={10.1145/3424636.3426898},booktitle={13th ACM SIGGRAPH Conference on Motion, Interaction and Games},articleno={9},numpages={11},keywords={artist controlled character creation, image-to-image translation, face texture generation, fine facial features},location={Virtual Event, SC, USA},series={MIG '20},}
2019
Bag-Level Aggregation for Multiple-Instance Active Learning in Instance Classification Problems
A growing number of applications, e.g., video surveillance and medical image analysis, require training recognition systems from large amounts of weakly annotated data, while some targeted interactions with a domain expert are allowed to improve the training process. In such cases, active learning (AL) can reduce labeling costs for training a classifier by querying the expert to provide the labels of most informative instances. This paper focuses on AL methods for instance classification problems in multiple instance learning (MIL), where data are arranged into sets, called bags, which are weakly labeled. Most AL methods focus on single-instance learning problems. These methods are not suitable for MIL problems because they cannot account for the bag structure of data. In this paper, new methods for bag-level aggregation of instance informativeness are proposed for multiple instance AL (MIAL). The aggregated informativeness method identifies the most informative instances based on classifier uncertainty and queries bags incorporating the most information. The other proposed method, called cluster-based aggregative sampling, clusters data hierarchically in the instance space. The informativeness of instances is assessed by considering bag labels, inferred instance labels, and the proportion of labels that remain to be discovered in clusters. Both proposed methods significantly outperform reference methods in extensive experiments using benchmark data from several application domains. Results indicate that using an appropriate strategy to address MIAL problems yields a significant reduction in the number of queries needed to achieve the same level of performance as single-instance AL methods.
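As a small illustration of bag-level aggregation of informativeness: score each instance by classifier uncertainty and query the bag with the largest aggregate score. The probabilities and the sum-aggregation rule here are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
bags = [rng.random(n) for n in (5, 8, 3)]   # per-instance P(positive), one array per bag

def informativeness(p):
    """Uncertainty-based score: 1 at p=0.5, 0 at p=0 or p=1."""
    return 1 - 2 * np.abs(p - 0.5)

# Query the bag whose instances carry the most aggregate information.
query = max(range(len(bags)), key=lambda i: informativeness(bags[i]).sum())
print(f"query bag {query}")
```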
@article{carbonneau_bag-level_2019,author={Carbonneau, Marc-André and Granger, Eric and Gagnon, Ghyslain},journal={IEEE Transactions on Neural Networks and Learning Systems},title={Bag-Level Aggregation for Multiple-Instance Active Learning in Instance Classification Problems},year={2019},volume={30},number={5},pages={1441-1451},doi={10.1109/TNNLS.2018.2869164}}
2018
Multiple instance learning: A survey of problem characteristics and applications
Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research.
@article{Carbonneau2016Survey,title={Multiple instance learning: A survey of problem characteristics and applications},journal={Pattern Recognition},volume={77},pages={329-353},year={2018},issn={0031-3203},doi={https://doi.org/10.1016/j.patcog.2017.10.009},url={https://www.sciencedirect.com/science/article/pii/S0031320317304065},author={Carbonneau, Marc-André and Cheplygina, Veronika and Granger, Eric and Gagnon, Ghyslain},keywords={Multiple instance learning, Weakly supervised learning, Classification, Multi-instance learning, Computer vision, Computer aided diagnosis, Document classification, Drug activity prediction},}
2016
Score thresholding for accurate instance classification in multiple instance learning
Multiple instance learning (MIL) is a form of weakly supervised learning for problems in which training instances are arranged into bags, and a label is provided for whole bags but not for individual instances. Most proposed MIL algorithms focus on bag classification, but more recently, the classification of individual instances has attracted the attention of the pattern recognition community. While these two tasks are similar, there are important differences in the consequences of instance misclassification. In this paper, the scoring function learned by MIL classifiers for the bag classification task is exploited for instance classification by adjusting the decision threshold. A new criterion for the threshold adjustment is proposed and validated using 7 reference MIL algorithms on 3 real-world data sets from different application domains. Experiments show considerable improvements in accuracy over these algorithms for instance classification. In some applications, the unweighted average recall increases by as much as 18%, while the F1-score increases by 12%.
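A minimal sketch of the general idea, with synthetic scores: reuse the bag classifier's instance scores and sweep the decision threshold on validation data. F1 is used as the selection criterion here purely for illustration; the paper proposes its own criterion.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
scores = rng.random(1000)                                  # instance scores from a MIL model
labels = (scores + rng.normal(0, 0.3, size=1000)) > 0.7    # synthetic ground-truth labels

# Sweep candidate thresholds and keep the one maximizing the criterion.
best_t, best_f1 = max(
    ((t, f1_score(labels, scores > t)) for t in np.linspace(0.05, 0.95, 19)),
    key=lambda pair: pair[1],
)
print(f"threshold={best_t:.2f}, F1={best_f1:.3f}")
```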
@inproceedings{carbonneau_ipta2016,author={Carbonneau, Marc-André and Granger, Eric and Gagnon, Ghyslain},booktitle={2016 Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA)},title={Score thresholding for accurate instance classification in multiple instance learning},year={2016},doi={10.1109/IPTA.2016.7821026}}
Robust multiple-instance learning ensembles using random subspace instance selection
Many real-world pattern recognition problems can be modeled using multiple-instance learning (MIL), where instances are grouped into bags, and each bag is assigned a label. State-of-the-art MIL methods provide a high level of performance when strong assumptions are made regarding the underlying data distributions, and the proportion of positive to negative instances in positive bags. In this paper, a new method called Random Subspace Instance Selection (RSIS) is proposed for the robust design of MIL ensembles without any prior assumptions on the data structure and the proportion of instances in bags. First, instance selection probabilities are computed based on training data clustered in random subspaces. A pool of classifiers is then generated using the training subsets created with these selection probabilities. By using RSIS, MIL ensembles are more robust to many data distributions and noise, and are not adversely affected by the proportion of positive instances in positive bags because training instances are repeatedly selected in a probabilistic manner. Moreover, RSIS also allows the identification of positive instances on an individual basis, as required in many practical applications. Results obtained with several real-world and synthetic databases show the robustness of MIL ensembles designed with the proposed RSIS method over a range of witness rates, noisy features and data distributions compared to reference methods in the literature.
@article{CARBONNEAU_RSIS,title={Robust multiple-instance learning ensembles using random subspace instance selection},journal={Pattern Recognition},volume={58},pages={83-99},year={2016},issn={0031-3203},doi={https://doi.org/10.1016/j.patcog.2016.03.035},url={https://www.sciencedirect.com/science/article/pii/S0031320316300346},author={Carbonneau, Marc-André and Granger, Eric and Raymond, Alexandre J. and Gagnon, Ghyslain},keywords={Multiple-instance learning, Random subspace methods, Classifier ensembles, Instance selection, Weakly supervised learning, Classification, MIL},}
Witness identification in multiple instance learning using random subspaces
Multiple instance learning (MIL) is a form of weakly-supervised learning where instances are organized in bags. A label is provided for bags, but not for instances. MIL literature typically focuses on the classification of bags seen as one object, or as a combination of their instances. In both cases, performance is generally measured using labels assigned to entire bags. In this paper, the MIL problem is formulated as a knowledge discovery task for which algorithms seek to discover the witnesses (i.e. identifying positive instances), using the weak supervision provided by bag labels. Some MIL methods are suitable for instance classification, but perform poorly in applications where the witness rate is low, or when the positive class distribution is multimodal. A new method that clusters data projected in random subspaces is proposed to perform witness identification in these adverse settings. The proposed method is assessed on MIL data sets from three application domains, and compared to 7 reference MIL algorithms for the witness identification task. The proposed algorithm consistently ranks among the best methods in all experiments, while all other methods perform unevenly across data sets.
@inproceedings{Carbonneau_icpr,author={Carbonneau, Marc-André and Granger, Eric and Gagnon, Ghyslain},booktitle={2016 23rd International Conference on Pattern Recognition (ICPR)},title={Witness identification in multiple instance learning using random subspaces},year={2016},pages={3639-3644},doi={10.1109/ICPR.2016.7900199}}
2015
Real-time visual play-break detection in sport events using a context descriptor
The detection of play and break segments in team sports is an essential step towards the automation of live game capture and broadcast. This paper presents a two-stage hierarchical method for play-break detection in non-edited video feeds of sport events. Unlike most existing methods, this algorithm performs action and event recognition on content, and thus does not rely on production cues of broadcast feeds. Moreover, the method does not require player tracking, can be used in real-time, and can be easily adapted to different sports. In the first stage, bag-of-words event detectors are trained to recognize key events such as line changes, face-offs and preliminary play-breaks. In the second stage, the output of the detectors along with a novel feature based on spatio-temporal interest points are used to create a context descriptor for the final decision. Experiments demonstrate the efficiency of the proposed method on real hockey game footage, achieving 90% accuracy.
@inproceedings{Carbonneau2015_hockey,title={Real-time visual play-break detection in sport events using a context descriptor},doi={10.1109/ISCAS.2015.7169270},booktitle={2015 {IEEE} {International} {Symposium} on {Circuits} and {Systems} ({ISCAS})},author={Carbonneau, Marc-André and Raymond, Alexandre J. and Granger, Eric and Gagnon, Ghyslain},month=may,year={2015},pages={2808--2811}}
2013
Detection of alarms and warning signals on a digital in-ear device
Marc-André Carbonneau, Narimène Lezzoum, Jérémie Voix, and Ghyslain Gagnon
International Journal of Industrial Ergonomics, May 2013
A majority of workers in industrial environments must wear hearing protection devices. While these hearing protectors provide increased safety in terms of auditory health, in some conditions they also have the adverse effect of preventing individuals from hearing alarm and warning signals, which seriously compromises their safety. Recent advances in the field of microelectronics allow the integration of tiny digital signal processors inside hearing protection devices. This paper develops new algorithms to automatically detect alarm signals in the digitized audio stream fed to the processor. This detection is performed in real-time with low latency to quickly inform the user of a dangerous situation. The algorithms were also optimized to require low computational resources due to the limited processing power of typical embedded electronic devices. The proposed algorithms detect periodicity of the signal amplitude in a determined frequency bandwidth. The system was simulated with a database of alarm signals from a major North-American manufacturer of industrial alarms and warning signals, mixed with typical environmental noises at signal-to-noise ratios ranging from 0 to 15 dBA. The results show an average true-positive recognition rate of 95% for pulsed alarms compliant with the ISO 7331 standard. The system can be optimized for specific alarms, which results in near-100% true-positive and 0.2% false-positive recognition rates.
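The detection principle can be sketched in a few lines: band-limit the signal, compute an amplitude envelope, and test the envelope for periodicity via autocorrelation. The filter band, rates, and the toy alarm below are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np
from scipy.signal import butter, sosfilt

fs = 16000
t = np.arange(0, 3.0, 1 / fs)                       # 3 s of toy audio
alarm = np.sin(2 * np.pi * 2000 * t) * (np.sin(2 * np.pi * 2 * t) > 0)  # 2 Hz pulses

# Band-limit around the alarm's carrier, then take a ~100 Hz amplitude envelope.
sos = butter(4, [1500, 2500], btype="bandpass", fs=fs, output="sos")
env = np.abs(sosfilt(sos, alarm)).reshape(-1, fs // 100).mean(axis=1)
env = env - env.mean()

# Autocorrelation of the envelope reveals the pulsing period.
ac = np.correlate(env, env, mode="full")[len(env) - 1:]
lag = 10 + int(np.argmax(ac[10:]))                  # ignore the zero-lag peak
print(f"estimated pulse rate: {100 / lag:.1f} Hz")  # ~2 Hz for this toy alarm
```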
@article{CARBONNEAU_ALARM,title={Detection of alarms and warning signals on a digital in-ear device},journal={International Journal of Industrial Ergonomics},volume={43},number={6},pages={503-511},year={2013},note={Noise: Assessment \& Control},issn={0169-8141},doi={https://doi.org/10.1016/j.ergon.2012.07.001},url={https://www.sciencedirect.com/science/article/pii/S0169814112000625},author={Carbonneau, Marc-André and Lezzoum, Narimène and Voix, Jérémie and Gagnon, Ghyslain},keywords={Pattern recognition, Digital signal processing, Hearing protection devices, Industrial worker safety},}
Recognition of blowing sound types for real-time implementation in mobile devices
Marc-André Carbonneau, Ghyslain Gagnon, Robert Sabourin, and Jean Dubois
In 2013 IEEE 11th International New Circuits and Systems Conference (NEWCAS), May 2013
This paper presents a system to recognize and classify sounds produced by human subjects blowing air by the mouth. The objective is to implement the system for fast recognition using low-complexity algorithms in a low-budget processor. Recognition is achieved using tailored band energy ratios, a modified frequency centroid, and a periodicity test based on spectrum autocorrelation. These lightweight feature extraction techniques are adapted to the particular task of recognizing blowing sound types. The classification is achieved by a naive Bayes classifier. The algorithm can be implemented in real-time (latency ≤ 100 ms), and experimental test results show average recognition rates over 94%.
@inproceedings{blow,author={Carbonneau, Marc-André and Gagnon, Ghyslain and Sabourin, Robert and Dubois, Jean},booktitle={2013 IEEE 11th International New Circuits and Systems Conference (NEWCAS)},title={Recognition of blowing sound types for real-time implementation in mobile devices},year={2013},pages={1-4},doi={10.1109/NEWCAS.2013.6573655}}