Journal article

Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi 1, 2 Xavier Alameda-Pineda 1, 3
1 PERCEPTION - Interpretation and Modelling of Images and Videos
Inria Grenoble - Rhône-Alpes, LJK - Laboratoire Jean Kuntzmann, Grenoble INP - Institut polytechnique de Grenoble - Grenoble Institute of Technology
2 MULTISPEECH - Speech Modeling for Facilitating Oral-Based Communication
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract: This paper addresses unsupervised (unknown-noise) speech enhancement using latent variable generative models. We propose to learn a generative model of clean speech spectrograms based on a variational autoencoder (VAE), in which a mixture of audio and visual networks infers the posterior of the latent variables. The motivation is that visual data, i.e., lip images of the speaker, provide helpful and complementary information about speech, so they can help train a richer inference network in which audio and visual information are fused. Moreover, during speech enhancement, visual data are used to initialize the latent variables, which gives a more robust initialization than the noisy speech spectrogram. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE outperforms the standard audio-only counterpart on speech enhancement.
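The idea of a posterior inferred by a mixture of audio and visual networks can be illustrated with a toy sketch. It is not the authors' implementation: the two encoders are hypothetical placeholder functions standing in for trained networks, and the fixed mixture weight `pi_audio`, the two-dimensional latent space, and all function names are assumptions made only for this example.

```python
import math
import random


def audio_encoder(spec_frame):
    """Hypothetical audio inference network: returns (mean, log_var) of q_a(z | s)."""
    m = sum(spec_frame) / len(spec_frame)
    return [m, -m], [0.0, 0.0]


def visual_encoder(lip_feat):
    """Hypothetical visual inference network: returns (mean, log_var) of q_v(z | v)."""
    m = sum(lip_feat) / len(lip_feat)
    return [0.5 * m, 0.5 * m], [-1.0, -1.0]


def sample_mixture_posterior(spec_frame, lip_feat, pi_audio, rng):
    """Draw z from pi_audio * q_a(z | s) + (1 - pi_audio) * q_v(z | v)."""
    # Pick one mixture component, then sample from its diagonal Gaussian.
    use_audio = rng.random() < pi_audio
    if use_audio:
        mean, log_var = audio_encoder(spec_frame)
    else:
        mean, log_var = visual_encoder(lip_feat)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mean, log_var)]
    return z, use_audio


def init_latent_from_visual(lip_feat):
    """Initialize the latent vector from visual data only (posterior mean)."""
    mean, _ = visual_encoder(lip_feat)
    return mean


rng = random.Random(0)
z, used_audio = sample_mixture_posterior([1.0, 2.0, 3.0], [0.2, 0.4], 0.5, rng)
z0 = init_latent_from_visual([0.2, 0.4])
```

Here `init_latent_from_visual` mirrors the abstract's point that, at enhancement time, the latent variables can be initialized from the lip images rather than from the noisy spectrogram.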

https://hal.inria.fr/hal-02926172
Contributor: Xavier Alameda-Pineda
Submitted on: Wednesday, January 26, 2022 - 11:41:26
Last modified on: Wednesday, May 4, 2022 - 11:58:03

File

main.pdf
Files produced by the author(s)

Citation

Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, Institute of Electrical and Electronics Engineers, 2021, 69, pp.1899-1909. ⟨10.1109/TSP.2021.3066038⟩. ⟨hal-02926172v2⟩

Metrics

Record views: 144

File downloads: 169