Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement

Mostafa Sadeghi; Xavier Alameda-Pineda

doi:10.1109/TSP.2021.3066038

Journal article, IEEE Transactions on Signal Processing, Year: 2021


Mostafa Sadeghi

Role: Author
PersonId: 752828
IdHAL: msadeghi
ORCID: 0000-0002-0272-8017

Interpretation and Modelling of Images and Videos

Speech Modeling for Facilitating Oral-Based Communication

Xavier Alameda-Pineda

Role: Author
PersonId: 16186
IdHAL: xavier-alameda-pineda
ORCID: 0000-0002-5354-1084
IdRef: 18450919X

Interpretation and Modelling of Images and Videos

Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (Towards socially intelligent robots through learning, perception, and control)

Abstract

In this paper, we address unsupervised speech enhancement, i.e. enhancement under unknown noise, using latent variable generative models. We propose to learn a generative model of clean-speech spectrograms based on a variational autoencoder (VAE), in which a mixture of audio and visual networks is used to infer the posterior of the latent variables. This is motivated by the fact that visual data, i.e. lip images of the speaker, provide helpful and complementary information about speech. As such, they can help train a richer inference network in which the audio and visual information are fused. Moreover, during speech enhancement, visual data are used to initialize the latent variables, which provides a more robust initialization than the noisy speech spectrogram does. A variational inference approach is derived to train the proposed VAE. Thanks to the novel inference procedure and the robust initialization, the proposed audio-visual VAE outperforms the standard audio-only counterpart on speech enhancement.
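To make the idea of a mixture of inference networks concrete, here is a minimal sketch of sampling a latent vector from a two-component posterior of the form alpha * q_a(z | audio) + (1 - alpha) * q_v(z | visual). It is an illustration under stated assumptions, not the authors' implementation: the linear "encoders", the fixed mixture weight `alpha`, the dimensions, and all function names are hypothetical, whereas the paper uses trained neural networks and derives its own variational inference procedure.

```python
import math
import random


def linear_gaussian(x, w_mu, w_logvar):
    """Toy linear 'encoder' (assumption): mean and log-variance vectors for z."""
    mu = [sum(xi * wi for xi, wi in zip(x, row)) for row in w_mu]
    logvar = [sum(xi * wi for xi, wi in zip(x, row)) for row in w_logvar]
    return mu, logvar


def sample_mixture_posterior(audio, visual, weights, alpha, rng):
    """Draw z from alpha * q_a(z | audio) + (1 - alpha) * q_v(z | visual)."""
    # Pick the mixture component: audio branch with probability alpha.
    use_audio = rng.random() < alpha
    if use_audio:
        mu, logvar = linear_gaussian(audio, *weights["audio"])
    else:
        mu, logvar = linear_gaussian(visual, *weights["visual"])
    # Reparameterised sample from the selected diagonal Gaussian.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    return z, use_audio


def random_weights(rng, in_dim, latent_dim, scale=0.1):
    """Hypothetical stand-in for trained encoder weights."""
    def make():
        return [[scale * rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                for _ in range(latent_dim)]
    return make(), make()


rng = random.Random(0)
audio_dim, visual_dim, latent_dim = 8, 6, 3  # illustrative sizes only
weights = {
    "audio": random_weights(rng, audio_dim, latent_dim),
    "visual": random_weights(rng, visual_dim, latent_dim),
}
audio_frame = [rng.gauss(0.0, 1.0) for _ in range(audio_dim)]
visual_frame = [rng.gauss(0.0, 1.0) for _ in range(visual_dim)]

z, used_audio = sample_mixture_posterior(
    audio_frame, visual_frame, weights, alpha=0.5, rng=rng
)
print(len(z), used_audio)
```

In this sketch, setting `alpha=0.0` always selects the visual branch, which loosely mirrors the abstract's point that the latent variables can be initialized from visual data alone when the audio is noisy.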

Keywords

Audio-visual speech enhancement, generative models, variational auto-encoder, mixture model

Domains

Computer Vision and Pattern Recognition [cs.CV]; Signal and Image Processing [eess.SP]; Machine Learning [cs.LG]; Sound [cs.SD]

Main file

main.pdf (4.4 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-02926172

Submitted on: Wednesday, January 26, 2022, 11:41:26

Last modified on: Thursday, April 4, 2024, 21:13:21

Dates and versions

hal-02926172, version 1 (09-03-2021)

hal-02926172, version 2 (26-01-2022)

Identifiers

HAL Id: hal-02926172, version 2
arXiv: 1912.10647
DOI: 10.1109/TSP.2021.3066038

Cite

Mostafa Sadeghi, Xavier Alameda-Pineda. Mixture of Inference Networks for VAE-based Audio-visual Speech Enhancement. IEEE Transactions on Signal Processing, 2021, 69, pp.1899-1909. ⟨10.1109/TSP.2021.3066038⟩. ⟨hal-02926172v2⟩

Collections

UNIV-RENNES1 UGA CNRS INRIA IRISA LJK LJK_GI LJK_GI_PERCEPTION UNIV-LORRAINE INRIA2 LORIA LORIA-NLPKD UR1-MATH-STIC UR1-UFR-ISTIC UNIV-RENNES MIAI ANR UR1-MATH-NUM LJK-GI-ROBOTLEARN
