Conference paper · Year: 2020

Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders

Abstract

Recently, an audio-visual speech generative model based on a variational autoencoder (VAE) was proposed, combined with a nonnegative matrix factorization (NMF) model of the noise variance to perform unsupervised speech enhancement. When the visual data are clean, speech enhancement with the audio-visual VAE outperforms the audio-only VAE, which is trained on audio data alone. However, the audio-visual VAE is not robust to noisy visual data, e.g., video frames in which the speaker's face is not frontal or the lip region is occluded. In this paper, we propose a robust unsupervised audio-visual speech enhancement method based on a per-frame mixture of VAEs, consisting of a trained audio-only VAE and a trained audio-visual VAE. The motivation is to skip noisy visual frames by switching to the audio-only VAE. We present a variational expectation-maximization method to estimate the parameters of the model. Experiments show the promising performance of the proposed method.
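Although the full derivation is in the paper, the per-frame switching idea can be illustrated with a short sketch. The Python snippet below is a minimal illustration under stated assumptions, not the authors' code: the function name, the use of per-frame ELBOs as surrogates for the intractable likelihoods, and the uniform component prior are all hypothetical. It computes, for each time frame, the posterior responsibility of the audio-only versus the audio-visual VAE; frames whose visual input is unreliable yield a low audio-visual ELBO, so the mixture falls back to the audio-only model there.

import numpy as np

def frame_responsibilities(elbo_a, elbo_av, log_prior_a=np.log(0.5)):
    # E-step sketch: per-frame posterior probability of each mixture
    # component, using the ELBO of each pretrained VAE as a surrogate
    # for its marginal likelihood (an assumption for illustration).
    # elbo_a, elbo_av: arrays of shape (T,) with the per-frame ELBOs
    # of the audio-only and the audio-visual VAE, respectively.
    log_prior_av = np.log1p(-np.exp(log_prior_a))
    log_post = np.stack([elbo_a + log_prior_a,
                         elbo_av + log_prior_av], axis=-1)   # (T, 2)
    log_post -= log_post.max(axis=-1, keepdims=True)         # stability
    post = np.exp(log_post)
    return post / post.sum(axis=-1, keepdims=True)

# Toy example: frames 2 and 3 have occluded lips, so the audio-visual
# ELBO drops there and the responsibilities switch to the audio-only VAE.
elbo_a  = np.array([-10.0, -10.0,  -9.0,  -9.5, -10.0])
elbo_av = np.array([ -8.0,  -8.5, -15.0, -14.0,  -8.2])
print(frame_responsibilities(elbo_a, elbo_av))

In the paper's actual variational EM, such responsibilities would be re-estimated alternately with the NMF noise parameters; the sketch only shows why a per-frame mixture degrades gracefully when some visual frames are corrupted.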
Main file: mix_vae_conf_v2.pdf (249.18 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-02534911, version 1 (07-04-2020)

Identifiers

Cite

Mostafa Sadeghi, Xavier Alameda-Pineda. Robust Unsupervised Audio-visual Speech Enhancement Using a Mixture of Variational Autoencoders. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, Barcelona, Spain. pp.7534-7538, ⟨10.1109/ICASSP40776.2020.9053730⟩. ⟨hal-02534911⟩