Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition

Marah Halawa; Florian Blume; Pia Bideau; Martin Maier; Rasha Abdel Rahman; Olaf Hellwich

doi:10.1109/CVPRW63382.2024.00463

Conference paper. Year: 2024

Marah Halawa

Role: Author

Technical University of Berlin / Technische Universität Berlin

Florian Blume

Role: Author

Technical University of Berlin / Technische Universität Berlin

Pia Bideau

Role: Author

Apprentissage de modèles à partir de données massives

Martin Maier

Role: Author
PersonId: 1439976

Humboldt-Universität zu Berlin = Humboldt University of Berlin = Université Humboldt de Berlin

Rasha Abdel Rahman

Role: Author
PersonId: 1439977

Humboldt-Universität zu Berlin = Humboldt University of Berlin = Université Humboldt de Berlin

Olaf Hellwich

Role: Author

Technical University of Berlin / Technische Universität Berlin

Abstract

Human communication is multi-modal; for example, face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work, we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-task multi-modal self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine how learning through different combinations of self-supervised tasks performs on the facial expression recognition downstream task. Our model ConCluGen outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as source code publicly.
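The way three self-supervised objectives can be combined into one training loss may be easier to follow with a small sketch. The code below is illustrative only and is not taken from the authors' released source: the abstract does not give the loss formulas, so the InfoNCE-style contrastive term, the nearest-centroid clustering term, the mean-squared-error reconstruction term, the weighted sum, and all function and parameter names (`contrastive_loss`, `clustering_loss`, `reconstruction_loss`, `combined_loss`, `temperature`, `weights`) are assumptions chosen for clarity.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """InfoNCE-style loss (assumed form): embeddings of the same clip in two
    modalities are positives, all other clips in the batch are negatives."""
    v = [normalize(x) for x in video_emb]
    a = [normalize(x) for x in audio_emb]
    total = 0.0
    for i in range(len(v)):
        logits = [dot(v[i], a[j]) / temperature for j in range(len(a))]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(v)


def clustering_loss(embeddings, centroids):
    """Mean squared distance from each embedding to its nearest centroid
    (assumed form; a k-means-like objective)."""
    total = 0.0
    for e in embeddings:
        total += min(
            sum((x - c) ** 2 for x, c in zip(e, centroid))
            for centroid in centroids
        )
    return total / len(embeddings)


def reconstruction_loss(inputs, reconstructions):
    """Mean squared error between inputs and their reconstructions."""
    total, count = 0.0, 0
    for x, r in zip(inputs, reconstructions):
        for xi, ri in zip(x, r):
            total += (xi - ri) ** 2
            count += 1
    return total / count


def combined_loss(contrastive, clustering, reconstruction,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; equal weights are an assumption."""
    return (weights[0] * contrastive
            + weights[1] * clustering
            + weights[2] * reconstruction)


# Toy batch of two clips with 2-D embeddings per modality (made-up numbers).
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.1, 0.9]]
centroids = [[1.0, 0.0], [0.0, 1.0]]

loss = combined_loss(
    contrastive_loss(video, audio),
    clustering_loss(video, centroids),
    reconstruction_loss(video, audio),  # audio stands in for a decoder output
)
print(loss)
```

In this toy batch the two modalities of each clip are already close, so the contrastive term is small; swapping the audio rows makes it grow, which is the behavior a contrastive objective is meant to penalize. Dropping one of the three terms from the sum corresponds to studying a different combination of self-supervised tasks.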

Keywords

Facial Expression Recognition; Multi-modality; Self-supervised learning

Domains

Computer Science [cs]

Main file

2404.10904v2.pdf (863)

Origin: Files produced by the author(s)

Contributor: Pia Bideau

https://hal.univ-grenoble-alpes.fr/hal-04778749

Submitted on: Tuesday, November 12, 2024, 18:16:25

Last modified on: Friday, February 7, 2025, 18:36:52

Dates and versions

hal-04778749, version 1 (12-11-2024)

License

Attribution

Identifiers

HAL Id: hal-04778749, version 1
DOI: 10.1109/CVPRW63382.2024.00463

Cite

Marah Halawa, Florian Blume, Pia Bideau, Martin Maier, Rasha Abdel Rahman, et al. Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition. CVPR 2024 - IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jun 2024, Seattle, United States. pp.1-12, ⟨10.1109/CVPRW63382.2024.00463⟩. ⟨hal-04778749⟩

Collections

UGA CNRS INRIA INSMI LJK LJK_GI INRIA2 LJK-GI-THOTH MIAI ANR ANR-IA
