Audiocite.net: A Large Spoken Read Dataset in French

The advent of self-supervised learning (SSL) in speech processing has made it possible to use large unlabeled datasets to train pre-trained models that serve as powerful encoders for various downstream tasks. However, applying these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To support the emergence of pre-trained models for French speech, we present the Audiocite.net corpus, composed of 6,682 hours of recordings from 130 readers. The corpus is built from audiobooks from the audiocite.net website. In addition to describing the creation process and final statistics, we show how this dataset impacted the models of the LeBenchmark project in its 14k version on speech processing downstream tasks.

Keywords

Spoken Datasets, French Speech, Self Supervised Learning, Automatic Speech Processing

Domains

Information and communication sciences; Artificial intelligence [cs.AI]

Main file

2024_Felice_audiocite.pdf (69.23 KB)

Origin: Files produced by the author(s)

Contributor: Solène Evain

https://hal.science/hal-04533994

Submitted on: Tuesday, April 9, 2024, 11:01:02

Last modified on: Monday, December 9, 2024, 03:22:04

Long-term archiving on: Wednesday, July 10, 2024, 18:12:40

Dates and versions

hal-04533994, version 1 (April 9, 2024)

Identifiers

HAL Id: hal-04533994, version 1

Cite

Soline Felice, Solène Evain, Solange Rossato, François Portet. Audiocite.net: A Large Spoken Read Dataset in French. The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, Turin, Italy. ⟨hal-04533994⟩


Collections

UGA CNRS LIG LIG_TDCGE_GETALP MIAI ANR LIG_SIDCH
