Audiocite.net: A Large Spoken Read Dataset in French
Abstract
The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets
to pre-train models that serve as powerful encoders for various downstream tasks. However, the application
of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech
datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net
corpus, composed of 6,682 hours of recordings from 130 readers. This corpus is built from audiobooks from
the audiocite.net website. In addition to describing the creation process and final statistics, we also show how
this dataset impacted the models of the LeBenchmark project in its 14k version for downstream speech processing tasks.