Extracting Sentence Simplification Pairs from French Comparable Corpora Using a Two-Step Filtering Method
Abstract
Automatic Text Simplification (ATS) aims to reduce the linguistic complexity of texts while retaining their meaning. Although the task is interesting from both a societal and a computational perspective, the lack of monolingual parallel data hinders the rapid development of ATS models, especially for languages with fewer resources than English. This paper therefore investigates how to create a general-language parallel simplification dataset for French by extracting complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. A two-step automatic filtering process addresses, in sequence, the two primary conditions that a simplified sentence must satisfy to be considered valid: i) preservation of the original meaning, and ii) a gain in simplicity with respect to the source text. Using this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French sequence-to-sequence general-language ATS models.
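The two conditions named in the abstract can be illustrated with a minimal sketch of a sequential filter. The abstract does not specify the measures used, so everything below is an assumption made for illustration: `overlap_similarity` (token Jaccard overlap) stands in for a meaning-preservation score, `toy_complexity` (sentence length plus mean word length) stands in for a complexity score, and the function names and thresholds are hypothetical.

```python
import re
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (complex sentence, candidate simple sentence)


def tokens(sentence: str) -> List[str]:
    """Lowercased word tokens; \\w is Unicode-aware, so accented letters are kept."""
    return re.findall(r"\w+", sentence.lower())


def overlap_similarity(a: str, b: str) -> float:
    """ASSUMPTION: Jaccard overlap of token sets as a crude meaning proxy."""
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def toy_complexity(sentence: str) -> float:
    """ASSUMPTION: word count plus mean word length as a crude complexity proxy."""
    words = tokens(sentence)
    if not words:
        return 0.0
    return len(words) + sum(len(w) for w in words) / len(words)


def two_step_filter(
    candidates: Iterable[Pair],
    similarity: Callable[[str, str], float] = overlap_similarity,
    complexity: Callable[[str], float] = toy_complexity,
    min_similarity: float = 0.3,  # hypothetical threshold
    min_gain: float = 0.0,        # hypothetical threshold
) -> List[Pair]:
    # Step 1: keep pairs whose meaning is (approximately) preserved.
    meaning_ok = [(c, s) for c, s in candidates if similarity(c, s) >= min_similarity]
    # Step 2: of those, keep pairs where the target is simpler than the source.
    return [(c, s) for c, s in meaning_ok if complexity(c) - complexity(s) > min_gain]
```

Passing the two scoring functions as parameters keeps the sequential structure separate from the choice of measures, so either proxy can be swapped for a stronger one without changing the filter itself.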
Domains
Computer Science [cs]
Origin: Files produced by the author(s)