Extracting Sentence Simplification Pairs from French Comparable Corpora Using a Two-Step Filtering Method - Université Grenoble Alpes
Conference paper. Year: 2023

Abstract

Automatic Text Simplification (ATS) aims to simplify texts by reducing their linguistic complexity while retaining their meaning. Although the task is interesting from both a societal and a computational perspective, the lack of monolingual parallel data prevents an agile implementation of ATS models, especially in languages less resource-rich than English. For these reasons, this paper investigates how to create a general-language parallel simplification dataset for French, using a method that extracts complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. Using a two-step automatic filtering process, we sequentially address the two primary conditions that a simplified sentence must satisfy to be considered valid: i) preservation of the original meaning, and ii) simplicity gain with respect to the source text. With this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French sequence-to-sequence general-language ATS models.
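
The two-step filtering idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity or complexity measures the authors use, so everything below is a hypothetical stand-in: `meaning_score` uses plain token overlap (Jaccard) instead of a real semantic similarity model, `complexity` is a crude length-based proxy, and the thresholds `min_meaning` and `min_gain` are arbitrary illustrative values. It only shows the sequential structure: a candidate pair must first pass the meaning-preservation check, and only then is it tested for simplicity gain.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Lowercase and extract word tokens (\\w matches accented letters on str)."""
    return re.findall(r"\w+", sentence.lower())


def meaning_score(complex_sent: str, simple_sent: str) -> float:
    """Hypothetical stand-in for semantic similarity: Jaccard token overlap."""
    a, b = set(tokenize(complex_sent)), set(tokenize(simple_sent))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def complexity(sentence: str) -> float:
    """Hypothetical complexity proxy: token count plus mean token length."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return len(tokens) + sum(len(t) for t in tokens) / len(tokens)


def two_step_filter(pairs, min_meaning=0.3, min_gain=0.0):
    """Keep (complex, simple) pairs that pass both checks, in order."""
    kept = []
    for complex_sent, simple_sent in pairs:
        # Step 1: meaning preservation (assumed threshold).
        if meaning_score(complex_sent, simple_sent) < min_meaning:
            continue
        # Step 2: simplicity gain relative to the source sentence.
        if complexity(complex_sent) - complexity(simple_sent) <= min_gain:
            continue
        kept.append((complex_sent, simple_sent))
    return kept


# Illustrative candidate pairs (invented for this sketch, not from WiViCo).
pairs = [
    # Same meaning, shorter target: passes both steps.
    ("Le chat domestique est un mammifère carnivore de la famille des félidés.",
     "Le chat est un mammifère carnivore."),
    # Unrelated sentences: rejected at step 1.
    ("Paris est la capitale de la France.",
     "La tour Eiffel mesure 330 mètres."),
    # Related, but the target is longer: rejected at step 2.
    ("Le chat dort.",
     "Le chat domestique dort paisiblement."),
]

kept = two_step_filter(pairs)
```

In a realistic setting, the first step would more plausibly rely on sentence embeddings or another semantic similarity measure, and the second on a readability or complexity estimate suited to French; the ordering of the two checks is the only part taken from the abstract.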
Main file
2023_ORMAECHEA_SWISSTEXT.pdf (304.63 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04283235, version 1 (13-11-2023)

Identifiers

  • HAL Id: hal-04283235, version 1

Cite

Lucía Ormaechea, Nikos Tsourakis. Extracting Sentence Simplification Pairs from French Comparable Corpora Using a Two-Step Filtering Method. Swiss Text Analytics Conference, Swiss Association for Natural Language Processing (SwissNLP), Jun 2023, Neuchâtel (CH), Switzerland. ⟨hal-04283235⟩

Collections

UGA ANR
5 views
8 downloads