Extracting Sentence Simplification Pairs from French Comparable Corpora Using a Two-Step Filtering Method

Lucía Ormaechea; Nikos Tsourakis

Conference paper, Year: 2023

Lucía Ormaechea

Role: Author
PersonId : 1144128
IdHAL : lucia-ormaechea
ORCID : 0000-0002-1118-5675

Université de Genève = University of Geneva

Université Grenoble Alpes

Nikos Tsourakis

Role: Author
PersonId : 864109

Université de Genève = University of Geneva

Abstract

Automatic Text Simplification (ATS) aims to reduce the linguistic complexity of texts while retaining their meaning. Although the task is of interest from both a societal and a computational perspective, the lack of monolingual parallel data hinders the rapid development of ATS models, especially for languages with fewer resources than English. This paper therefore investigates how to create a general-language parallel simplification dataset for French by extracting complex-simple sentence pairs from comparable corpora such as Wikipedia and its simplified counterpart, Vikidia. A two-step automatic filtering process sequentially addresses the two primary conditions that a simplified sentence must satisfy to be considered valid: i) preservation of the original meaning, and ii) a gain in simplicity with respect to the source text. Using this approach, we provide a dataset of parallel sentence simplifications (WiViCo) that can later be used to train French sequence-to-sequence general-language ATS models.
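
The sequential logic of the two-step filter can be illustrated with a minimal Python sketch. The abstract does not state which measures the paper uses, so everything concrete here is an assumption made for illustration only: `meaning_score` (token-set overlap), `complexity` (sentence length plus mean word length), the function name `two_step_filter`, and the threshold `min_meaning` are all hypothetical stand-ins, not the authors' method.

```python
import re
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (complex sentence, candidate simplification)


def tokens(sentence: str) -> List[str]:
    """Lowercase word tokens; the regex word class also matches accented letters."""
    return re.findall(r"\w+", sentence.lower())


def meaning_score(source: str, candidate: str) -> float:
    """Hypothetical stand-in for a similarity measure: Jaccard overlap of token sets."""
    a, b = set(tokens(source)), set(tokens(candidate))
    return len(a & b) / len(a | b) if a | b else 0.0


def complexity(sentence: str) -> float:
    """Hypothetical stand-in for a complexity measure: token count plus mean token length."""
    toks = tokens(sentence)
    if not toks:
        return 0.0
    return len(toks) + sum(len(t) for t in toks) / len(toks)


def two_step_filter(pairs: Iterable[Pair], min_meaning: float = 0.3) -> List[Pair]:
    # Step 1: keep pairs whose candidate plausibly preserves the source meaning.
    preserved = [(s, c) for s, c in pairs if meaning_score(s, c) >= min_meaning]
    # Step 2: among those, keep pairs where the candidate is simpler than the source.
    return [(s, c) for s, c in preserved if complexity(c) < complexity(s)]


if __name__ == "__main__":
    candidates = [
        ("Le chat domestique est un mammifère carnivore de la famille des félidés.",
         "Le chat est un mammifère carnivore."),
        ("Paris est la capitale de la France.",
         "Le football est un sport collectif."),
    ]
    for source, simple in two_step_filter(candidates):
        print(f"{source} -> {simple}")
```

Running the two steps in this order means the simplicity check is only applied to pairs that have already passed the meaning check, which mirrors the sequential treatment of the two conditions described in the abstract. In practice the placeholder scores would presumably be replaced by stronger measures of semantic similarity and readability.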

Keywords

Automatic text simplification; Comparable corpora; Data mining methods

Domains

Computer Science [cs]

Main file

2023_ORMAECHEA_SWISSTEXT.pdf (304.63 KB)

Origin: Files produced by the author(s)

https://hal.univ-grenoble-alpes.fr/hal-04283235

Submitted on: Monday, 13 November 2023, 17:20:48

Last modified on: Wednesday, 18 December 2024, 10:15:05

Dates and versions

hal-04283235 , version 1 (13-11-2023)

Identifiers

HAL Id : hal-04283235 , version 1

Cite

Lucía Ormaechea, Nikos Tsourakis. Extracting Sentence Simplification Pairs from French Comparable Corpora Using a Two-Step Filtering Method. Swiss Text Analytics Conference, Swiss Association for Natural Language Processing (SwissNLP), Jun 2023, Neuchâtel (CH), Switzerland. ⟨hal-04283235⟩

Collections

UGA ANR

10 views

17 downloads
