Generating unlabelled data for a tri-training approach in a low resourced NER task

Hugo Boulanger; Thomas Lavergne; Sophie Rosset

doi:10.18653/v1/2022.deeplo-1.4

Conference paper. Year: 2022

Hugo Boulanger

Role: Author
PersonId: 1174076
IdHAL: hugo-boulanger
ORCID: 0009-0003-0220-5691

Information, Langue Ecrite et Signée

Thomas Lavergne

Role: Author
PersonId: 1296801
IdHAL: lavergne-thomas

Information, Langue Ecrite et Signée

Sophie Rosset

Role: Author
PersonId: 14913
IdHAL: sophie-rosset
ORCID: 0000-0002-6865-4989
IdRef: 137157835

Information, Langue Ecrite et Signée

Abstract

Training a tagger for Named Entity Recognition (NER) requires a substantial amount of labeled data in the task domain, and manual labeling is tedious and complicated. Semi-supervised learning methods can reduce the quantity of labeled data needed to train a model. However, these methods require large quantities of unlabeled data, which remains an issue in many cases. We address this problem by generating unlabeled data. Large language models have proven to be powerful tools for text generation. We use their generative capacity to produce new sentences and variations of the sentences in our available data. This generation method, combined with a semi-supervised method, is evaluated on CoNLL and I2B2. We prepare both corpora to simulate a low-resource setting. We obtain significant improvements for semi-supervised learning with synthetic data over supervised learning on natural data.
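The semi-supervised method named in the title is tri-training. As a rough illustration of the general idea only, the sketch below runs a simplified tri-training loop on toy 2-D points. It is not the authors' implementation: the nearest-centroid classifier stands in for a NER tagger, the random points stand in for generated sentences, and the error-rate check of the original tri-training algorithm is omitted. All function names and data here are hypothetical.

```python
import random
from collections import Counter

def train_centroids(points, labels):
    """Fit a toy nearest-centroid classifier: one mean vector per label."""
    sums, counts = {}, Counter(labels)
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, p):
    """Return the label whose centroid is closest to p (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], p)),
    )

def tri_train(labeled, unlabeled, rounds=3, seed=0):
    """Simplified tri-training: three models teach each other with pseudo-labels."""
    rng = random.Random(seed)
    # Each model starts from its own bootstrap sample of the labeled data.
    models = []
    for _ in range(3):
        sample = [rng.choice(labeled) for _ in labeled]
        models.append(train_centroids(*zip(*sample)))
    for _ in range(rounds):
        updated = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            # Keep an unlabeled point only when the two other models agree on it.
            pseudo = []
            for p in unlabeled:
                yj, yk = predict(models[j], p), predict(models[k], p)
                if yj == yk:
                    pseudo.append((p, yj))
            # Retrain model i on the labeled data plus its pseudo-labeled points.
            updated.append(train_centroids(*zip(*(list(labeled) + pseudo))))
        models = updated
    return models

def vote(models, p):
    """Final prediction: majority vote of the three models."""
    return Counter(predict(m, p) for m in models).most_common(1)[0][0]

# Toy data (assumption): two well-separated clusters, four labeled points.
labeled = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"), ((5.0, 5.0), "B"), ((4.0, 5.0), "B")]
data_rng = random.Random(1)
unlabeled = [(data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)) for _ in range(10)]
unlabeled += [(data_rng.gauss(5, 0.5), data_rng.gauss(5, 0.5)) for _ in range(10)]

models = tri_train(labeled, unlabeled)
print(vote(models, (0.2, 0.1)), vote(models, (4.8, 5.1)))
```

In the setting the abstract describes, the `unlabeled` pool would hold sentences produced by a language model rather than natural text, and each of the three models would be a sequence tagger.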

Domains

Computation and Language [cs.CL]

Main file

2022.deeplo-1.4.pdf (227.95 KB)

Origin: Publisher files allowed on an open archive

Contributor: Hugo Boulanger

https://hal.science/hal-03813272

Submitted on: Thursday, October 13, 2022, 11:24:08

Last modified on: Tuesday, February 6, 2024, 14:40:07

Long-term archiving on: Saturday, January 14, 2023, 18:44:46

Dates and versions

hal-03813272, version 1 (13-10-2022)

Identifiers

HAL Id: hal-03813272, version 1
DOI: 10.18653/v1/2022.deeplo-1.4

Cite

Hugo Boulanger, Thomas Lavergne, Sophie Rosset. Generating unlabelled data for a tri-training approach in a low resourced NER task. Third Workshop on Deep Learning for Low-Resource Natural Language Processing, Jul 2022, Hybrid, Seattle, United States. pp.30-37, ⟨10.18653/v1/2022.deeplo-1.4⟩. ⟨hal-03813272⟩

Collections

CNRS INRIA CENTRALESUPELEC GENCI UNIV-PARIS-SACLAY LISN GS-COMPUTER-SCIENCE LISN-ILES
