Word Representations Concentrate and This is Good News!

Romain Couillet; Yagmur Gizem Cinar; Éric Gaussier; Muhammad Imran

doi:10.18653/v1/2020.conll-1.25

Conference paper. Year: 2020

Romain Couillet

Role: Author
PersonId: 170874
IdHAL: romain-couillet
ORCID: 0000-0001-5755-2090
IdRef: 15645713X

GIPSA Pôle Géométrie, Apprentissage, Information et Algorithmes

Yagmur Gizem Cinar

Role: Author

Laboratoire d'Informatique de Grenoble

Éric Gaussier

Role: Author
PersonId: 182833
IdHAL: eric-gaussier
ORCID: 0000-0002-8858-3233
IdRef: 074308297

Laboratoire d'Informatique de Grenoble

Muhammad Imran

Role: Author
PersonId: 769398
ORCID: 0000-0003-1892-8379

Laboratoire d'Informatique de Grenoble

Abstract

This article establishes that, unlike the legacy tf*idf representation, recent natural language representations (word embedding vectors) tend to exhibit a so-called concentration of measure phenomenon, in the sense that, as the representation size p and database size n are both large, their behavior is similar to that of large dimensional Gaussian random vectors. This phenomenon may have important consequences as machine learning algorithms for natural language data could be amenable to improvement, thereby providing new theoretical insights into the field of natural language processing.
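
As a rough illustration of what concentration means in the Gaussian reference case the abstract compares against, the sketch below draws standard Gaussian vectors of increasing dimension p and measures how much their Euclidean norms vary relative to their mean. It is not taken from the paper: the function name `relative_norm_spread`, the sample size, and the seed are assumptions made for this example, and it says nothing about actual word embeddings.

```python
import math
import random
import statistics


def relative_norm_spread(p, n=200, seed=0):
    """Return std/mean of the Euclidean norms of n standard Gaussian vectors in R^p.

    Hypothetical helper for illustration only; not part of the paper.
    """
    rng = random.Random(seed)
    norms = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p)))
        for _ in range(n)
    ]
    # Population standard deviation of the norms, divided by their mean.
    return statistics.pstdev(norms) / statistics.mean(norms)


if __name__ == "__main__":
    # The ratio should shrink as the dimension p grows.
    for p in (2, 20, 200):
        print(p, relative_norm_spread(p))
```

For Gaussian vectors the norm is of order sqrt(p) while its fluctuations stay of order 1, so the printed ratio is expected to shrink roughly like 1/sqrt(p) as p grows.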

Keywords

word embedding vectors; machine learning algorithms

Domains

Artificial Intelligence [cs.AI]; Computation and Language [cs.CL]

Main file

CouilletRomain_CinarYagmurGizem_2020.pdf (3.31 MB)

Origin: Publisher files allowed on an open archive

Contributor: Anne-Christine Jacob

https://hal.univ-grenoble-alpes.fr/hal-03356609

Submitted on: Monday, October 4, 2021, 13:35:53

Last modified on: Wednesday, December 18, 2024, 10:13:25

Long-term archiving on: Wednesday, January 5, 2022, 18:02:28

Dates and versions

hal-03356609, version 1 (04-10-2021)

License

Attribution

Identifiers

HAL Id: hal-03356609, version 1
DOI: 10.18653/v1/2020.conll-1.25

Cite

Romain Couillet, Yagmur Gizem Cinar, Éric Gaussier, Muhammad Imran. Word Representations Concentrate and This is Good News!. CoNLL 2020 - 24th Conference on Computational Natural Language Learning, Association for Computational Linguistics (ACL), Nov 2020, Online, France. pp. 325-334, ⟨10.18653/v1/2020.conll-1.25⟩. ⟨hal-03356609⟩

Collections

UGA, CNRS, GIPSA, LIG, GIPSA-GAIA, MIAI, ANR, LIG_SIDCH

106 views

158 downloads
