Conference Papers, Year: 2020

Word Representations Concentrate and This is Good News!

Abstract

This article establishes that, unlike the legacy tf*idf representation, recent natural language representations (word embedding vectors) tend to exhibit a so-called concentration of measure phenomenon: when the representation size p and the database size n are both large, they behave similarly to large-dimensional Gaussian random vectors. This phenomenon may have important consequences, as machine learning algorithms for natural language data could be amenable to improvement, thereby providing new theoretical insights into the field of natural language processing.
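
As a rough illustration of the Gaussian reference behavior mentioned in the abstract (not the authors' experiments, and not run on word embeddings), the sketch below draws standard Gaussian vectors of increasing dimension p and measures how tightly their Euclidean norms cluster around the mean norm. The function name `relative_norm_spread`, the sample count, and the seed are assumptions made for this example only.

```python
import math
import random


def relative_norm_spread(p, n_samples=200, seed=0):
    """Return std/mean of the Euclidean norms of n_samples
    standard Gaussian vectors in R^p (hypothetical helper)."""
    rng = random.Random(seed)
    norms = []
    for _ in range(n_samples):
        # One p-dimensional vector with i.i.d. N(0, 1) entries.
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norms.append(math.sqrt(sum(v * v for v in x)))
    mean = sum(norms) / n_samples
    var = sum((r - mean) ** 2 for r in norms) / n_samples
    return math.sqrt(var) / mean


if __name__ == "__main__":
    # The ratio should shrink as p grows: the norms concentrate.
    for p in (2, 20, 200, 2000):
        print(p, round(relative_norm_spread(p), 4))
```

For standard Gaussian vectors the ratio decays roughly like 1/sqrt(2p), so the norm becomes nearly deterministic in high dimension. This is one simple symptom of concentration of measure; whether a given set of embeddings behaves this way is what the paper investigates.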
Main file
CouilletRomain_CinarYagmurGizem_2020.pdf (3.31 MB)
Origin : Publisher files allowed on an open archive

Dates and versions

hal-03356609, version 1 (04-10-2021)

Licence

Attribution - CC BY 4.0

Cite

Romain Couillet, Yagmur Gizem Cinar, Éric Gaussier, Muhammad Imran. Word Representations Concentrate and This is Good News!. CoNLL 2020 - 24th Conference on Computational Natural Language Learning, Association for Computational Linguistics (ACL), Nov 2020, Online, France. pp.325-334, ⟨10.18653/v1/2020.conll-1.25⟩. ⟨hal-03356609⟩