Automatic detection of regional words for Pan-Hispanic Spanish on Twitter

Sergio Jimenez; George Dueñas; Alexander Gelbukh; Carlos A. Rodriguez-Diaz; Sergio Mancera

doi:10.1007/978-3-030-03928-8_33

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Languages, such as Spanish, spoken by hundreds of millions of people in large geographic areas are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated ‘tweets’. The method combines the notions of specificity (tf-idf), space correlation (HSIC) and neural word embedding (word2vec) to produce a list of words ranked by their degree of regionalism along with their meaning represented by a set of words semantically related and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users contribute freely with regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example ‘tweets’.
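The specificity component mentioned in the abstract can be illustrated with a minimal sketch. It assumes each country's tweets are pooled into a single "document" and that plain tf-idf is used; the abstract does not state the exact weighting, so the function names (`regional_tfidf`, `top_words`) and the toy token lists are illustrative assumptions rather than the authors' implementation. The HSIC and word2vec steps are not covered here.

```python
import math
from collections import Counter


def regional_tfidf(tokens_by_region):
    """Score every word per region with tf-idf, treating each region as one document.

    Assumption: tf is the relative frequency inside the region and
    idf is log(N / df); the paper may use a different variant.
    """
    n_regions = len(tokens_by_region)
    counts = {region: Counter(tokens) for region, tokens in tokens_by_region.items()}

    # Document frequency: number of regions where the word occurs at least once.
    df = Counter()
    for region_counts in counts.values():
        df.update(region_counts.keys())

    scores = {}
    for region, region_counts in counts.items():
        total = sum(region_counts.values())
        scores[region] = {
            word: (n / total) * math.log(n_regions / df[word])
            for word, n in region_counts.items()
        }
    return scores


def top_words(scores, region, k=3):
    """Return the k highest-scoring words for a region (ties broken alphabetically)."""
    ranked = sorted(scores[region].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


# Toy, invented token lists standing in for geolocated tweets.
toy = {
    "CO": "parce vamos a la casa parce".split(),
    "MX": "wey vamos a la casa".split(),
    "AR": "che vamos a la casa che".split(),
}
scores = regional_tfidf(toy)
print(top_words(scores, "CO", k=1))
```

Words used in every region (here "vamos", "a", "la", "casa") get a score of zero because their idf is log(1), while a word confined to one region is ranked first. On real data, a score like this would only be one signal among the three the abstract describes.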

Original language	English
Title of host publication	Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings
Editors	Eduardo Fermé, Guillermo R. Simari, Flabio Gutiérrez Segura, José Antonio Rodríguez Melquiades
Publisher	Springer Verlag
Pages	404-416
Number of pages	13
ISBN (Print)	9783030039271
DOIs	https://doi.org/10.1007/978-3-030-03928-8_33
State	Published - 2018
Event	16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018 - Trujillo, Peru
Duration	13 Nov 2018 → 16 Nov 2018

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	11238 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018
Country/Territory	Peru
City	Trujillo
Period	13/11/18 → 16/11/18

Keywords

Automatic regional words detection
HSIC
Regionalisms meaning
Spanish regionalisms
TF-IDF
Word2vec

Cite this

Jimenez, S., Dueñas, G., Gelbukh, A., Rodriguez-Diaz, C. A., & Mancera, S. (2018). Automatic detection of regional words for pan-hispanic spanish on twitter. In E. Fermé, G. R. Simari, F. Gutiérrez Segura, & J. A. Rodríguez Melquiades (Eds.), Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings (pp. 404-416). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11238 LNAI). Springer Verlag. https://doi.org/10.1007/978-3-030-03928-8_33

Jimenez, Sergio ; Dueñas, George ; Gelbukh, Alexander et al. / Automatic detection of regional words for pan-hispanic spanish on twitter. Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings. editor / Eduardo Fermé ; Guillermo R. Simari ; Flabio Gutiérrez Segura ; José Antonio Rodríguez Melquiades. Springer Verlag, 2018. pp. 404-416 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{7a57ec2bbc6240e28bd564d02363f15b,

title = "Automatic detection of regional words for pan-hispanic spanish on twitter",

abstract = "Languages, such as Spanish, spoken by hundreds of millions of people in large geographic areas are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated {\textquoteleft}tweets{\textquoteright}. The method combines the notions of specificity (tf-idf), space correlation (HSIC) and neural word embedding (word2vec) to produce a list of words ranked by their degree of regionalism along with their meaning represented by a set of words semantically related and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users contribute freely with regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example {\textquoteleft}tweets{\textquoteright}.",

keywords = "Automatic regional words detection, HSIC, Regionalisms meaning, Spanish regionalisms, TF-IDF, Word2vec",

author = "Sergio Jimenez and George Due{\~n}as and Alexander Gelbukh and Rodriguez-Diaz, {Carlos A.} and Sergio Mancera",

note = "Publisher Copyright: {\textcopyright} Springer Nature Switzerland AG 2018.; 16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018 ; Conference date: 13-11-2018 Through 16-11-2018",

year = "2018",

doi = "10.1007/978-3-030-03928-8_33",

language = "English",

isbn = "9783030039271",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "404--416",

editor = "Eduardo Ferm{\'e} and Simari, {Guillermo R.} and {Guti{\'e}rrez Segura}, Flabio and {Rodr{\'i}guez Melquiades}, {Jos{\'e} Antonio}",

booktitle = "Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings",

address = "Germany",

}

Jimenez, S, Dueñas, G, Gelbukh, A, Rodriguez-Diaz, CA & Mancera, S 2018, Automatic detection of regional words for pan-hispanic spanish on twitter. in E Fermé, GR Simari, F Gutiérrez Segura & JA Rodríguez Melquiades (eds), Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11238 LNAI, Springer Verlag, pp. 404-416, 16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018, Trujillo, Peru, 13/11/18. https://doi.org/10.1007/978-3-030-03928-8_33

Automatic detection of regional words for pan-hispanic spanish on twitter. / Jimenez, Sergio; Dueñas, George; Gelbukh, Alexander et al.
Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings. ed. / Eduardo Fermé; Guillermo R. Simari; Flabio Gutiérrez Segura; José Antonio Rodríguez Melquiades. Springer Verlag, 2018. p. 404-416 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11238 LNAI).

TY - GEN

T1 - Automatic detection of regional words for pan-hispanic spanish on twitter

AU - Jimenez, Sergio

AU - Dueñas, George

AU - Gelbukh, Alexander

AU - Rodriguez-Diaz, Carlos A.

AU - Mancera, Sergio

N1 - Publisher Copyright: © Springer Nature Switzerland AG 2018.

PY - 2018

Y1 - 2018

AB - Languages, such as Spanish, spoken by hundreds of millions of people in large geographic areas are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated ‘tweets’. The method combines the notions of specificity (tf-idf), space correlation (HSIC) and neural word embedding (word2vec) to produce a list of words ranked by their degree of regionalism along with their meaning represented by a set of words semantically related and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users contribute freely with regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example ‘tweets’.

KW - Automatic regional words detection

KW - HSIC

KW - Regionalisms meaning

KW - Spanish regionalisms

KW - TF-IDF

KW - Word2vec

UR - http://www.scopus.com/inward/record.url?scp=85057107918&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-03928-8_33

DO - 10.1007/978-3-030-03928-8_33

M3 - Conference contribution

SN - 9783030039271

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 404

EP - 416

BT - Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings

A2 - Fermé, Eduardo

A2 - Simari, Guillermo R.

A2 - Gutiérrez Segura, Flabio

A2 - Rodríguez Melquiades, José Antonio

PB - Springer Verlag

T2 - 16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018

Y2 - 13 November 2018 through 16 November 2018

ER -

Jimenez S, Dueñas G, Gelbukh A, Rodriguez-Diaz CA, Mancera S. Automatic detection of regional words for pan-hispanic spanish on twitter. In Fermé E, Simari GR, Gutiérrez Segura F, Rodríguez Melquiades JA, editors, Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings. Springer Verlag. 2018. p. 404-416. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-030-03928-8_33
