TY - GEN
T1 - Automatic detection of regional words for Pan-Hispanic Spanish on Twitter
AU - Jimenez, Sergio
AU - Dueñas, George
AU - Gelbukh, Alexander
AU - Rodriguez-Diaz, Carlos A.
AU - Mancera, Sergio
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2018.
PY - 2018
Y1 - 2018
N2 - Languages, such as Spanish, spoken by hundreds of millions of people in large geographic areas are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated ‘tweets’. The method combines the notions of specificity (tf-idf), space correlation (HSIC) and neural word embedding (word2vec) to produce a list of words ranked by their degree of regionalism along with their meaning represented by a set of words semantically related and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users contribute freely with regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example ‘tweets’.
KW - Automatic regional words detection
KW - HSIC
KW - Regionalisms meaning
KW - Spanish regionalisms
KW - TF-IDF
KW - Word2vec
UR - http://www.scopus.com/inward/record.url?scp=85057107918&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-03928-8_33
DO - 10.1007/978-3-030-03928-8_33
M3 - Conference contribution
SN - 9783030039271
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 404
EP - 416
BT - Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings
A2 - Fermé, Eduardo
A2 - Simari, Guillermo R.
A2 - Gutiérrez Segura, Flabio
A2 - Rodríguez Melquiades, José Antonio
PB - Springer Verlag
T2 - 16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018
Y2 - 13 November 2018 through 16 November 2018
ER -