Automatic detection of regional words for Pan-Hispanic Spanish on Twitter

Sergio Jimenez, George Dueñas, Alexander Gelbukh, Carlos A. Rodriguez-Diaz, Sergio Mancera

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

4 Citations (Scopus)

Abstract

Languages such as Spanish, spoken by hundreds of millions of people across large geographic areas, are subject to a high degree of regional variation. Regional words are frequently used in informal contexts, but their meaning is shared only by a relatively small group of people. Dealing with these regionalisms is a challenge for most applications in the field of Natural Language Processing. We propose a novel method to identify regional words and provide their meaning based on a large corpus of geolocated ‘tweets’. The method combines the notions of specificity (tf-idf), spatial correlation (HSIC) and neural word embeddings (word2vec) to produce a list of words ranked by their degree of regionalism, along with their meaning represented by a set of semantically related words and examples of use. The method was evaluated against lists of regional words taken from regional dictionaries produced by lexicographers and from collaborative websites where users freely contribute regional words. We tested the effectiveness of the proposed method and produced a new resource for 21 Spanish-speaking countries composed of 5,000 regional words per country along with similar words and example ‘tweets’.
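
As a rough illustration of the specificity (tf-idf) notion mentioned in the abstract, the sketch below treats all tweets from one country as a single document and scores each word by its within-country frequency times the log inverse of the number of countries where it appears. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names `regional_specificity` and `top_words`, the whitespace tokenization, and the toy tweets are all hypothetical, and the HSIC and word2vec components are not shown.

```python
import math
from collections import Counter


def regional_specificity(tweets_by_country):
    """Score words per country with tf-idf, one 'document' per country.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; a real pipeline would need a proper tweet tokenizer.
    """
    counts = {
        country: Counter(w for tweet in tweets for w in tweet.lower().split())
        for country, tweets in tweets_by_country.items()
    }
    n_countries = len(counts)
    # Document frequency: in how many countries each word occurs at least once.
    df = Counter(w for counter in counts.values() for w in counter)
    scores = {}
    for country, counter in counts.items():
        total = sum(counter.values())
        scores[country] = {
            w: (n / total) * math.log(n_countries / df[w])
            for w, n in counter.items()
        }
    return scores


def top_words(scores, country, k=3):
    """Return the k highest-scoring words for a country (ties broken alphabetically)."""
    ranked = sorted(scores[country].items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


# Toy, made-up data for illustration only (not taken from the paper's corpus).
toy = {
    "MX": ["que chido el concierto", "muy chido todo"],
    "ES": ["que guay el concierto", "muy guay todo"],
    "CO": ["que chévere el concierto", "muy chévere todo"],
}
scores = regional_specificity(toy)
print(top_words(scores, "MX", k=1))
```

Words shared by every country receive a score of zero here because their inverse document frequency is log(1), so only words concentrated in a few countries rise in the ranking.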

Original language: English
Title of host publication: Advances in Artificial Intelligence – IBERAMIA 2018 - 16th Ibero-American Conference on AI, Proceedings
Editors: Eduardo Fermé, Guillermo R. Simari, Flabio Gutiérrez Segura, José Antonio Rodríguez Melquiades
Publisher: Springer Verlag
Pages: 404-416
Number of pages: 13
ISBN (print): 9783030039271
Status: Published - 2018
Event: 16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018 - Trujillo, Peru
Duration: 13 Nov 2018 to 16 Nov 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11238 LNAI
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 16th Ibero-American Conference on Artificial Intelligence, IBERAMIA 2018
Country/Territory: Peru
City: Trujillo
Period: 13/11/18 to 16/11/18
