TY - JOUR
T1 - Dialectones: Finding statistically significant dialectal boundaries using Twitter data
AU - Rodriguez-Diaz, Carlos A.
AU - Jimenez, Sergio
AU - Dueñas, George
AU - Bonilla, Johnatan Estiven
AU - Gelbukh, Alexander
N1 - Publisher Copyright:
© 2018 Instituto Politecnico Nacional. All rights reserved.
PY - 2018/1/1
Y1 - 2018/1/1
N2 - Most NLP applications assume that a particular language is homogeneous in the regions where it is spoken. However, each language varies considerably throughout its geographical distribution. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantitatively represents such geographical variation is necessary. However, some current approaches have disadvantages such as the need for parameters, the disregard of geographical coordinates in the analysis, and the use of linguistic alternations that presuppose the existence of specific dialectal varieties. Detection of “ecotones” is an analogous problem in the field of ecology that focuses on the identification of boundaries, instead of regions, in ecosystems, which facilitates the construction of statistical tests. We adapted the concept of “ecotone” to “dialectone” for the detection of dialectal boundaries using two non-parametric statistical tests: the Hilbert-Schmidt independence criterion (HSIC) and the Wilcoxon signed-rank test. The proposed method was applied to a large corpus of Spanish tweets produced in 160 locations in Colombia through the analysis of unigram features. The resulting dialectones proved to be meaningful but difficult to compare against regions identified by other authors using classical dialectometry. We concluded that the automatic detection of dialectones is a convenient alternative to classical methods in dialectometry and a potential source of information for automatic language applications.
AB - Most NLP applications assume that a particular language is homogeneous in the regions where it is spoken. However, each language varies considerably throughout its geographical distribution. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantitatively represents such geographical variation is necessary. However, some current approaches have disadvantages such as the need for parameters, the disregard of geographical coordinates in the analysis, and the use of linguistic alternations that presuppose the existence of specific dialectal varieties. Detection of “ecotones” is an analogous problem in the field of ecology that focuses on the identification of boundaries, instead of regions, in ecosystems, which facilitates the construction of statistical tests. We adapted the concept of “ecotone” to “dialectone” for the detection of dialectal boundaries using two non-parametric statistical tests: the Hilbert-Schmidt independence criterion (HSIC) and the Wilcoxon signed-rank test. The proposed method was applied to a large corpus of Spanish tweets produced in 160 locations in Colombia through the analysis of unigram features. The resulting dialectones proved to be meaningful but difficult to compare against regions identified by other authors using classical dialectometry. We concluded that the automatic detection of dialectones is a convenient alternative to classical methods in dialectometry and a potential source of information for automatic language applications.
KW - Corpus-based dialectometry
KW - Dialectometry
KW - Dialectone
KW - Ecotone
KW - Hilbert-Schmidt independence criterion
KW - Nonparametric method
KW - Wilcoxon signed-rank test
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85069591779&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85069591779&origin=inward
U2 - 10.13053/CyS-22-4-3104
DO - 10.13053/CyS-22-4-3104
M3 - Article
AN - SCOPUS:85069591779
SN - 1405-5546
VL - 22
SP - 1213
EP - 1222
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 4
ER -