TY - JOUR
T1 - Dialectones: Finding statistically significant dialectal boundaries using Twitter data
AU - Rodriguez-Diaz, Carlos A.
AU - Jimenez, Sergio
AU - Dueñas, George
AU - Bonilla, Johnatan Estiven
AU - Gelbukh, Alexander
N1 - Publisher Copyright:
© 2018 Instituto Politecnico Nacional. All rights reserved.
PY - 2018/1/1
Y1 - 2018/1/1
N2 - Most NLP applications assume that a particular language is homogeneous in the regions where it is spoken. However, each language varies considerably throughout its geographical distribution. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantitatively represents such geographical variation is necessary. However, some current approaches have disadvantages such as the need for parameters, the disregard of geographical coordinates in the analysis, and the use of linguistic alternations that presuppose the existence of specific dialectal varieties. Detection of “ecotones” is an analogous problem in the field of ecology that focuses on the identification of boundaries, instead of regions, in ecosystems, which facilitates the construction of statistical tests. We adapted the concept of “ecotone” to “dialectone” for the detection of dialectal boundaries using two non-parametric statistical tests: the Hilbert-Schmidt independence criterion (HSIC) and the Wilcoxon signed-rank test. The proposed method was applied to a large corpus of Spanish tweets produced in 160 locations in Colombia through the analysis of unigram features. The resulting dialectones proved to be meaningful but difficult to compare against regions identified by other authors using classical dialectometry. We concluded that the automatic detection of dialectones is a convenient alternative to classical methods in dialectometry and a potential source of information for automatic language applications.
AB - Most NLP applications assume that a particular language is homogeneous in the regions where it is spoken. However, each language varies considerably throughout its geographical distribution. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantitatively represents such geographical variation is necessary. However, some current approaches have disadvantages such as the need for parameters, the disregard of geographical coordinates in the analysis, and the use of linguistic alternations that presuppose the existence of specific dialectal varieties. Detection of “ecotones” is an analogous problem in the field of ecology that focuses on the identification of boundaries, instead of regions, in ecosystems, which facilitates the construction of statistical tests. We adapted the concept of “ecotone” to “dialectone” for the detection of dialectal boundaries using two non-parametric statistical tests: the Hilbert-Schmidt independence criterion (HSIC) and the Wilcoxon signed-rank test. The proposed method was applied to a large corpus of Spanish tweets produced in 160 locations in Colombia through the analysis of unigram features. The resulting dialectones proved to be meaningful but difficult to compare against regions identified by other authors using classical dialectometry. We concluded that the automatic detection of dialectones is a convenient alternative to classical methods in dialectometry and a potential source of information for automatic language applications.
KW - Corpus-based dialectometry
KW - Dialectometry
KW - Dialectone
KW - Ecotone
KW - Hilbert-Schmidt independence criterion
KW - Nonparametric method
KW - Wilcoxon signed-rank test
UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85069591779&origin=inward
UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=85069591779&origin=inward
U2 - 10.13053/CyS-22-4-3104
DO - 10.13053/CyS-22-4-3104
M3 - Article
AN - SCOPUS:85069591779
SN - 1405-5546
VL - 22
SP - 1213
EP - 1222
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 4
ER -