TY - GEN
T1 - Simple TF·IDF is not the best you can get for regionalism classification
AU - Calvo, Hiram
PY - 2014
Y1 - 2014
AB - In widely spoken languages such as English or Spanish, some words are characteristic of a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words particular to a region involves discriminating them from the set of words common to all regions. This raises the problem that a term's frequency should be salient enough for the term to be considered important, while being a common term dampens this salience. This is the known problem of Term Frequency versus Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we empirically propose several alternative formulae and conclude that a broader search space must be explored; therefore, we propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variants of the Americas and Spain.
KW - Bootstrapping
KW - Genetic Programming
KW - Regionalisms
KW - TF·IDF
UR - http://www.scopus.com/inward/record.url?scp=84958528783&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-54906-9_8
DO - 10.1007/978-3-642-54906-9_8
M3 - Conference contribution
SN - 9783642549052
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 92
EP - 101
BT - Computational Linguistics and Intelligent Text Processing - 15th International Conference, CICLing 2014, Proceedings
PB - Springer Verlag
T2 - 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014
Y2 - 6 April 2014 through 12 April 2014
ER -