Simple TF·IDF is not the best you can get for regionalism classification

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

4 Citations (Scopus)

Abstract

In widely spoken languages such as English or Spanish, some words are particular to a region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words characteristic of a region involves discriminating them from the set of words common to all regions. This raises a tension: a term's frequency should be salient enough for the term to be considered important, while being common to all regions dampens this salience. This is the well-known trade-off between Term Frequency and Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we first propose several alternative formulae empirically and conclude that a broader search space must be explored; we therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms, given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.
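As a rough illustration of the plain TF·IDF baseline the abstract argues against, the sketch below treats each region's text as one "document" and scores every term by relative frequency times the log inverse of the number of regions containing it. This is not the authors' implementation: the function name `tfidf_by_region`, the toy sentences, and the exact TF and IDF definitions are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_by_region(corpora):
    """Score terms per region with plain TF*IDF (hypothetical helper).

    corpora: dict mapping region name -> list of tokens.
    Returns: dict mapping region name -> {term: score}.
    """
    n_regions = len(corpora)
    counts = {region: Counter(tokens) for region, tokens in corpora.items()}

    # Document frequency: number of regions in which each term occurs.
    df = Counter()
    for region_counts in counts.values():
        df.update(region_counts.keys())

    scores = {}
    for region, region_counts in counts.items():
        total = sum(region_counts.values())
        scores[region] = {
            # TF is the relative frequency within the region;
            # IDF is log(N / df), which is 0 for terms found in every region.
            term: (n / total) * math.log(n_regions / df[term])
            for term, n in region_counts.items()
        }
    return scores

# Toy data, invented for illustration only.
corpora = {
    "UK": "put the kettle on the cooker".split(),
    "US": "put the kettle on the stove".split(),
}
scores = tfidf_by_region(corpora)
```

With this toy input, words shared by both regions (such as "the") score zero, while "cooker" and "stove" receive positive scores in their respective regions. The abstract's point is that this fixed product of TF and IDF is only one of many possible combinations, which motivates searching for other expressions.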

Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 15th International Conference, CICLing 2014, Proceedings
Publisher: Springer Verlag
Pages: 92-101
Number of pages: 10
Edition: PART 1
ISBN (Print): 9783642549052
Status: Published - 2014
Event: 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014 - Kathmandu, Nepal
Duration: 6 Apr 2014 to 12 Apr 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 8403 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014
Country/Territory: Nepal
City: Kathmandu
Period: 6/04/14 to 12/04/14
