Simple TF·IDF is not the best you can get for regionalism classification

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In widely spoken languages such as English or Spanish, some words are characteristic of a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words particular to a region involves discriminating them from the set of words common to all regions. This raises a tension: a term's frequency should be salient enough for the term to be considered important, while being common across regions dampens that salience. This is the well-known trade-off between Term Frequency and Inverse Document Frequency; nevertheless, typical TF·IDF applications do not include weighting factors. In this work we empirically propose several alternative formulae and conclude that a broader search space must be explored; we therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms, given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.
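To make the idea of weighting factors concrete, the following is a minimal sketch, not the formulae from the paper. It assumes each region is treated as one "document", and it introduces two hypothetical exponents, `alpha` and `beta`, that weight TF and IDF respectively; the function name `weighted_tfidf` and the toy corpus are likewise illustrative assumptions.

```python
import math
from collections import Counter


def weighted_tfidf(region_tokens, alpha=1.0, beta=1.0):
    """Score each term per region as tf**alpha * idf**beta.

    region_tokens: dict mapping a region name to its list of tokens.
    alpha, beta: hypothetical weighting exponents (an assumption for
    illustration; beta is expected to be positive).
    """
    n_regions = len(region_tokens)
    counts = {region: Counter(tokens) for region, tokens in region_tokens.items()}

    # Document frequency: in how many regions does each term occur?
    df = Counter()
    for region_counts in counts.values():
        df.update(region_counts.keys())

    scores = {}
    for region, region_counts in counts.items():
        total = sum(region_counts.values())
        scores[region] = {}
        for term, n in region_counts.items():
            tf = n / total                        # relative frequency in the region
            idf = math.log(n_regions / df[term])  # 0 when the term occurs everywhere
            scores[region][term] = (tf ** alpha) * (idf ** beta)
    return scores


# Toy corpus (invented for illustration only).
corpus = {
    "UK": ["cooker", "kettle", "cooker", "tea"],
    "US": ["stove", "kettle", "stove", "coffee"],
}
scores = weighted_tfidf(corpus)
top_uk = max(scores["UK"], key=scores["UK"].get)
top_us = max(scores["US"], key=scores["US"].get)
```

With the default exponents this reduces to plain TF·IDF: a term shared by every region, such as kettle here, gets an IDF of zero and therefore a score of zero. A fixed form like this covers only one small family of expressions; the abstract's proposal is to search a broader space of TF and IDF combinations with Genetic Programming rather than tune two exponents by hand.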

Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 15th International Conference, CICLing 2014, Proceedings
Publisher: Springer Verlag
Pages: 92-101
Number of pages: 10
Edition: PART 1
ISBN (Print): 9783642549052
State: Published - 2014
Event: 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014 - Kathmandu, Nepal
Duration: 6 Apr 2014 to 12 Apr 2014

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 8403 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 15th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2014
Country/Territory: Nepal
City: Kathmandu
Period: 6/04/14 to 12/04/14

Keywords

  • Bootstrapping
  • Genetic Programming
  • Regionalisms
  • TF·IDF
