Baselines for natural language processing tasks based on soft cardinality spectra

Sergio Jimenez, Alexander Gelbukh

Research output: Contribution to journal, Article, peer-reviewed

7 Citations (Scopus)

Abstract

Soft-cardinality spectra (SC spectra) is a new linear-time approximation method for text strings that divides them into character q-grams of different sizes. The method allows weighting at the term and q-gram levels simultaneously. Combined with resemblance coefficients, SC spectra yield a family of text similarity functions that use only the surface information of the texts and weights obtained from the same text collection. These similarity measures can serve as baselines in various natural language processing tasks for methods that exploit hidden syntactic and/or semantic structure using knowledge-based resources or inference from large corpora. The proposed method was evaluated on 30 data sets covering information retrieval, entity matching, and paraphrase and textual entailment recognition. The results came close to the best published results on those data sets. We claim that any method that uses a resource or information external to a particular data set should outperform our method. We found that our method is an effective and challenging baseline for the evaluated tasks.
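
As a rough illustration of the general idea of comparing texts through fixed-length character substrings of several sizes combined with a resemblance coefficient, the sketch below may help. It is not the authors' SC spectra formulation: it assumes plain set cardinality in place of soft cardinality, omits term-level and gram-level weighting, uses the Dice coefficient as one example of a resemblance coefficient, and averages over an arbitrarily chosen set of sizes. The function names and the default sizes are hypothetical.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of `text` (lowercased)."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice(a, b):
    """Dice resemblance coefficient between two sets.

    Two empty sets are treated as identical (assumption for this sketch).
    """
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def multi_size_similarity(x, y, sizes=(2, 3, 4)):
    """Average the Dice coefficient over several n-gram sizes.

    Simplified stand-in: classical set cardinality, no weighting.
    The sizes (2, 3, 4) are an arbitrary illustrative choice.
    """
    scores = [dice(char_ngrams(x, n), char_ngrams(y, n)) for n in sizes]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(multi_size_similarity("soft cardinality", "soft-cardinality"))
```

Like the similarity functions described in the abstract, this sketch uses only surface information from the two strings and no external resource.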

Original language: English
Pages (from-to): 180-199
Number of pages: 20
Journal: Applied and Computational Mathematics
Volume: 11
Issue number: 2
Status: Published - 2012
