SC spectra: A linear-time soft cardinality approximation for text comparison

Sergio Jiménez Vargas, Alexander Gelbukh

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

Soft cardinality (SC) is a softened version of the classical cardinality of set theory. Because computing it exactly is prohibitively expensive (exponential order), an approximation that is quadratic in the number of terms in the text was proposed previously. SC spectra is a new linear-time approximation method for text strings that divides each string into consecutive substrings (i.e., q-grams) of different sizes. SC, combined with resemblance coefficients, allows the construction of a family of similarity functions for text comparison. These similarity measures were previously applied to an entity resolution (name matching) problem, where they outperformed the SoftTFIDF measure. The SC spectra method improves on those results, taking less time and achieving better performance. This allows the new method to be used with relatively large documents, such as those included in classic information retrieval collections. The SC spectra method outperformed the SoftTFIDF and cosine tf-idf baselines with an approach that requires no term weighting.
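
The two ingredients named in the abstract, splitting strings into q-grams of several sizes and applying a resemblance coefficient, can be illustrated with a minimal sketch. This is not the SC spectra method itself: it uses the classical set cardinality as a stand-in for soft cardinality, picks the Jaccard coefficient as one example of a resemblance coefficient, and averages over q sizes, all of which are assumptions made only for illustration. The function names `qgrams`, `jaccard`, and `spectrum_similarity` are hypothetical.

```python
def qgrams(text, q):
    """Return the set of consecutive substrings of length q in text."""
    if len(text) < q:
        return set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(a, b):
    """Jaccard resemblance coefficient using classical cardinality.

    Assumption: soft cardinality would replace len() in the real method.
    Two empty sets are treated as identical.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def spectrum_similarity(s, t, q_sizes=(2, 3)):
    """Average the coefficient over several q-gram sizes.

    Assumption: plain averaging is an illustrative choice, not the
    combination rule used by the authors.
    """
    scores = [jaccard(qgrams(s, q), qgrams(t, q)) for q in q_sizes]
    return sum(scores) / len(scores)
```

For example, `spectrum_similarity("abcd", "abce", q_sizes=(2,))` compares the bigram sets {ab, bc, cd} and {ab, bc, ce}, which share two of four distinct bigrams.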

Original language: English
Title of host publication: Advances in Soft Computing - 10th Mexican International Conference on Artificial Intelligence, MICAI 2011, Proceedings
Pages: 213-224
Number of pages: 12
Edition: PART 2
State: Published - 2011
Event: 10th Mexican International Conference on Artificial Intelligence, MICAI 2011 - Puebla, Mexico
Duration: 26 Nov 2011 to 4 Dec 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 7095 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 10th Mexican International Conference on Artificial Intelligence, MICAI 2011
Country/Territory: Mexico
City: Puebla
Period: 26/11/11 to 4/12/11

Keywords

  • approximate text comparison
  • ngrams
  • q-grams
  • soft cardinality
  • soft cardinality spectra
