Soft similarity and soft cosine measure: Similarity of features in vector space model

Grigori Sidorov, Alexander Gelbukh, Helena Gómez-Adorno, David Pinto

Research output: Contribution to journal › Article

137 Citations (Scopus)

Abstract

We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words "play" and "game" are different but related. When there is no similarity between features then our soft similarity measure is equal to the standard similarity. For this, we generalize the well-known cosine similarity measure in VSM by introducing what we call "soft cosine measure". We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.
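The soft cosine measure generalizes the ordinary cosine by weighting every pair of features (i, j) with a similarity s_ij: soft_cos(a, b) = Σ_ij s_ij a_i b_j / (√(Σ_ij s_ij a_i a_j) · √(Σ_ij s_ij b_i b_j)); with s_ij = 1 for i = j and 0 otherwise, it reduces to the standard cosine. The sketch below is a minimal Python illustration of this definition, assuming a character-level Levenshtein similarity as s_ij (one of the options mentioned in the abstract); the function names and the normalization of the edit distance are illustrative assumptions, not the authors' implementation.

    import math

    def levenshtein(a: str, b: str) -> int:
        """Classic edit distance between two strings (dynamic programming)."""
        if not a:
            return len(b)
        if not b:
            return len(a)
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                    # deletion
                               cur[j - 1] + 1,                 # insertion
                               prev[j - 1] + (ca != cb)))      # substitution
            prev = cur
        return prev[-1]

    def feature_similarity(f_i: str, f_j: str) -> float:
        """Assumed similarity s_ij in [0, 1], derived from character-level
        Levenshtein distance between the string forms of two features."""
        if f_i == f_j:
            return 1.0
        d = levenshtein(f_i, f_j)
        return 1.0 - d / max(len(f_i), len(f_j))

    def soft_cosine(a: dict, b: dict) -> float:
        """soft_cos(a, b) = sum_ij s_ij a_i b_j /
        (sqrt(sum_ij s_ij a_i a_j) * sqrt(sum_ij s_ij b_i b_j)).
        With s_ij = 1 only when i == j, this is the ordinary cosine."""
        feats = sorted(set(a) | set(b))

        def soft_dot(x, y):
            return sum(feature_similarity(fi, fj) * x.get(fi, 0.0) * y.get(fj, 0.0)
                       for fi in feats for fj in feats)

        denom = math.sqrt(soft_dot(a, a)) * math.sqrt(soft_dot(b, b))
        return soft_dot(a, b) / denom if denom else 0.0

    # Example: "play" and "played" are distinct features but partially similar,
    # so the soft cosine is positive where the standard cosine would be 0.
    print(soft_cosine({"play": 1.0}, {"played": 1.0}))  # ~0.667

The same idea extends to n-gram features by measuring the Levenshtein distance over the elements of the n-grams instead of over characters, as described in the abstract.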
Original language: American English
Pages (from-to): 491-504
Number of pages: 14
Journal: Computacion y Sistemas
DOIs: 10.13053/CyS-18-3-2043
State: Published - 1 Jan 2014


Cite this

@article{c1961241569646d78af29d1aaf87a6a8,
title = "Soft similarity and soft cosine measure: Similarity of features in vector space model",
abstract = "We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words {"}play{"} and {"}game{"} are different but related. When there is no similarity between features then our soft similarity measure is equal to the standard similarity. For this, we generalize the well-known cosine similarity measure in VSM by introducing what we call {"}soft cosine measure{"}. We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.",
author = "Grigori Sidorov and Alexander Gelbukh and Helena G{\'o}mez-Adorno and David Pinto",
year = "2014",
month = "1",
day = "1",
doi = "10.13053/CyS-18-3-2043",
language = "American English",
pages = "491--504",
journal = "Computacion y Sistemas",
issn = "1405-5546",
publisher = "Centro de Investigacion en Computacion (CIC) del Instituto Politecnico Nacional (IPN)",

}

Soft similarity and soft cosine measure: Similarity of features in vector space model. / Sidorov, Grigori; Gelbukh, Alexander; Gómez-Adorno, Helena; Pinto, David.

In: Computacion y Sistemas, 01.01.2014, p. 491-504.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Soft similarity and soft cosine measure: Similarity of features in vector space model

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

AU - Gómez-Adorno, Helena

AU - Pinto, David

PY - 2014/1/1

Y1 - 2014/1/1

N2 - We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words "play" and "game" are different but related. When there is no similarity between features then our soft similarity measure is equal to the standard similarity. For this, we generalize the well-known cosine similarity measure in VSM by introducing what we call "soft cosine measure". We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.

AB - We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words "play" and "game" are different but related. When there is no similarity between features then our soft similarity measure is equal to the standard similarity. For this, we generalize the well-known cosine similarity measure in VSM by introducing what we call "soft cosine measure". We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.

UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84907532699&origin=inward

UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=84907532699&origin=inward

U2 - 10.13053/CyS-18-3-2043

DO - 10.13053/CyS-18-3-2043

M3 - Article

SP - 491

EP - 504

JO - Computacion y Sistemas

JF - Computacion y Sistemas

SN - 1405-5546

ER -