Dependency vs. constituent based syntactic N-grams in text similarity measures for paraphrase recognition

Hiram Calvo; Andrea Segura-Olivares; Alejandro García

doi:10.13053/CyS-18-3-2044

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques have been used to solve this problem. To measure word overlap, most works use n-grams; syntactic n-grams, however, have been scarcely explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. We present an in-depth study of the strengths and weaknesses of each approach, as well as a common error analysis section. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.
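
To make the distinction between traditional and syntactic dependency n-grams concrete, the following is a minimal illustrative sketch, not the authors' system. It assumes hand-written toy dependency parses (no parser is run), reads a dependency n-gram as a head-to-dependent path of n words, and uses Jaccard overlap as a stand-in similarity measure. The names `dependency_ngrams`, `linear_ngrams`, and `jaccard` are hypothetical helpers introduced only for this example.

```python
from typing import Dict, List, Set, Tuple

# A parse is a list of (word, head_index) pairs; head_index is -1 for the root.
# These structures are an assumption made for illustration only.
Parse = List[Tuple[str, int]]
NGram = Tuple[str, ...]


def linear_ngrams(parse: Parse, n: int) -> Set[NGram]:
    """Traditional n-grams: n consecutive words in surface order."""
    words = [w.lower() for w, _ in parse]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def dependency_ngrams(parse: Parse, n: int) -> Set[NGram]:
    """Dependency n-grams: every head-to-dependent path of exactly n words."""
    children: Dict[int, List[int]] = {i: [] for i in range(len(parse))}
    for i, (_, head) in enumerate(parse):
        if head >= 0:
            children[head].append(i)

    def paths_from(node: int, length: int) -> List[NGram]:
        word = parse[node][0].lower()
        if length == 1:
            return [(word,)]
        return [(word,) + rest
                for child in children[node]
                for rest in paths_from(child, length - 1)]

    return {p for i in range(len(parse)) for p in paths_from(i, n)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard overlap; a stand-in for the similarity measures in the paper."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hand-written toy parses (assumed, not produced by a parser).
# "The cat chased the mouse"
s1: Parse = [("The", 1), ("cat", 2), ("chased", -1), ("the", 4), ("mouse", 2)]
# "The cat quickly chased the mouse"
s2: Parse = [("The", 1), ("cat", 3), ("quickly", 3), ("chased", -1),
             ("the", 5), ("mouse", 3)]

linear_score = jaccard(linear_ngrams(s1, 2), linear_ngrams(s2, 2))
dependency_score = jaccard(dependency_ngrams(s1, 2), dependency_ngrams(s2, 2))
print(f"linear bigram overlap:     {linear_score:.2f}")
print(f"dependency bigram overlap: {dependency_score:.2f}")
```

In this toy pair, the inserted adverb breaks a surface bigram ("cat chased") but leaves the head-to-dependent arcs intact, so the dependency bigram overlap comes out higher than the linear one. Whether this pattern holds on real data is exactly what the paper evaluates; the sketch makes no claim about that.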

Original language	English
Pages (from-to)	517-554
Number of pages	38
Journal	Computacion y Sistemas
Volume	18
Issue number	3
DOI	https://doi.org/10.13053/CyS-18-3-2044
Publication status	Published - 1 Jul 2014

Cite this

@article{ec00e30835c345e79af03278e1ea40d4,
title = "Dependency vs. constituent based syntactic N-grams in text similarity measures for paraphrase recognition",
abstract = "Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques have been used to solve this problem. To measure word overlap, most works use n-grams; syntactic n-grams, however, have been scarcely explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. We present an in-depth study of the strengths and weaknesses of each approach, as well as a common error analysis section. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.",
keywords = "Constituent analysis, Dependency analysis, Microsoft Research paraphrase corpus, Paraphrase recognition, Similarity measures, Syntactic ngrams",
author = "Hiram Calvo and Andrea Segura-Olivares and Alejandro Garc{\'i}a",
year = "2014",
month = jul,
day = "1",
doi = "10.13053/CyS-18-3-2044",
language = "English",
volume = "18",
pages = "517--554",
journal = "Computacion y Sistemas",
issn = "1405-5546",
number = "3",
}

TY  - JOUR
T1  - Dependency vs. constituent based syntactic N-grams in text similarity measures for paraphrase recognition
AU  - Calvo, Hiram
AU  - Segura-Olivares, Andrea
AU  - García, Alejandro
PY  - 2014/7/1
Y1  - 2014/7/1
AB  - Paraphrase recognition consists of detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques have been used to solve this problem. To measure word overlap, most works use n-grams; syntactic n-grams, however, have been scarcely explored. We propose using syntactic dependency and constituent n-grams combined with common NLP techniques such as stemming, synonym detection, similarity measures, and linear combination, together with a similarity matrix built in turn from syntactic n-grams. We measure and compare the performance of our system using the Microsoft Research Paraphrase Corpus. We present an in-depth study of the strengths and weaknesses of each approach, as well as a common error analysis section. Our main motivation was to determine which syntactic approach performs better for this task: syntactic dependency n-grams or syntactic constituent n-grams. We also compare both approaches with traditional n-grams and state-of-the-art systems.
KW  - Constituent analysis
KW  - Dependency analysis
KW  - Microsoft Research paraphrase corpus
KW  - Paraphrase recognition
KW  - Similarity measures
KW  - Syntactic ngrams
UR  - http://www.scopus.com/inward/record.url?scp=84907507271&partnerID=8YFLogxK
U2  - 10.13053/CyS-18-3-2044
DO  - 10.13053/CyS-18-3-2044
M3  - Article
SN  - 1405-5546
VL  - 18
SP  - 517
EP  - 554
JO  - Computacion y Sistemas
JF  - Computacion y Sistemas
IS  - 3
ER  - 
