The winning approach to text alignment for text reuse detection at PAN 2014: Notebook for PAN at CLEF 2014

Miguel A. Sanchez-Perez; Grigori Sidorov; Alexander Gelbukh

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Conference article › peer-review

33 Citations (Scopus)

Abstract

The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask at the plagiarism detection competition of PAN 2014. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to keep stopwords without increasing the rate of false positives. We introduce a recursive algorithm to extend the matching sentences to maximal length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. By the cumulative measure (Plagdet), our approach outperforms the best-performing system of the PAN 2013 competition and resulted in the best-performing system at the PAN 2014 competition. Our system is publicly available in open-source form.
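
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how a tf-idf-like sentence similarity can keep stopwords without letting them drive matches. It assumes that each sentence is treated as a "document" when computing the inverse frequency and that similarity is measured by cosine; the function names `tokenize`, `weighted_vectors`, and `cosine` are hypothetical and are not taken from the authors' system.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and split into word tokens; stopwords are deliberately kept."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def weighted_vectors(sentences):
    """Build one sparse tf-idf-like vector per sentence.

    Assumption: the inverse frequency is computed over sentences, so a word
    that occurs in almost every sentence (typically a stopword) gets a weight
    close to zero instead of being removed by a stopword list.
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Number of sentences in which each word occurs at least once.
    sentence_freq = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n / sentence_freq[w]) for w in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under these assumptions, two sentences that share only very frequent words such as "the" receive a similarity of zero, while sentences that share rarer content words score high. A pair of sentences would then be considered a match when its score exceeds a chosen threshold, which is a tunable parameter not stated in the abstract.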

Original language	English
Pages (from-to)	1004-1011
Number of pages	8
Journal	CEUR Workshop Proceedings
Volume	1180
Status	Published - 2014
Event	2014 Cross Language Evaluation Forum Conference, CLEF 2014 - Sheffield, United Kingdom. Duration: 15 Sep 2014 → 18 Sep 2014

Cite this

@article{5760f25486e346199efc0aa2fb9dc2fa,

title = "The winning approach to text alignment for text reuse detection at PAN 2014: Notebook for PAN at CLEF 2014",

abstract = "The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask at the plagiarism detection competition of PAN 2014. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to keep stopwords without increasing the rate of false positives. We introduce a recursive algorithm to extend the matching sentences to maximal length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. By the cumulative measure (Plagdet), our approach outperforms the best-performing system of the PAN 2013 competition and resulted in the best-performing system at the PAN 2014 competition. Our system is publicly available in open-source form.",

author = "Sanchez-Perez, {Miguel A.} and Grigori Sidorov and Alexander Gelbukh",

year = "2014",

language = "English",

volume = "1180",

pages = "1004--1011",

journal = "CEUR Workshop Proceedings",

issn = "1613-0073",

publisher = "CEUR-WS",

note = "2014 Cross Language Evaluation Forum Conference, CLEF 2014 ; Conference date: 15-09-2014 Through 18-09-2014",

}

TY - JOUR

T1 - The winning approach to text alignment for text reuse detection at PAN 2014

T2 - 2014 Cross Language Evaluation Forum Conference, CLEF 2014

AU - Sanchez-Perez, Miguel A.

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

PY - 2014

Y1 - 2014

N2 - The task of (monolingual) text alignment consists in finding similar text fragments between two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask at the plagiarism detection competition of PAN 2014. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to keep stopwords without increasing the rate of false positives. We introduce a recursive algorithm to extend the matching sentences to maximal length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. By the cumulative measure (Plagdet), our approach outperforms the best-performing system of the PAN 2013 competition and resulted in the best-performing system at the PAN 2014 competition. Our system is publicly available in open-source form.

UR - http://www.scopus.com/inward/record.url?scp=84907510937&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84907510937

SN - 1613-0073

VL - 1180

SP - 1004

EP - 1011

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

Y2 - 15 September 2014 through 18 September 2014

ER -
