The winning approach to text alignment for text reuse detection at PAN 2014: Notebook for PAN at CLEF 2014

Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh

Research output: Contribution to journal, Conference article, peer-reviewed

33 Citations (Scopus)

Abstract

The task of (monolingual) text alignment consists of finding similar text fragments in two given documents. It has applications in plagiarism detection, detection of text reuse, author identification, authoring aid, and information retrieval, to mention only a few. We describe our approach to the text alignment subtask at the plagiarism detection competition of PAN 2014. Our method relies on a sentence similarity measure based on a tf-idf-like weighting scheme that permits us to keep stopwords without increasing the rate of false positives. We introduce a recursive algorithm to extend the matching sentences to maximal-length passages. We also introduce a novel filtering method to resolve overlapping plagiarism cases. By the cumulative measure (Plagdet), our approach outperforms the best-performing system of the PAN 2013 competition and was the best-performing system at the PAN 2014 competition. Our system is publicly available in open-source form.
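The sentence similarity step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each sentence is treated as a "document" for the idf part of the weighting, that similarity is measured with the cosine, and that a pair matches above a fixed threshold. The function names and the threshold value of 0.3 are hypothetical. The sketch shows why stopwords can be kept: a term that occurs in every sentence receives a weight of zero.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercase word tokens; stopwords are deliberately kept.
    return re.findall(r"\w+", sentence.lower())


def sentence_vectors(sentences):
    # Assumed weighting: tf * log(N / sf), where sf is the number of
    # sentences that contain the term (an idf computed over sentences).
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    sf = Counter(term for bag in bags for term in bag)
    return [
        {term: tf * math.log(n / sf[term]) for term, tf in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_pairs(suspicious, source, threshold=0.3):
    # Sentence frequencies are computed over both documents together
    # (an assumption of this sketch). The threshold is hypothetical.
    vectors = sentence_vectors(suspicious + source)
    susp_vecs = vectors[:len(suspicious)]
    src_vecs = vectors[len(suspicious):]
    return [
        (i, j)
        for i, u in enumerate(susp_vecs)
        for j, v in enumerate(src_vecs)
        if cosine(u, v) >= threshold
    ]
```

Calling `matching_pairs` with two lists of sentences returns index pairs (suspicious sentence, source sentence) that would serve as seeds for a later passage-extension step. The extension and overlap-filtering steps are not sketched here because the abstract does not describe them in detail.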

Original language: English
Pages (from-to): 1004-1011
Number of pages: 8
Publication: CEUR Workshop Proceedings
Volume: 1180
Status: Published - 2014
Event: 2014 Cross Language Evaluation Forum Conference, CLEF 2014 - Sheffield, United Kingdom
Duration: 15 Sep 2014 to 18 Sep 2014
