Hierarchical clustering analysis: The best-performing approach at PAN 2017 author clustering task

Helena Gómez-Adorno, Carolina Martín-Del-Campo-Rodríguez, Grigori Sidorov, Yuridiana Alemán, Darnes Vilariño, David Pinto

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

10 Citations (Scopus)

Abstract

The author clustering problem consists of grouping documents written by the same author so that each group corresponds to a different author. We describe our approach to the author clustering task at PAN 2017, which was the best-performing system in that task. Our method performs a hierarchical clustering analysis using document features such as typed and untyped character n-grams, word n-grams, and stylometric features. We experimented with two feature representation methods, the log-entropy model and TF-IDF, while tuning minimum frequency threshold values to reduce the feature dimensionality. We identified the optimal number of clusters (authors) dynamically for each collection using the Caliński-Harabasz score. The implementation of our system is available as open source (https://github.com/helenpy/clusterPAN2017).
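The pipeline outlined in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation (that lives in the repository linked above). It assumes untyped character trigrams only, the standard log-entropy weighting formula, Euclidean distance, and average linkage, none of which the abstract specifies. All function names here are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count untyped character n-grams in one document."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def log_entropy_matrix(docs, n=3, min_freq=2):
    """Dense document-by-feature matrix with log-entropy weights (needs >= 2 docs)."""
    counts = [char_ngrams(d, n) for d in docs]
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)
    # Minimum frequency threshold: drop rare n-grams to reduce dimensionality.
    vocab = sorted(t for t, gf in global_freq.items() if gf >= min_freq)
    matrix = [[0.0] * len(vocab) for _ in docs]
    for j, term in enumerate(vocab):
        gf = global_freq[term]
        # Global weight: 1 + sum_d p * log(p) / log(N), with p = tf / gf.
        entropy = sum((c[term] / gf) * math.log(c[term] / gf)
                      for c in counts if c[term] > 0)
        g = 1.0 + entropy / math.log(len(docs))
        for i, c in enumerate(counts):
            matrix[i][j] = g * math.log(1.0 + c[term])  # local weight: log(1 + tf)
    return matrix

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(X, k):
    """Average-linkage agglomerative clustering (assumed linkage); returns labels."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(math.sqrt(sq_dist(X[i], X[j]))
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(X)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def calinski_harabasz(X, labels):
    """Between/within dispersion ratio; requires 2 <= k < n clusters."""
    n, k, dim = len(X), len(set(labels)), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(dim)]
    between = within = 0.0
    for label in set(labels):
        members = [X[i] for i in range(n) if labels[i] == label]
        centroid = [sum(r[j] for r in members) / len(members) for j in range(dim)]
        between += len(members) * sq_dist(centroid, mean)
        within += sum(sq_dist(r, centroid) for r in members)
    if within == 0.0:
        return float("inf")  # degenerate case: every cluster is a single point
    return (between / (k - 1)) / (within / (n - k))

def cluster_documents(docs, n=3, min_freq=2):
    """Try every k from 2 to len(docs) - 1 and keep the best-scoring labeling."""
    X = log_entropy_matrix(docs, n, min_freq)
    best_score, best_labels = float("-inf"), None
    for k in range(2, len(docs)):  # needs at least 3 documents
        labels = agglomerate(X, k)
        score = calinski_harabasz(X, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```

Calling `cluster_documents(list_of_texts)` returns one integer label per document. Under log-entropy weighting, an n-gram spread evenly over all documents gets a global weight of 0, and one concentrated in a single document gets a weight of 1. For real collections a library implementation of the clustering and scoring steps would likely be preferable to these quadratic loops.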

Original language: English
Title of host publication: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF Association, CLEF 2018, Proceedings
Editors: Eric SanJuan, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Linda Cappellato, Patrice Bellot, Josiane Mothe, Chiraz Trabelsi, Nicola Ferro
Publisher: Springer Verlag
Pages: 216-223
Number of pages: 8
ISBN (print): 9783319989310
Status: Published - 2018
Event: 9th International Conference of the CLEF Association, CLEF 2018 - Avignon, France
Duration: 10 Sep 2018 to 14 Sep 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11018 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 9th International Conference of the CLEF Association, CLEF 2018
Country/Territory: France
City: Avignon
Period: 10/09/18 to 14/09/18
