Hierarchical clustering analysis: The best-performing approach at PAN 2017 author clustering task

Helena Gómez-Adorno, Carolina Martín-Del-Campo-Rodríguez, Grigori Sidorov, Yuridiana Alemán, Darnes Vilariño, David Pinto

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

The author clustering problem consists of grouping documents written by the same author so that each group corresponds to a different author. We describe our approach to the author clustering task at PAN 2017, which was the best-performing system in that task. Our method performs a hierarchical clustering analysis using document features such as typed and untyped character n-grams, word n-grams, and stylometric features. We experimented with two feature representation methods, the log-entropy model and TF-IDF, while tuning minimum frequency threshold values to reduce the feature dimensionality. We identified the optimal number of clusters (authors) dynamically for each collection using the Caliński-Harabasz score. The implementation of our system is available as open source (https://github.com/helenpy/clusterPAN2017).
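
The pipeline outlined in the abstract (frequency-thresholded n-gram features, log-entropy weighting, hierarchical clustering, and choosing the number of clusters by the Caliński-Harabasz score) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation, which lives in the repository linked above. All function names are hypothetical. The abstract does not state the linkage criterion or distance measure, so average linkage and Euclidean distance are assumptions. The log-entropy weights follow the common textbook definition (local weight log(1 + tf), global weight 1 + sum_j p_ij log p_ij / log N), which may differ from the variant used in the paper. Only untyped character n-grams are covered.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count untyped character n-grams in one document."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def log_entropy_vectors(docs, n=3, min_freq=2):
    """One log-entropy weighted vector per document (needs >= 2 documents)."""
    counts = [char_ngrams(d, n) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    # Minimum frequency threshold: drop rare n-grams to cut dimensionality.
    vocab = sorted(t for t, f in total.items() if f >= min_freq)
    log_n = math.log(len(docs))
    weights = {}
    for t in vocab:
        # Global weight: 1 + sum_j p_ij * log(p_ij) / log(N)
        ent = sum((c[t] / total[t]) * math.log(c[t] / total[t])
                  for c in counts if c[t])
        weights[t] = 1.0 + ent / log_n
    # Local weight: log(1 + tf_ij)
    return [[weights[t] * math.log(1 + c[t]) for t in vocab] for c in counts]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(vectors):
    """Yield label lists for n-1 down to 2 clusters (assumed: average linkage)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 2:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(math.sqrt(sq_dist(vectors[i], vectors[j]))
                        for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
        labels = [0] * len(vectors)
        for k, members in enumerate(clusters):
            for i in members:
                labels[i] = k
        yield labels

def calinski_harabasz(vectors, labels):
    """Ratio of between- to within-cluster dispersion, each scaled by its d.o.f."""
    n, k, dim = len(vectors), len(set(labels)), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    between = within = 0.0
    for lab in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == lab]
        centroid = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        between += len(members) * sq_dist(centroid, mean)
        within += sum(sq_dist(v, centroid) for v in members)
    if within == 0:  # choice made for this sketch: treat perfect clusters as best
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def best_partition(vectors):
    """Pick the candidate partition with the highest Caliński-Harabasz score."""
    return max(agglomerate(vectors), key=lambda lab: calinski_harabasz(vectors, lab))
```

In this sketch, `best_partition(log_entropy_vectors(docs))` would return one cluster label per document for a collection of at least three documents. The naive merge loop is cubic in the number of documents, which is acceptable only for small collections; a library implementation would be the practical choice for anything larger.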

Original language: English
Title of host publication: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF Association, CLEF 2018, Proceedings
Editors: Eric SanJuan, Fionn Murtagh, Jian Yun Nie, Laure Soulier, Linda Cappellato, Patrice Bellot, Josiane Mothe, Chiraz Trabelsi, Nicola Ferro
Publisher: Springer Verlag
Pages: 216-223
Number of pages: 8
ISBN (Print): 9783319989310
State: Published - 2018
Event: 9th International Conference of the CLEF Association, CLEF 2018 - Avignon, France
Duration: 10 Sep 2018 - 14 Sep 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11018 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th International Conference of the CLEF Association, CLEF 2018
Country/Territory: France
City: Avignon
Period: 10/09/18 - 14/09/18

Keywords

  • Author clustering
  • Authorship-link ranking
  • Hierarchical clustering
