TY - GEN
T1 - Hierarchical clustering analysis
T2 - 9th International Conference of the CLEF Association, CLEF 2018
AU - Gómez-Adorno, Helena
AU - Martín-Del-Campo-Rodríguez, Carolina
AU - Sidorov, Grigori
AU - Alemán, Yuridiana
AU - Vilariño, Darnes
AU - Pinto, David
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2018.
PY - 2018
Y1 - 2018
AB - The author clustering problem consists of grouping documents written by the same author so that each group corresponds to a different author. We describe our approach to the author clustering task at PAN 2017, which was the best-performing system in that task. Our method performs a hierarchical clustering analysis using document features such as typed and untyped character n-grams, word n-grams, and stylometric features. We experimented with two feature representation methods, the log-entropy model and TF-IDF, while tuning minimum frequency threshold values to reduce the feature dimensionality. We identified the optimal number of clusters (authors) dynamically for each collection using the Caliński-Harabasz score. The implementation of our system is available as open source (https://github.com/helenpy/clusterPAN2017).
KW - Author clustering
KW - Authorship-link ranking
KW - Hierarchical clustering
UR - http://www.scopus.com/inward/record.url?scp=85052823564&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-98932-7_20
DO - 10.1007/978-3-319-98932-7_20
M3 - Conference contribution
SN - 9783319989310
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 216
EP - 223
BT - Experimental IR Meets Multilinguality, Multimodality, and Interaction - 9th International Conference of the CLEF Association, CLEF 2018, Proceedings
A2 - SanJuan, Eric
A2 - Murtagh, Fionn
A2 - Nie, Jian Yun
A2 - Soulier, Laure
A2 - Cappellato, Linda
A2 - Bellot, Patrice
A2 - Mothe, Josiane
A2 - Trabelsi, Chiraz
A2 - Ferro, Nicola
PB - Springer Verlag
Y2 - 10 September 2018 through 14 September 2018
ER -