Selection of Representative Documents for Clusters in a Document Collection

Alexander Gelbukh, Mikhail Alexandrov, Ales Bourek, Pavel Makagonov

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

An efficient way to explore a large document collection (e.g., the results returned by a search engine) is to subdivide it into clusters of relatively similar documents, so that the user can get a general view of the collection and select the parts of particular interest. One way of presenting the clusters to the user is to select a single document from each cluster. For different purposes this can be done in different ways. We consider three cases: selection of the average, the “most typical,” and the “least typical” document. We give algorithms that rely on a dictionary of keywords reflecting the topic of the user's interest. After clustering, we select a document in each cluster based on its closeness to the other documents in that cluster. Different distance measures are discussed, and preliminary experimental results are presented. Our approach was implemented in the new version of the Document Classifier system.
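
To make the three selection cases concrete, the following is a minimal sketch only, not the algorithms from the paper, which the abstract does not spell out. It assumes each document in a cluster is already represented as a vector of keyword weights over the dictionary, that cosine distance is an acceptable distance measure, that the "average" document is the one closest to the cluster centroid, and that the "most typical" and "least typical" documents are those with the smallest and largest mean distance to the other cluster members. The names `cosine_distance` and `select_representatives` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two keyword-weight vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def select_representatives(vectors, distance=cosine_distance):
    """Return indices of the average, most typical, and least typical documents.

    Assumed definitions (not taken from the paper):
      average       -> document closest to the cluster centroid
      most_typical  -> document with the smallest mean distance to the others
      least_typical -> document with the largest mean distance to the others
    """
    n = len(vectors)
    if n == 0:
        raise ValueError("cluster is empty")
    dim = len(vectors[0])

    # Centroid of the cluster in keyword space.
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    to_centroid = [distance(v, centroid) for v in vectors]

    # Mean distance from each document to every other document in the cluster.
    mean_to_others = [
        sum(distance(vectors[i], vectors[j]) for j in range(n) if j != i)
        / max(n - 1, 1)
        for i in range(n)
    ]

    indices = range(n)
    return {
        "average": min(indices, key=lambda i: to_centroid[i]),
        "most_typical": min(indices, key=lambda i: mean_to_others[i]),
        "least_typical": max(indices, key=lambda i: mean_to_others[i]),
    }


# Toy cluster: three similar documents and one outlier, over three keywords.
cluster = [[3, 0, 0], [3, 1, 0], [2, 1, 0], [0, 0, 5]]
picks = select_representatives(cluster)
```

Passing a different `distance` function lets the same sketch be tried with other distance measures; the choice of measure changes which documents are picked.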

Original language: English
Title of host publication: Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003
Editors: Antje Düsterhöft, Bernhard Thalheim
Publisher: Gesellschaft für Informatik (GI)
Pages: 120-126
Number of pages: 7
ISBN (Electronic): 388579358X
State: Published - 2003
Event: 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003 - Burg, Germany
Duration: 23 Jun 2003 to 25 Jun 2003

Publication series

Name: Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)
Volume: P-29
ISSN (Print): 1617-5468

Conference

Conference: 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003
Country/Territory: Germany
City: Burg
Period: 23/06/03 to 25/06/03
