Selection of Representative Documents for Clusters in a Document Collection

Alexander Gelbukh; Mikhail Alexandrov; Ales Bourek; Pavel Makagonov

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

7 Scopus citations

Abstract

An efficient way to explore a large document collection (e.g., the results returned by a search engine) is to subdivide it into clusters of relatively similar documents, which gives a general view of the collection and lets the user select the parts of particular interest. One way to present the clusters to the user is to select a representative document in each cluster. This can be done in different ways for different purposes. We consider three cases: selection of the average, the “most typical,” and the “least typical” document. We give algorithms that rely on a dictionary of keywords reflecting the topic of the user's interest. After clustering, we select a document in each cluster based on its closeness to the other documents. Different distance measures are discussed, and preliminary experimental results are presented. Our approach was implemented in the new version of the Document Classifier system.
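
The abstract does not spell out the algorithms, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each document is represented by counts of dictionary keywords, uses cosine distance as one possible distance measure, and interprets the three cases as follows: the "average" document is the one closest to the cluster centroid, the "most typical" document has the smallest total distance to the other documents in the cluster, and the "least typical" document has the largest. The function names `keyword_vector`, `cosine_distance`, and `select_representatives` are hypothetical.

```python
import math
from collections import Counter


def keyword_vector(text, dictionary):
    """Count how often each dictionary keyword occurs in the text.

    Assumption: whitespace tokenization and lowercasing are sufficient.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_representatives(vectors, distance=cosine_distance):
    """Return indices of three representative documents in one cluster.

    The three definitions below are assumptions made for illustration.
    """
    n = len(vectors)
    dim = len(vectors[0])
    # Component-wise mean of all document vectors in the cluster.
    centroid = [sum(v[k] for v in vectors) / n for k in range(dim)]
    # Total distance from each document to every other document.
    total = [
        sum(distance(vectors[i], vectors[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    return {
        "average": min(range(n), key=lambda i: distance(vectors[i], centroid)),
        "most_typical": min(range(n), key=lambda i: total[i]),
        "least_typical": max(range(n), key=lambda i: total[i]),
    }


# Hypothetical keyword dictionary and a single toy cluster.
dictionary = ["cluster", "document", "keyword"]
cluster = [
    "cluster document cluster",
    "document cluster keyword",
    "keyword keyword keyword",
]
vectors = [keyword_vector(text, dictionary) for text in cluster]
representatives = select_representatives(vectors)
```

A different distance measure can be passed through the `distance` parameter, which is one way to compare how the choice of measure changes the selected documents.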

Original language	English
Title of host publication	Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003
Editors	Antje Dusterhoft, Bernhard Thalheim
Publisher	Gesellschaft fur Informatik (GI)
Pages	120-126
Number of pages	7
ISBN (Electronic)	388579358X
State	Published - 2003
Event	8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003 - Burg, Germany; Duration: 23 Jun 2003 → 25 Jun 2003

Publication series

Name	Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)
Volume	P-29
ISSN (Print)	1617-5468

Conference

Conference	8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003
Country/Territory	Germany
City	Burg
Period	23/06/03 → 25/06/03

Cite this

Gelbukh, A., Alexandrov, M., Bourek, A., & Makagonov, P. (2003). Selection of Representative Documents for Clusters in a Document Collection. In A. Dusterhoft, & B. Thalheim (Eds.), Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003 (pp. 120-126). (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI); Vol. P-29). Gesellschaft fur Informatik (GI).

Gelbukh, Alexander ; Alexandrov, Mikhail ; Bourek, Ales et al. / Selection of Representative Documents for Clusters in a Document Collection. Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003. editor / Antje Dusterhoft ; Bernhard Thalheim. Gesellschaft fur Informatik (GI), 2003. pp. 120-126 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)).

@inproceedings{abc9dc04b424460f8e9f5c9431a839b5,

title = "Selection of Representative Documents for Clusters in a Document Collection",

abstract = "An efficient way to explore a large document collection (e.g., the search results returned by a search engine) is to subdivide it into clusters of relatively similar documents, to get a general view of the collection and select its parts of particular interest. A way of presenting the clusters to the user is selection of a document in each cluster. For different purposes this can be done in different ways. We consider three cases: selection of the average, the “most typical,” and the “least typical” document. The algorithms are given, which rely on a dictionary of keywords reflecting the topic of the user's interest. After clustering, we select a document in each cluster basing on its closeness to the other ones. Different distance measures are discussed; preliminary experimental results are presented. Our approach was implemented in the new version of Document Classifier system.",

author = "Alexander Gelbukh and Mikhail Alexandrov and Ales Bourek and Pavel Makagonov",

note = "Publisher Copyright: {\textcopyright} 2003 Gesellschaft fur Informatik (GI). All rights reserved.; 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003 ; Conference date: 23-06-2003 Through 25-06-2003",

year = "2003",

language = "English",

series = "Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)",

publisher = "Gesellschaft fur Informatik (GI)",

pages = "120--126",

editor = "Antje Dusterhoft and Bernhard Thalheim",

booktitle = "Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003",

}

Gelbukh, A, Alexandrov, M, Bourek, A & Makagonov, P 2003, Selection of Representative Documents for Clusters in a Document Collection. in A Dusterhoft & B Thalheim (eds), Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003. Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI), vol. P-29, Gesellschaft fur Informatik (GI), pp. 120-126, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003, Burg, Germany, 23/06/03.

Selection of Representative Documents for Clusters in a Document Collection. / Gelbukh, Alexander; Alexandrov, Mikhail; Bourek, Ales et al.
Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003. ed. / Antje Dusterhoft; Bernhard Thalheim. Gesellschaft fur Informatik (GI), 2003. p. 120-126 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI); Vol. P-29).

TY - GEN

T1 - Selection of Representative Documents for Clusters in a Document Collection

AU - Gelbukh, Alexander

AU - Alexandrov, Mikhail

AU - Bourek, Ales

AU - Makagonov, Pavel

PY - 2003

Y1 - 2003

N2 - An efficient way to explore a large document collection (e.g., the search results returned by a search engine) is to subdivide it into clusters of relatively similar documents, to get a general view of the collection and select its parts of particular interest. A way of presenting the clusters to the user is selection of a document in each cluster. For different purposes this can be done in different ways. We consider three cases: selection of the average, the “most typical,” and the “least typical” document. The algorithms are given, which rely on a dictionary of keywords reflecting the topic of the user's interest. After clustering, we select a document in each cluster basing on its closeness to the other ones. Different distance measures are discussed; preliminary experimental results are presented. Our approach was implemented in the new version of Document Classifier system.

UR - http://www.scopus.com/inward/record.url?scp=84971482262&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84971482262

T3 - Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)

SP - 120

EP - 126

BT - Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003

A2 - Dusterhoft, Antje

A2 - Thalheim, Bernhard

PB - Gesellschaft fur Informatik (GI)

T2 - 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003

Y2 - 23 June 2003 through 25 June 2003

ER -

Gelbukh A, Alexandrov M, Bourek A, Makagonov P. Selection of Representative Documents for Clusters in a Document Collection. In Dusterhoft A, Thalheim B, editors, Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, NLDB 2003. Gesellschaft fur Informatik (GI). 2003. p. 120-126. (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)).
