Selection of representative documents in a document collection

Pavel Makagonov; Mikhail Alexandrov; Alexander Gelbukh

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

5 Scopus citations

Abstract

Different situations call for different documents to represent the document groups found in a large document set. We consider three problems of this kind: selecting the average document, the "most typical" document, and the "least typical" document, and give an algorithm for each. These tasks are considered within a given topic defined by a domain-oriented keyword dictionary. The procedure consists of two phases: (1) clustering the documents into sub-topics and (2) determining the representative document in each group. For the latter, the notions of potential and difference of potentials are introduced and applied to the dendrite constructed by the nearest-neighbor method. Unlike traditional clustering on a dendrite, the potentials take the structure of connections into account in significantly greater detail. The approach has been implemented in a new version of the Text Classifier system.
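
The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not define the potentials, so they are not reproduced here. The sketch assumes that documents are keyword-frequency vectors over the domain dictionary, that distance is Euclidean, and that the dendrite is the minimum spanning tree produced by nearest-neighbor linking. Groups are formed by cutting the longest edges, which corresponds to the traditional clustering on a dendrite that the paper contrasts with its potentials, and the "average" document is taken to be the one closest to the group centroid. All function names and the sample data are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two keyword-frequency vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_dendrite(vectors):
    """Grow a tree by repeatedly attaching the nearest outside document (Prim).

    Returns a list of (i, j, dist) edges; assumed to match the "dendrite".
    """
    if not vectors:
        return []
    # best[j] = (distance from j to the tree, tree node that is closest to j)
    best = {j: (distance(vectors[0], vectors[j]), 0) for j in range(1, len(vectors))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((i, j, d))
        for k in best:
            dk = distance(vectors[j], vectors[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges


def split_into_groups(n, edges, k):
    """Cut the k-1 longest edges and return the connected components."""
    kept = sorted(edges, key=lambda e: e[2])[: max(0, len(edges) - (k - 1))]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _ in kept:
        parent[find(i)] = find(j)
    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


def average_document(vectors, group):
    """Index of the document in the group that lies closest to the group centroid."""
    dim = len(vectors[group[0]])
    centroid = [sum(vectors[i][d] for i in group) / len(group) for d in range(dim)]
    return min(group, key=lambda i: distance(vectors[i], centroid))


# Hypothetical data: six documents counted against a two-keyword dictionary.
docs = [[5, 0], [4, 1], [6, 0], [0, 5], [1, 4], [0, 6]]
dendrite = build_dendrite(docs)
groups = split_into_groups(len(docs), dendrite, k=2)
representatives = [average_document(docs, g) for g in groups]
print(groups, representatives)
```

A centroid-based choice ignores how the documents are connected inside the tree, which is the kind of structural detail the abstract says the potentials are meant to capture.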

Original language	English
Title of host publication	Advances in Communications and Software Technologies
Publisher	World Scientific and Engineering Academy and Society
Pages	197-202
Number of pages	6
ISBN (Print)	9608052718
State	Published - 2002

Keywords

Clustering
Dendrite
Document Categorization
Natural Language Processing
Potential

Cite this

@inbook{32c49fb144474d5fb8b62682bae7d97b,
title = "Selection of representative documents in a document collection",
abstract = "Different situations call for different documents to represent the document groups found in a large document set. We consider three problems of this kind: selecting the average document, the {"}most typical{"} document, and the {"}least typical{"} document, and give an algorithm for each. These tasks are considered within a given topic defined by a domain-oriented keyword dictionary. The procedure consists of two phases: (1) clustering the documents into sub-topics and (2) determining the representative document in each group. For the latter, the notions of potential and difference of potentials are introduced and applied to the dendrite constructed by the nearest-neighbor method. Unlike traditional clustering on a dendrite, the potentials take the structure of connections into account in significantly greater detail. The approach has been implemented in a new version of the Text Classifier system.",
keywords = "Clustering, Dendrite, Document Categorization, Natural Language Processing, Potential",
author = "Pavel Makagonov and Mikhail Alexandrov and Alexander Gelbukh",
year = "2002",
language = "English",
isbn = "9608052718",
pages = "197--202",
booktitle = "Advances in Communications and Software Technologies",
publisher = "World Scientific and Engineering Academy and Society",
address = "Greece",
}

TY - CHAP

T1 - Selection of representative documents in a document collection

AU - Makagonov, Pavel

AU - Alexandrov, Mikhail

AU - Gelbukh, Alexander

PY - 2002

Y1 - 2002

N2 - Different situations call for different documents to represent the document groups found in a large document set. We consider three problems of this kind: selecting the average document, the "most typical" document, and the "least typical" document, and give an algorithm for each. These tasks are considered within a given topic defined by a domain-oriented keyword dictionary. The procedure consists of two phases: (1) clustering the documents into sub-topics and (2) determining the representative document in each group. For the latter, the notions of potential and difference of potentials are introduced and applied to the dendrite constructed by the nearest-neighbor method. Unlike traditional clustering on a dendrite, the potentials take the structure of connections into account in significantly greater detail. The approach has been implemented in a new version of the Text Classifier system.

KW - Clustering

KW - Dendrite

KW - Document Categorization

KW - Natural Language Processing

KW - Potential

UR - http://www.scopus.com/inward/record.url?scp=4944230807&partnerID=8YFLogxK

M3 - Chapter

AN - SCOPUS:4944230807

SN - 9608052718

SP - 197

EP - 202

BT - Advances in Communications and Software Technologies

PB - World Scientific and Engineering Academy and Society

ER -
