Selection of representative documents in a document collection

Pavel Makagonov, Mikhail Alexandrov, Alexander Gelbukh

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

5 Scopus citations

Abstract

Depending on the situation, different documents can be selected as representatives of the document groups found in a large document set. We consider three problems of this kind: selection of the average document, the "most typical" document, and the "least typical" one, and give the corresponding algorithms. These tasks are considered within the framework of a given topic defined by a domain-oriented keyword dictionary. The procedure consists of two phases: (1) clustering documents into sub-topics and (2) determining the representative document in each group. For the latter, the notions of potential and difference of potentials are introduced and applied to the dendrite constructed by the nearest-neighbor method. Unlike traditional clustering on a dendrite, the potentials make it possible to take the structure of connections into account in significantly greater detail. The approach has been implemented in a new version of the system Text Classifier.
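The second phase can be illustrated with a minimal sketch. The abstract does not define the potential, so this is not the authors' algorithm: as a stand-in, each document is scored by the sum of its path lengths to all other documents along the dendrite. The cosine distance, the keyword-count vectors, and every function name below are assumptions made for illustration only.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two keyword-count vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def build_dendrite(vectors):
    """Build a tree by repeatedly attaching the outside document that is
    nearest to any document already in the tree (Prim's algorithm)."""
    n = len(vectors)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        d, i, j = min(
            (cosine_distance(vectors[i], vectors[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, d))
        in_tree.add(j)
    return edges


def tree_distances(n, edges, start):
    """Path length along the tree from `start` to every other node."""
    adjacency = {k: [] for k in range(n)}
    for i, j, d in edges:
        adjacency[i].append((j, d))
        adjacency[j].append((i, d))
    dist = {start: 0.0}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, d in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + d
                stack.append(v)
    return dist


def rank_by_tree_centrality(vectors):
    """Order document indices from most central to least central.
    The score is a hypothetical stand-in for the paper's potential."""
    n = len(vectors)
    edges = build_dendrite(vectors)
    score = {k: sum(tree_distances(n, edges, k).values()) for k in range(n)}
    return sorted(range(n), key=lambda k: score[k])


# Three toy documents over a two-keyword dictionary (invented data).
docs = [[1, 0], [1, 1], [0, 1]]
ranking = rank_by_tree_centrality(docs)
```

Under this stand-in score, the first index in `ranking` plays the role of a "most typical" document within a group and the last index that of a "least typical" one. The paper's potentials are said to capture the structure of connections in more detail than this simple path-length sum.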

Original language: English
Title of host publication: Advances in Communications and Software Technologies
Publisher: World Scientific and Engineering Academy and Society
Pages: 197-202
Number of pages: 6
ISBN (Print): 9608052718
State: Published - 2002

Keywords

  • Clustering
  • Dendrite
  • Document Categorization
  • Natural Language Processing
  • Potential
