Abstract
Different situations call for different representative documents from the groups found in a large document set. We consider three such problems: selecting the average document, the "most typical" document, and the "least typical" document of a group, and we give an algorithm for each. The tasks are considered within a given topic defined by a domain-oriented keyword dictionary. The procedure has two phases: (1) clustering the documents into sub-topics and (2) determining the representative document of each group. For the latter, we introduce the notions of potential and difference of potentials, which are applied to the dendrite constructed by the nearest-neighbor method. Unlike traditional clustering on a dendrite, the potentials make it possible to take the structure of connections into account in significantly greater detail. The approach has been implemented in a new version of the Text Classifier system.
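The two-phase procedure can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: documents are represented as keyword-frequency vectors, distance is Euclidean, and the dendrite is taken to be the nearest-neighbor tree built in the manner of Prim's algorithm. The sketch shows only the traditional clustering on a dendrite (cutting the longest edges) and the selection of the average document. The potential-based selection of the "most typical" and "least typical" documents is not reproduced, because the abstract does not define the potentials. All function names are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two keyword-frequency vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_dendrite(vectors):
    """Grow a nearest-neighbor tree: repeatedly attach the closest outside document.

    Returns a list of edges (i, j, dist), where i is already in the tree.
    """
    n = len(vectors)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = distance(vectors[i], vectors[j])
                if best is None or d < best[2]:
                    best = (i, j, d)
        edges.append(best)
        in_tree.add(best[1])
    return edges


def split_into_groups(n, edges, k):
    """Traditional dendrite clustering: drop the k-1 longest edges.

    Returns the connected components as sorted lists of document indices.
    """
    kept = sorted(edges, key=lambda e: e[2])[: max(0, len(edges) - (k - 1))]
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _ in kept:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def average_document(vectors, group):
    """Return the index of the group member closest to the group centroid."""
    dim = len(vectors[group[0]])
    centroid = [sum(vectors[i][d] for i in group) / len(group) for d in range(dim)]
    return min(group, key=lambda i: distance(vectors[i], centroid))
```

A possible use: call `build_dendrite` on the document vectors, pass the resulting edges to `split_into_groups` with the desired number of sub-topics, and then call `average_document` on each group.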
Original language | English |
---|---|
Title of host publication | Advances in Communications and Software Technologies |
Publisher | World Scientific and Engineering Academy and Society |
Pages | 197-202 |
Number of pages | 6 |
ISBN (Print) | 9608052718 |
State | Published - 2002 |
Keywords
- Clustering
- Dendrite
- Document Categorization
- Natural Language Processing
- Potential