An approach to clustering abstracts

Mikhail Alexandrov; Alexander Gelbukh; Paolo Rosso

doi:10.1007/11428817_25

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Conference article › peer-review

31 Scopus citations

Abstract

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, or only on geology, or only on computational linguistics, etc. Yet precisely such data sets are the most frequent and most interesting ones. We propose a simple procedure for clustering abstracts, which consists in grouping keywords and using a more adequate document similarity measure. We use Stein's MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov's proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.

Original language	English
Pages (from-to)	275-285
Number of pages	11
Journal	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	3513
DOIs	https://doi.org/10.1007/11428817_25
State	Published - 2005
Event	10th International Conference on Applications of Natural Language to Information Systems, NLDB 2005: Natural Language Processing and Information Systems - Alicante, Spain
Duration	15 Jun 2005 → 17 Jun 2005

Cite this

@article{fe4fbd124b094cb78b2945e0fced56a8,

title = "An approach to clustering abstracts",

abstract = "Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, or only on geology, or only on computational linguistics, etc. Yet precisely such data sets are the most frequent and most interesting ones. We propose a simple procedure for clustering abstracts, which consists in grouping keywords and using a more adequate document similarity measure. We use Stein's MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov's proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.",

author = "Mikhail Alexandrov and Alexander Gelbukh and Paolo Rosso",

year = "2005",

doi = "10.1007/11428817_25",

language = "English",

volume = "3513",

pages = "275--285",

journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

issn = "0302-9743",

publisher = "Springer Verlag",

note = "10th International Conference on Applications of Natural Language to Information Systems, NLDB 2005: Natural Language Processing and Information Systems ; Conference date: 15-06-2005 Through 17-06-2005",

}

TY - JOUR

T1 - An approach to clustering abstracts

AU - Alexandrov, Mikhail

AU - Gelbukh, Alexander

AU - Rosso, Paolo

PY - 2005

Y1 - 2005

N2 - Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, or only on geology, or only on computational linguistics, etc. Yet precisely such data sets are the most frequent and most interesting ones. We propose a simple procedure for clustering abstracts, which consists in grouping keywords and using a more adequate document similarity measure. We use Stein's MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov's proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.

AB - Free access to full-text scientific papers in major digital libraries and other web repositories is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, or only on geology, or only on computational linguistics, etc. Yet precisely such data sets are the most frequent and most interesting ones. We propose a simple procedure for clustering abstracts, which consists in grouping keywords and using a more adequate document similarity measure. We use Stein's MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov's proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, in order to help in search, classification, clustering, selection, and proper referencing of the papers.

UR - http://www.scopus.com/inward/record.url?scp=25144495542&partnerID=8YFLogxK

U2 - 10.1007/11428817_25

DO - 10.1007/11428817_25

M3 - Conference article

AN - SCOPUS:25144495542

SN - 0302-9743

VL - 3513

SP - 275

EP - 285

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

T2 - 10th International Conference on Applications of Natural Language to Information Systems, NLDB 2005: Natural Language Processing and Information Systems

Y2 - 15 June 2005 through 17 June 2005

ER -
