Assigning Library of Congress Classification codes to books based only on their titles

Ricardo Ávila-Argüelles; Hiram Calvo; Alexander Gelbukh; Salvador Godoy-Calderón


Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Many publishers follow the Library of Congress Classification (LCC) scheme to indicate a classification code on the first pages of their books. This is useful for many libraries worldwide because it makes it possible to search and retrieve books by content type, and this scheme has become a de facto standard. However, not every book has been pre-classified by the publisher; in particular, in many universities, new dissertations have to be classified manually. Although there are many systems available for automatic text classification, all of them use extensive information which is not always available, such as the index, abstract, or even the whole content of the work. In this work, we present our experiments on supervised classification of books by using only their title, which would allow massive automatic indexing. We propose a new text comparison measure, which mixes two well-known text classification techniques: the Lesk voting scheme and the Term Frequency (TF). In addition, we experiment with different weighting as well as logical-combinatorial methods such as ALVOT in order to determine the contribution of the title in the correct classification. We found this contribution to be approximately one third, as we correctly classified 36% (on average by each branch) of 122,431 previously unseen titles (in total) upon training with 489,726 samples (in total) of one major branch (Q) of the LCC catalogue.
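The abstract does not define the proposed measure, so the following is only a minimal sketch of the general idea it names: words of an unseen title vote for classes, and each vote is weighted by how frequent the word is in that class's training titles. It is not the authors' measure and does not include ALVOT; the function names (`tokenize`, `train`, `classify`) and the toy titles are hypothetical, while QA and QD are subclasses of LCC branch Q.

```python
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in title)
    return cleaned.split()


def train(samples):
    """Count how often each word occurs in the training titles of each class."""
    tf = defaultdict(Counter)
    for title, code in samples:
        tf[code].update(tokenize(title))
    return tf


def classify(title, tf):
    """Let each title word vote for every class, weighted by its relative
    frequency among that class's training words; return the top class."""
    votes = Counter()
    for code, counts in tf.items():
        total = sum(counts.values())
        for word in tokenize(title):
            if counts[word]:
                votes[code] += counts[word] / total
    return votes.most_common(1)[0][0] if votes else None


# Toy training data (invented titles, for illustration only).
samples = [
    ("Introduction to linear algebra", "QA"),
    ("Algebra and number theory", "QA"),
    ("Organic chemistry of polymers", "QD"),
    ("Physical chemistry", "QD"),
]
model = train(samples)
print(classify("Topics in algebra", model))
print(classify("Chemistry of metals", model))
```

A title that shares no word with any training title gets no votes, so `classify` returns `None` in that case; a real system would need a fallback and some handling of stop words such as "of" and "to".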

Original language	English
Pages (from-to)	77-84
Number of pages	8
Journal	Informatica (Ljubljana)
Volume	34
Issue number	1
State	Published - 2010

Other files and links

Link to publication in Scopus

Cite this

@article{f4386d550cd14f819d4cf27aeb7df7a9,

title = "Assigning Library of Congress Classification codes to books based only on their titles",

abstract = "Many publishers follow the Library of Congress Classification (LCC) scheme to indicate a classification code on the first pages of their books. This is useful for many libraries worldwide because it makes it possible to search and retrieve books by content type, and this scheme has become a de facto standard. However, not every book has been pre-classified by the publisher; in particular, in many universities, new dissertations have to be classified manually. Although there are many systems available for automatic text classification, all of them use extensive information which is not always available, such as the index, abstract, or even the whole content of the work. In this work, we present our experiments on supervised classification of books by using only their title, which would allow massive automatic indexing. We propose a new text comparison measure, which mixes two well-known text classification techniques: the Lesk voting scheme and the Term Frequency (TF). In addition, we experiment with different weighting as well as logical-combinatorial methods such as ALVOT in order to determine the contribution of the title in the correct classification. We found this contribution to be approximately one third, as we correctly classified 36\% (on average by each branch) of 122,431 previously unseen titles (in total) upon training with 489,726 samples (in total) of one major branch (Q) of the LCC catalogue.",

keywords = "LCC, Library classification, Logical-combinatorial methods, Scarce information classification",

author = "Ricardo {\'A}vila-Arg{\"u}elles and Hiram Calvo and Alexander Gelbukh and Salvador Godoy-Calder{\'o}n",

year = "2010",

language = "English",

volume = "34",

pages = "77--84",

journal = "Informatica (Ljubljana)",

issn = "0350-5596",

number = "1",

}

TY - JOUR

T1 - Assigning Library of Congress Classification codes to books based only on their titles

AU - Ávila-Argüelles, Ricardo

AU - Calvo, Hiram

AU - Gelbukh, Alexander

AU - Godoy-Calderón, Salvador

PY - 2010

Y1 - 2010

N2 - Many publishers follow the Library of Congress Classification (LCC) scheme to indicate a classification code on the first pages of their books. This is useful for many libraries worldwide because it makes it possible to search and retrieve books by content type, and this scheme has become a de facto standard. However, not every book has been pre-classified by the publisher; in particular, in many universities, new dissertations have to be classified manually. Although there are many systems available for automatic text classification, all of them use extensive information which is not always available, such as the index, abstract, or even the whole content of the work. In this work, we present our experiments on supervised classification of books by using only their title, which would allow massive automatic indexing. We propose a new text comparison measure, which mixes two well-known text classification techniques: the Lesk voting scheme and the Term Frequency (TF). In addition, we experiment with different weighting as well as logical-combinatorial methods such as ALVOT in order to determine the contribution of the title in the correct classification. We found this contribution to be approximately one third, as we correctly classified 36% (on average by each branch) of 122,431 previously unseen titles (in total) upon training with 489,726 samples (in total) of one major branch (Q) of the LCC catalogue.

AB - Many publishers follow the Library of Congress Classification (LCC) scheme to indicate a classification code on the first pages of their books. This is useful for many libraries worldwide because it makes it possible to search and retrieve books by content type, and this scheme has become a de facto standard. However, not every book has been pre-classified by the publisher; in particular, in many universities, new dissertations have to be classified manually. Although there are many systems available for automatic text classification, all of them use extensive information which is not always available, such as the index, abstract, or even the whole content of the work. In this work, we present our experiments on supervised classification of books by using only their title, which would allow massive automatic indexing. We propose a new text comparison measure, which mixes two well-known text classification techniques: the Lesk voting scheme and the Term Frequency (TF). In addition, we experiment with different weighting as well as logical-combinatorial methods such as ALVOT in order to determine the contribution of the title in the correct classification. We found this contribution to be approximately one third, as we correctly classified 36% (on average by each branch) of 122,431 previously unseen titles (in total) upon training with 489,726 samples (in total) of one major branch (Q) of the LCC catalogue.

KW - LCC

KW - Library classification

KW - Logical-combinatorial methods

KW - Scarce information classification

UR - http://www.scopus.com/inward/record.url?scp=77951963503&partnerID=8YFLogxK

M3 - Article

SN - 0350-5596

VL - 34

SP - 77

EP - 84

JO - Informatica (Ljubljana)

JF - Informatica (Ljubljana)

IS - 1

ER -
