Compilation of a Spanish representative corpus

Alexander Gelbukh; Grigori Sidorov; Liliana Chanona-Hernández

doi:10.1007/3-540-45715-1_27

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Scopus citations

Abstract

Because of Zipf's law, even a very large corpus contains very few occurrences (tokens) of most of its distinct words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information needed to study the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.
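The type/token imbalance the abstract refers to can be illustrated with a minimal sketch. It is not the authors' algorithm, which this record does not describe. The function names `zipf_sample` and `rare_type_share`, the idealized frequency model (the word of rank r occurs about `top_count / r` times), and the threshold of 10 occurrences are all assumptions made for illustration.

```python
from collections import Counter


def zipf_sample(vocab_size, top_count):
    """Build a toy token list with idealized Zipfian frequencies.

    Assumption: the word of rank r occurs max(1, top_count // r) times.
    """
    tokens = []
    for rank in range(1, vocab_size + 1):
        tokens.extend([f"w{rank}"] * max(1, top_count // rank))
    return tokens


def rare_type_share(tokens, min_occurrences):
    """Return the fraction of types that occur fewer than min_occurrences times."""
    counts = Counter(tokens)
    rare = sum(1 for c in counts.values() if c < min_occurrences)
    return rare / len(counts)


if __name__ == "__main__":
    corpus = zipf_sample(vocab_size=1000, top_count=1000)
    # Share of distinct words with fewer than 10 occurrences in the toy corpus.
    print(rare_type_share(corpus, min_occurrences=10))
```

Under this toy model, most types fall below the threshold even though the most frequent word occurs a thousand times, which is the situation a representative corpus in the abstract's sense is meant to avoid.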

Original language	English
Title of host publication	Computational Linguistics and Intelligent Text Processing - 3rd International Conference, CICLing 2002, Proceedings
Editors	Alexander Gelbukh
Publisher	Springer Verlag
Pages	285-288
Number of pages	4
ISBN (Print)	3540432191, 9783540457152
DOIs	https://doi.org/10.1007/3-540-45715-1_27
State	Published - 2002
Event	3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002 - Mexico City, Mexico. Duration: 17 Feb 2002 → 23 Feb 2002

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	2276
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002
Country/Territory	Mexico
City	Mexico City
Period	17/02/02 → 23/02/02

Cite this

Gelbukh, A., Sidorov, G., & Chanona-Hernández, L. (2002). Compilation of a Spanish representative corpus. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text Processing - 3rd International Conference, CICLing 2002, Proceedings (pp. 285-288). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2276). Springer Verlag. https://doi.org/10.1007/3-540-45715-1_27

@inproceedings{0c5e82e4432043ccb02abe7eb16aae9e,
title = "Compilation of a Spanish representative corpus",
author = "Alexander Gelbukh and Grigori Sidorov and Liliana Chanona-Hern{\'a}ndez",
note = "Publisher Copyright: {\textcopyright} Springer-Verlag Berlin Heidelberg 2002.; 3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002 ; Conference date: 17-02-2002 Through 23-02-2002",
year = "2002",
doi = "10.1007/3-540-45715-1_27",
language = "English",
isbn = "3540432191",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "285--288",
editor = "Alexander Gelbukh",
booktitle = "Computational Linguistics and Intelligent Text Processing - 3rd International Conference, CICLing 2002, Proceedings",
address = "Germany",
}

TY - GEN
T1 - Compilation of a Spanish representative corpus
AU - Gelbukh, Alexander
AU - Sidorov, Grigori
AU - Chanona-Hernández, Liliana
PY - 2002
Y1 - 2002
UR - http://www.scopus.com/inward/record.url?scp=84957710250&partnerID=8YFLogxK
U2 - 10.1007/3-540-45715-1_27
DO - 10.1007/3-540-45715-1_27
M3 - Conference contribution
SN - 3540432191
SN - 9783540457152
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 285
EP - 288
BT - Computational Linguistics and Intelligent Text Processing - 3rd International Conference, CICLing 2002, Proceedings
A2 - Gelbukh, Alexander
PB - Springer Verlag
T2 - 3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002
Y2 - 17 February 2002 through 23 February 2002
ER -
