Compilation of a Spanish representative corpus

Alexander Gelbukh, Grigori Sidorov, Liliana Chanona-Hernández

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

10 Scopus citations

Abstract

Because of Zipf's law, even a very large corpus contains very few occurrences (tokens) of most of its distinct words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information needed to study the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.

Original language: English
Title of host publication: Computational Linguistics and Intelligent Text Processing - 3rd International Conference, CICLing 2002, Proceedings
Editors: Alexander Gelbukh
Publisher: Springer Verlag
Pages: 285-288
Number of pages: 4
ISBN (Print): 3540432191, 9783540457152
State: Published - 2002
Event: 3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002 - Mexico City, Mexico
Duration: 17 Feb 2002 to 23 Feb 2002

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2276
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002
Country/Territory: Mexico
City: Mexico City
Period: 17/02/02 to 23/02/02
