TY - GEN
T1 - Compilation of a Spanish representative corpus
AU - Gelbukh, Alexander
AU - Sidorov, Grigori
AU - Chanona-Hernández, Liliana
N1 - Publisher Copyright: © Springer-Verlag Berlin Heidelberg 2002.
PY - 2002
Y1 - 2002
N2 - Due to Zipf's law, even a very large corpus contains very few occurrences (tokens) for the majority of its different words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information necessary for studying the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.
AB - Due to Zipf's law, even a very large corpus contains very few occurrences (tokens) for the majority of its different words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information necessary for studying the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. The corresponding algorithm and its application to Spanish are described. Different concepts of a representative corpus are discussed.
UR - http://www.scopus.com/inward/record.url?scp=84957710250&partnerID=8YFLogxK
U2 - 10.1007/3-540-45715-1_27
DO - 10.1007/3-540-45715-1_27
M3 - Conference contribution
SN - 3540432191
SN - 9783540457152
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 285
EP - 288
BT - Computational Linguistics and Intelligent Text Processing - 3rd International Conference, CICLing 2002, Proceedings
A2 - Gelbukh, Alexander
PB - Springer-Verlag
T2 - 3rd Annual Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2002
Y2 - 17 February 2002 through 23 February 2002
ER -