TY - GEN
T1 - Various criteria of collocation cohesion in internet
T2 - 9th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2008
AU - Bolshakov, Igor A.
AU - Bolshakova, Elena I.
AU - Kotlyarov, Alexey P.
AU - Gelbukh, Alexander
N1 - Funding Information:
Work done with partial support from the Mexican Government (CONACyT, SNI, CGEPI-IPN) and the Russian Foundation of Fundamental Research (grant 06-01-00571).
PY - 2008
Y1 - 2008
AB - To extract collocations from the Internet, the cohesion between potential collocates must be estimated numerically. The Mutual Information (MI) cohesion measure, based on the numbers of collocates occurring close together (N_12) and apart (N_1, N_2), is well known, but Web page statistics deprive MI of its statistical validity. We propose a family of different measures that depend on N_1, N_2, and N_12 in a similarly monotonic way and possess the scalability feature of MI. We apply the new criteria to a collection of N_1, N_2, and N_12 values obtained from AltaVista for links between a few tens of English nouns and several hundred of their modifiers taken from the Oxford Collocations Dictionary. The 'noun-its own adjective' pairs are true collocations, and their measure values form one distribution. The 'noun-alien adjective' pairs are false collocations, and their measure values form another distribution. A discriminating threshold is sought that minimizes the sum of the probabilities of the two possible types of error. The resolving power of a criterion is equal to the minimum of this sum. The best criterion, delivering the minimum minimorum, is found.
UR - http://www.scopus.com/inward/record.url?scp=49949100097&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-78135-6_6
DO - 10.1007/978-3-540-78135-6_6
M3 - Conference contribution
SN - 354078134X
SN - 9783540781349
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 64
EP - 72
BT - Computational Linguistics and Intelligent Text Processing - 9th International Conference, CICLing 2008, Proceedings
Y2 - 17 February 2008 through 23 February 2008
ER -