TY - JOUR
T1 - Unsupervised WSD by finding the predominant sense using context as a dynamic thesaurus
AU - Tejada-Cárcamo, Javier
AU - Calvo, Hiram
AU - Gelbukh, Alexander
AU - Hara, Kazuo
N1 - Funding Information:
Regular Paper. Supported by the Mexican Government (SNI, SIP-IPN, COFAA-IPN, and PIFI-IPN), CONACYT, and the Japanese Government. ©2010 Springer Science + Business Media, LLC & Science Press, China
PY - 2010/9
Y1 - 2010/9
N2 - We present and analyze an unsupervised method for Word Sense Disambiguation (WSD). Our work is based on the method presented by McCarthy et al. in 2004 for finding the predominant sense of each word in the entire corpus. Their maximization algorithm allows weighted terms (similar words) from a distributional thesaurus to accumulate a score for each ambiguous word sense, i.e., the sense with the highest score is chosen based on votes from a weighted list of terms related to the ambiguous word. This list is obtained using the distributional similarity method proposed by Dekang Lin to build a thesaurus. In the method of McCarthy et al., every occurrence of the ambiguous word uses the same thesaurus, regardless of the context in which the ambiguous word occurs. Our method accounts for the context of a word when determining its sense by building the list of distributionally similar words based on the syntactic context of the ambiguous word. We obtain a top precision of 77.54% versus 67.10% for the original method when tested on SemCor. We also analyze the effect of the number of weighted terms on the tasks of finding the Most Frequent Sense (MFS) and WSD, and experiment with several corpora for building the Word Space Model.
AB - We present and analyze an unsupervised method for Word Sense Disambiguation (WSD). Our work is based on the method presented by McCarthy et al. in 2004 for finding the predominant sense of each word in the entire corpus. Their maximization algorithm allows weighted terms (similar words) from a distributional thesaurus to accumulate a score for each ambiguous word sense, i.e., the sense with the highest score is chosen based on votes from a weighted list of terms related to the ambiguous word. This list is obtained using the distributional similarity method proposed by Dekang Lin to build a thesaurus. In the method of McCarthy et al., every occurrence of the ambiguous word uses the same thesaurus, regardless of the context in which the ambiguous word occurs. Our method accounts for the context of a word when determining its sense by building the list of distributionally similar words based on the syntactic context of the ambiguous word. We obtain a top precision of 77.54% versus 67.10% for the original method when tested on SemCor. We also analyze the effect of the number of weighted terms on the tasks of finding the Most Frequent Sense (MFS) and WSD, and experiment with several corpora for building the Word Space Model.
KW - Semantic similarity
KW - Text corpus
KW - Thesaurus
KW - Word sense disambiguation
KW - Word space model
UR - http://www.scopus.com/inward/record.url?scp=78650204798&partnerID=8YFLogxK
U2 - 10.1007/s11390-010-9385-2
DO - 10.1007/s11390-010-9385-2
M3 - Article
SN - 1000-9000
VL - 25
SP - 1030
EP - 1039
JO - Journal of Computer Science and Technology
JF - Journal of Computer Science and Technology
IS - 5
ER -