Division of Spanish words into morphemes with a genetic algorithm

Alexander Gelbukh; Grigori Sidorov; Diego Lara-Reyes; Liliana Chanona-Hernandez

doi:10.1007/978-3-540-69858-6_4

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)

Abstract

We discuss an unsupervised technique for determining the morpheme structure of words in an inflective language, with Spanish as a case study. We use global optimization (implemented with a genetic algorithm), whereas most previous work relies on heuristics calculated from conditional probabilities of word parts. Thus, we deal with the complete space of solutions and do not reduce it at the risk of eliminating some correct solutions beforehand. We also work at the derivational level, in contrast with the more traditional grammatical level, which is concerned only with inflections. The algorithm works as follows. The input is a wordlist built from a large dictionary or corpus of the given language, and the output is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that might be prefixes, suffixes, or stems of the words in the wordlist. Then, we detect possible paradigms in this set and remove from the lists of possible prefixes and suffixes (though not stems) all items that do not participate in such paradigms. Finally, a subset of those lists of possible prefixes, stems, and suffixes is chosen using the genetic algorithm. The fitness function is based on the idea of minimum description length, i.e., we choose the minimum number of elements necessary to cover all the words. The obtained subset is used to divide the words in the wordlist. Algorithm parameters are presented. A preliminary evaluation of the experimental results for a dictionary of Spanish is given.
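
The pipeline outlined in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: it is a toy under stated assumptions. It handles only stem + suffix splits (no prefixes), treats a "paradigm" as any stem seen with at least two different suffixes, and uses a bit-mask chromosome with a cost equal to the number of selected elements plus a penalty for uncovered words. All function names, the word list, and the GA parameters (`pop_size`, `generations`, `p_mut`, `seed`) are hypothetical.

```python
import random

def candidates(words):
    """Redundant candidate lists: every split of every word into stem + suffix."""
    stems, suffixes = set(), set()
    for w in words:
        for i in range(1, len(w) + 1):       # i == len(w) gives the empty suffix
            stems.add(w[:i])
            suffixes.add(w[i:])
    return stems, suffixes

def paradigm_suffixes(words):
    """Assumed paradigm test: keep suffixes that share a stem with another suffix."""
    by_stem = {}
    for w in words:
        for i in range(1, len(w) + 1):
            by_stem.setdefault(w[:i], set()).add(w[i:])
    kept = {""}                               # the empty suffix is always allowed
    for group in by_stem.values():
        if len(group) >= 2:
            kept |= group
    return kept

def split_word(w, stems, suffixes):
    """First split of w whose stem and suffix are both selected, else None."""
    for i in range(1, len(w) + 1):
        if w[:i] in stems and w[i:] in suffixes:
            return w[:i], w[i:]
    return None

def decode(bits, items):
    """Turn a bit mask over items into the selected stem and suffix sets."""
    stems = {s for b, (kind, s) in zip(bits, items) if b and kind == "stem"}
    sufs = {s for b, (kind, s) in zip(bits, items) if b and kind == "suffix"}
    return stems, sufs

def cost(bits, items, words):
    """MDL-style cost (lower is better): selected elements + penalty per uncovered word."""
    stems, sufs = decode(bits, items)
    uncovered = sum(split_word(w, stems, sufs) is None for w in words)
    return sum(bits) + uncovered * (len(items) + 1)

def evolve(items, words, pop_size=30, generations=150, p_mut=0.02, seed=0):
    """Plain generational GA: elitism, tournament selection, one-point crossover."""
    rng = random.Random(seed)
    n = len(items)
    key = lambda bits: cost(bits, items, words)
    # Seed the population with the all-ones mask, which covers every word.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=key)
        nxt = pop[:2]                         # elitism: the best two survive unchanged
        while len(nxt) < pop_size:
            a, b = (min(rng.sample(pop, 3), key=key) for _ in range(2))
            cut = rng.randrange(1, n)
            nxt.append([bit ^ (rng.random() < p_mut) for bit in a[:cut] + b[cut:]])
        pop = nxt
    return min(pop, key=key)

words = ["canto", "canta", "cantar", "como", "come", "comer"]
stems, suffixes = candidates(words)
suffixes &= paradigm_suffixes(words)          # filter suffixes only, never stems
items = [("stem", s) for s in sorted(stems)] + [("suffix", s) for s in sorted(suffixes)]
best = evolve(items, words)
chosen_stems, chosen_suffixes = decode(best, items)
segmentation = {w: split_word(w, chosen_stems, chosen_suffixes) for w in words}
```

Because the initial population contains the all-ones mask and the two best individuals always survive, the returned mask covers every word; how compact the selected set is depends on the random seed and the number of generations.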

Original language	English
Title of host publication	Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings
Pages	19-26
Number of pages	8
DOI	https://doi.org/10.1007/978-3-540-69858-6_4
Publication status	Published - 2008
Event	13th International Conference on Natural Language and Information Systems, NLDB 2008 - London, United Kingdom. Duration: 24 Jun 2008 → 27 Jun 2008

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	5039 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	13th International Conference on Natural Language and Information Systems, NLDB 2008
Country/Territory	United Kingdom
City	London
Period	24/06/08 → 27/06/08

Access to document

10.1007/978-3-540-69858-6_4

Other files and links

Link to publication in Scopus

Cite this

Gelbukh, A., Sidorov, G., Lara-Reyes, D., & Chanona-Hernandez, L. (2008). Division of Spanish words into morphemes with a genetic algorithm. In Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings (pp. 19-26). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5039 LNCS). https://doi.org/10.1007/978-3-540-69858-6_4

Gelbukh, Alexander ; Sidorov, Grigori ; Lara-Reyes, Diego et al. / Division of Spanish words into morphemes with a genetic algorithm. Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings. 2008. pp. 19-26 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{169896ad83a04cc484f695b28087c75c,

title = "Division of Spanish words into morphemes with a genetic algorithm",

author = "Alexander Gelbukh and Grigori Sidorov and Diego Lara-Reyes and Liliana Chanona-Hernandez",

note = "Funding Information: Work done under partial support of Mexican Government (CONACYT, SNI) and National Polytechnic Institute, Mexico (SIP, COFAA, PIFI).; 13th International Conference on Natural Language and Information Systems, NLDB 2008 ; Conference date: 24-06-2008 Through 27-06-2008",

year = "2008",

doi = "10.1007/978-3-540-69858-6_4",

language = "English",

isbn = "3540698574",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

pages = "19--26",

booktitle = "Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings",

}

Gelbukh, A, Sidorov, G, Lara-Reyes, D & Chanona-Hernandez, L 2008, Division of Spanish words into morphemes with a genetic algorithm. In Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5039 LNCS, pp. 19-26, 13th International Conference on Natural Language and Information Systems, NLDB 2008, London, United Kingdom, 24/06/08. https://doi.org/10.1007/978-3-540-69858-6_4

Division of Spanish words into morphemes with a genetic algorithm. / Gelbukh, Alexander ; Sidorov, Grigori ; Lara-Reyes, Diego et al.
Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings. 2008. p. 19-26 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5039 LNCS).

TY - GEN

T1 - Division of Spanish words into morphemes with a genetic algorithm

AU - Gelbukh, Alexander

AU - Sidorov, Grigori

AU - Lara-Reyes, Diego

AU - Chanona-Hernandez, Liliana

N1 - Funding Information: Work done under partial support of Mexican Government (CONACYT, SNI) and National Polytechnic Institute, Mexico (SIP, COFAA, PIFI).

PY - 2008

Y1 - 2008

AB - We discuss an unsupervised technique for determining morpheme structure of words in an inflective language, with Spanish as a case study. For this, we use a global optimization (implemented with a genetic algorithm), while most of the previous works are based on heuristics calculated using conditional probabilities of word parts. Thus, we deal with complete space of solutions and do not reduce it with the risk to eliminate some correct solutions beforehand. Also, we are working at the derivative level as contrasted with the more traditional grammatical level interested only in flexions. The algorithm works as follows. The input data is a wordlist built on the base of a large dictionary or corpus in the given language and the output data is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that might possibly be prefixes, suffixes, and stems of the words in the wordlist. Then, we detect possible paradigms in this set and filter out all items from the lists of possible prefixes and suffixes (though not stems) that do not participate in such paradigms. Finally, a subset of those lists of possible prefixes, stems, and suffixes is chosen using the genetic algorithm. The fitness function is based on the ideas of minimum length description, i.e. we choose the minimum number of elements that are necessary for covering all the words. The obtained subset is used for dividing the words from the wordlist. Algorithm parameters are presented. Preliminary evaluation of the experimental results for a dictionary of Spanish is given.

UR - http://www.scopus.com/inward/record.url?scp=47749154862&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-69858-6_4

DO - 10.1007/978-3-540-69858-6_4

M3 - Conference contribution

SN - 3540698574

SN - 9783540698579

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 19

EP - 26

BT - Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings

T2 - 13th International Conference on Natural Language and Information Systems, NLDB 2008

Y2 - 24 June 2008 through 27 June 2008

ER -

Gelbukh A, Sidorov G, Lara-Reyes D, Chanona-Hernandez L. Division of Spanish words into morphemes with a genetic algorithm. In Natural Language and Information Systems - 13th International Conference on Applications of Natural Language to Information Systems, NLDB 2008, Proceedings. 2008. p. 19-26. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-540-69858-6_4
