Measuring similarity between Karel programs using character and word n-grams

G. Sidorov; M. Ibarra Romero; I. Markov; R. Guzman-Cabrera; L. Chanona-Hernández; F. Velásquez

doi:10.1134/S0361768817010066

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of the latent semantic analysis in this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using Support Vector Machines classifier, applying the latent semantic analysis, and selecting as features trigrams of words.
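The feature extraction mentioned in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' pipeline: the paper trains classifiers (with Support Vector Machines and latent semantic analysis performing best), whereas this example only builds character and word n-gram counts and compares two programs by cosine similarity. The function names, the whitespace tokenization, and the Karel-like snippets are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count overlapping character n-grams in a source string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams (assumed tokenization: split on whitespace)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical Karel-like snippets, invented for this example (not from the corpus).
prog_a = "iterate (3) { move(); } turnleft();"
prog_b = "iterate (4) { move(); } turnleft();"
prog_c = "while (frontIsClear) { pickbeeper(); }"

# Word trigrams: prog_a and prog_b differ in one token, prog_c shares nothing.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))
print(sim_ab, sim_ac)
```

In a classification setting such as the one the abstract describes, count vectors like these would serve as input features to a learning algorithm rather than being compared directly.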

Original language	English
Pages (from-to)	47-50
Number of pages	4
Journal	Programming and Computer Software
Volume	43
Issue number	1
DOI	https://doi.org/10.1134/S0361768817010066
Status	Published - 1 Jan 2017

Cite this

@article{7410004aca8242f3b2387b3dd9152189,

title = "Measuring similarity between Karel programs using character and word n-grams",

abstract = "We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of the latent semantic analysis in this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using Support Vector Machines classifier, applying the latent semantic analysis, and selecting as features trigrams of words.",

keywords = "Karel programming language, LSA, SVM, character n-grams, machine learning, similarity, word n-grams",

author = "Sidorov, G. and {Ibarra Romero}, M. and Markov, I. and Guzman-Cabrera, R. and Chanona-Hern{\'a}ndez, L. and Vel{\'a}squez, F.",

note = "Publisher Copyright: {\textcopyright} 2017, Pleiades Publishing, Ltd.",

year = "2017",

month = jan,

day = "1",

doi = "10.1134/S0361768817010066",

language = "English",

volume = "43",

pages = "47--50",

journal = "Programming and Computer Software",

issn = "0361-7688",

number = "1",

}

TY - JOUR

T1 - Measuring similarity between Karel programs using character and word n-grams

AU - Sidorov, G.

AU - Ibarra Romero, M.

AU - Markov, I.

AU - Guzman-Cabrera, R.

AU - Chanona-Hernández, L.

AU - Velásquez, F.

PY - 2017/1/1

Y1 - 2017/1/1

N2 - We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of the latent semantic analysis in this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using Support Vector Machines classifier, applying the latent semantic analysis, and selecting as features trigrams of words.

AB - We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of the latent semantic analysis in this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using Support Vector Machines classifier, applying the latent semantic analysis, and selecting as features trigrams of words.

KW - Karel programming language

KW - LSA

KW - SVM

KW - character n-grams

KW - machine learning

KW - similarity

KW - word n-grams

UR - http://www.scopus.com/inward/record.url?scp=85013482408&partnerID=8YFLogxK

U2 - 10.1134/S0361768817010066

DO - 10.1134/S0361768817010066

M3 - Article

AN - SCOPUS:85013482408

SN - 0361-7688

VL - 43

SP - 47

EP - 50

JO - Programming and Computer Software

JF - Programming and Computer Software

IS - 1

ER -
