TY - JOUR
T1 - Detección automática de similitud entre programas del lenguaje de programación Karel basada en técnicas de procesamiento de lenguaje natural
AU - Sidorov, Grigori
AU - Romero, Martin Ibarra
AU - Markov, Ilia
AU - Guzman-Cabrera, Rafael
AU - Chanona-Hernández, Liliana
AU - Velásquez, Francisco
PY - 2016
Y1 - 2016
AB - In this paper, we present a method for calculating similarity between programs (source codes). One of the applications of the task is detection of code reuse, for example, in the case of plagiarism. The Karel programming language is used for experiments. In order to determine similarity between Karel programs and/or similar software solutions, we make use of techniques from the fields of natural language processing and information retrieval. These techniques use representations of documents as vectors of features and their values. Usually, the features are n-grams of words or n-grams of characters. In addition, we consider application of the latent semantic analysis for reduction of the number of dimensions of the vector space. Finally, we use a supervised machine learning approach for classification of texts (or programs, which are texts as well) based on their similarity. For evaluation of the proposed method, two corpora were developed: The first corpus is composed of 100 different programs with a total of 9,341 source codes. The second corpus consists of 34 tasks with a total of 374 codes, which are grouped by the proposed solution. Our experiments showed that for the first corpus, the best results were obtained using trigrams of terms (words) accompanied with application of latent semantic analysis, while for the second corpus, the best representation was achieved using character trigrams.
KW - Information retrieval
KW - Latent semantic analysis
KW - N-grams
KW - Natural language processing
KW - Program
KW - Similarity
KW - Source code
UR - http://www.scopus.com/inward/record.url?scp=84976892925&partnerID=8YFLogxK
U2 - 10.13053/CyS-20-2-2369
DO - 10.13053/CyS-20-2-2369
M3 - Article
AN - SCOPUS:84976892925
SN - 1405-5546
VL - 20
SP - 279
EP - 288
JO - Computacion y Sistemas
JF - Computacion y Sistemas
IS - 2
ER -