Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques

Grigori Sidorov, Martin Ibarra Romero, Ilia Markov, Rafael Guzman-Cabrera, Liliana Chanona-Hernández, Francisco Velásquez

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

In this paper, we present a method for calculating the similarity between programs (source codes). One application of this task is the detection of code reuse, for example, in cases of plagiarism. The Karel programming language is used for the experiments. To determine the similarity between Karel programs and/or similar software solutions, we use techniques from natural language processing and information retrieval. These techniques represent documents as vectors of features and their values. Usually, the features are word n-grams or character n-grams. In addition, we consider applying latent semantic analysis to reduce the number of dimensions of the vector space. Finally, we use a supervised machine learning approach to classify texts (or programs, which are texts as well) based on their similarity. To evaluate the proposed method, two corpora were developed. The first corpus is composed of 100 different programs with a total of 9,341 source codes. The second corpus consists of 34 tasks with a total of 374 codes, which are grouped by the proposed solution. Our experiments showed that for the first corpus, the best results were obtained using trigrams of terms (words) combined with latent semantic analysis, while for the second corpus, the best results were obtained using character trigrams.
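The vector-space representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the whitespace normalization, the use of raw counts with cosine similarity, and the toy Karel-like snippets are all assumptions made for illustration, and the latent semantic analysis and supervised classification steps are omitted.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_ngrams(text, n=3):
    """Count overlapping n-grams of terms: identifiers, numbers, and punctuation marks."""
    # Hypothetical tokenizer; hyphens are kept inside identifiers such as front-is-clear.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_-]*|\d+|\S", text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy Karel-like snippets, invented for this example only.
    prog_a = "move(); move(); turnleft(); move(); turnoff();"
    prog_b = "move(); move(); turnleft(); move(); move(); turnoff();"
    prog_c = "while (frontIsClear) { move(); } turnoff();"

    for name, other in (("b", prog_b), ("c", prog_c)):
        char_sim = cosine(char_ngrams(prog_a), char_ngrams(other))
        word_sim = cosine(word_ngrams(prog_a), word_ngrams(other))
        print(f"a vs {name}: char trigrams {char_sim:.3f}, word trigrams {word_sim:.3f}")
```

In a fuller pipeline, such vectors would typically be stacked into a document-term matrix, reduced with latent semantic analysis, and passed to a classifier; which feature type works better depends on the corpus, as the reported results suggest.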
Original language: American English
Pages (from-to): 279-288
Number of pages: 10
Journal: Computacion y Sistemas
DOIs: 10.13053/CyS-20-2-2369
State: Published - 1 Jan 2016
Externally published: Yes

Fingerprint

Computer programming languages
Processing
Semantics
Vector spaces
Information retrieval
Learning systems
Experiments
experiment
software
code
automatic detection
programme
analysis
method

Cite this

Sidorov, Grigori ; Romero, Martin Ibarra ; Markov, Ilia ; Guzman-Cabrera, Rafael ; Chanona-Hernández, Liliana ; Velásquez, Francisco. / Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques. In: Computacion y Sistemas. 2016 ; pp. 279-288.
@article{ce81321d510c45e0a5f66a66bad9f759,
title = "Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques",
author = "Grigori Sidorov and Romero, {Martin Ibarra} and Ilia Markov and Rafael Guzman-Cabrera and Liliana Chanona-Hern{\'a}ndez and Vel{\'a}squez, Francisco",
year = "2016",
month = "1",
day = "1",
doi = "10.13053/CyS-20-2-2369",
language = "American English",
pages = "279--288",
journal = "Computacion y Sistemas",
issn = "1405-5546",
publisher = "Centro de Investigacion en Computacion (CIC) del Instituto Politecnico Nacional (IPN)",

}

TY - JOUR

T1 - Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques

AU - Sidorov, Grigori

AU - Romero, Martin Ibarra

AU - Markov, Ilia

AU - Guzman-Cabrera, Rafael

AU - Chanona-Hernández, Liliana

AU - Velásquez, Francisco

PY - 2016/1/1

Y1 - 2016/1/1

UR - https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84976892925&origin=inward

UR - https://www.scopus.com/inward/citedby.uri?partnerID=HzOxMe3b&scp=84976892925&origin=inward

U2 - 10.13053/CyS-20-2-2369

DO - 10.13053/CyS-20-2-2369

M3 - Article

SP - 279

EP - 288

JO - Computacion y Sistemas

JF - Computacion y Sistemas

SN - 1405-5546

ER -