Detección automática de similitud entre programas del lenguaje de programación Karel basada en técnicas de procesamiento de lenguaje natural

Grigori Sidorov; Martin Ibarra Romero; Ilia Markov; Rafael Guzman-Cabrera; Liliana Chanona-Hernández; Francisco Velásquez

doi:10.13053/CyS-20-2-2369

Translated title of the contribution: Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques

Centro de Investigación en Computación (CIC)

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

In this paper, we present a method for calculating similarity between programs (source codes). One of the applications of the task is detection of code reuse, for example, in the case of plagiarism. The Karel programming language is used for experiments. In order to determine similarity between Karel programs and/or similar software solutions, we make use of techniques from the fields of natural language processing and information retrieval. These techniques use representations of documents as vectors of features and their values. Usually, the features are n-grams of words or n-grams of characters. In addition, we consider application of the latent semantic analysis for reduction of the number of dimensions of the vector space. Finally, we use a supervised machine learning approach for classification of texts (or programs, which are texts as well) based on their similarity. For evaluation of the proposed method, two corpora were developed: The first corpus is composed of 100 different programs with a total of 9,341 source codes. The second corpus consists of 34 tasks with a total of 374 codes, which are grouped by the proposed solution. Our experiments showed that for the first corpus, the best results were obtained using trigrams of terms (words) accompanied with application of latent semantic analysis, while for the second corpus, the best representation was achieved using character trigrams.
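The vector representation described in the abstract can be illustrated with a minimal sketch. It assumes plain count vectors of character or word trigrams compared by cosine similarity; the paper's actual preprocessing, feature weighting, latent semantic analysis step, and supervised classifier are not reproduced here. The function names and the Karel-like snippets are hypothetical and do not come from the paper's corpora.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after collapsing whitespace."""
    normalized = " ".join(text.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_ngrams(text, n=3):
    """Count overlapping n-grams of whitespace-separated tokens."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy Karel-like snippets, invented for illustration only.
PROGRAM_A = "move(); move(); turnleft(); move(); turnoff();"
PROGRAM_B = "move(); move(); turnleft(); move(); move(); turnoff();"
PROGRAM_C = "while (frontIsClear()) { pickbeeper(); move(); }"

if __name__ == "__main__":
    for name, other in (("B", PROGRAM_B), ("C", PROGRAM_C)):
        char_sim = cosine_similarity(char_ngrams(PROGRAM_A), char_ngrams(other))
        word_sim = cosine_similarity(word_ngrams(PROGRAM_A), word_ngrams(other))
        print(f"A vs {name}: char trigrams {char_sim:.3f}, word trigrams {word_sim:.3f}")
```

In this sketch, the near-copy B scores higher against A than the unrelated program C under both representations. Similarity scores of this kind are the sort of values a classifier could be trained on, although the abstract does not specify how the authors build their classifier inputs.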

Translated title of the contribution	Automatic Detection of Similarity of Programs in Karel Programming Language based on Natural Language Processing Techniques
Original language	Spanish
Pages (from-to)	279-288
Number of pages	10
Journal	Computacion y Sistemas
Volume	20
Issue number	2
DOIs	https://doi.org/10.13053/CyS-20-2-2369
State	Published - 2016

Cite this

@article{296fd58987c24ac895e5bd2be1b79442,

title = "Detecci{\'o}n autom{\'a}tica de similitud entre programas del lenguaje de programaci{\'o}n Karel basada en t{\'e}cnicas de procesamiento de lenguaje natural",

abstract = "In this paper, we present a method for calculating similarity between programs (source codes). One of the applications of the task is detection of code reuse, for example, in the case of plagiarism. The Karel programming language is used for experiments. In order to determine similarity between Karel programs and/or similar software solutions, we make use of techniques from the fields of natural language processing and information retrieval. These techniques use representations of documents as vectors of features and their values. Usually, the features are n-grams of words or n-grams of characters. In addition, we consider application of the latent semantic analysis for reduction of the number of dimensions of the vector space. Finally, we use a supervised machine learning approach for classification of texts (or programs, which are texts as well) based on their similarity. For evaluation of the proposed method, two corpora were developed: The first corpus is composed of 100 different programs with a total of 9,341 source codes. The second corpus consists of 34 tasks with a total of 374 codes, which are grouped by the proposed solution. Our experiments showed that for the first corpus, the best results were obtained using trigrams of terms (words) accompanied with application of latent semantic analysis, while for the second corpus, the best representation was achieved using character trigrams.",

keywords = "Information retrieval, Latent semantic analysis, N-grams, Natural language processing, Program, Similarity, Source code",

author = "Grigori Sidorov and Romero, {Martin Ibarra} and Ilia Markov and Rafael Guzman-Cabrera and Liliana Chanona-Hern{\'a}ndez and Vel{\'a}squez, Francisco",

year = "2016",

doi = "10.13053/CyS-20-2-2369",

language = "Spanish",

volume = "20",

pages = "279--288",

journal = "Computacion y Sistemas",

issn = "1405-5546",

number = "2",

}

TY - JOUR

T1 - Detección automática de similitud entre programas del lenguaje de programación Karel basada en técnicas de procesamiento de lenguaje natural

AU - Sidorov, Grigori

AU - Romero, Martin Ibarra

AU - Markov, Ilia

AU - Guzman-Cabrera, Rafael

AU - Chanona-Hernández, Liliana

AU - Velásquez, Francisco

PY - 2016

Y1 - 2016

AB - In this paper, we present a method for calculating similarity between programs (source codes). One of the applications of the task is detection of code reuse, for example, in the case of plagiarism. The Karel programming language is used for experiments. In order to determine similarity between Karel programs and/or similar software solutions, we make use of techniques from the fields of natural language processing and information retrieval. These techniques use representations of documents as vectors of features and their values. Usually, the features are n-grams of words or n-grams of characters. In addition, we consider application of the latent semantic analysis for reduction of the number of dimensions of the vector space. Finally, we use a supervised machine learning approach for classification of texts (or programs, which are texts as well) based on their similarity. For evaluation of the proposed method, two corpora were developed: The first corpus is composed of 100 different programs with a total of 9,341 source codes. The second corpus consists of 34 tasks with a total of 374 codes, which are grouped by the proposed solution. Our experiments showed that for the first corpus, the best results were obtained using trigrams of terms (words) accompanied with application of latent semantic analysis, while for the second corpus, the best representation was achieved using character trigrams.

KW - Information retrieval

KW - Latent semantic analysis

KW - N-grams

KW - Natural language processing

KW - Program

KW - Similarity

KW - Source code

UR - http://www.scopus.com/inward/record.url?scp=84976892925&partnerID=8YFLogxK

U2 - 10.13053/CyS-20-2-2369

DO - 10.13053/CyS-20-2-2369

M3 - Article

AN - SCOPUS:84976892925

SN - 1405-5546

VL - 20

SP - 279

EP - 288

JO - Computacion y Sistemas

JF - Computacion y Sistemas

IS - 2

ER -
