Automatic malware clustering using word embeddings and unsupervised learning

Hugo Leonardo Duarte-Garcia; Alberto Cortez-Marquez; Gabriel Sanchez-Perez; Hector Perez-Meana; Karina Toscano-Medina; Aldo Hernandez-Suarez

doi:10.1109/IWBF.2019.8739186

Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Culhuacán

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Citations (Scopus)

Abstract

Malware has become one of the major threats in cyberspace. Current mitigation efforts focus on the disclosure of suspicious files and omit key aspects of detection, such as clustering by category. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. Malicious system calls (syscalls) are first identified in different binaries and then modelled as text, so that the most relevant features can be extracted with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. The weighted syscalls are passed to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.
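
The TF-IDF weighting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant the authors use, so this assumes the common form tf × log(N / df). The `tfidf` helper and the syscall traces are hypothetical and serve only as illustration. The Word2Vec and clustering stages are not shown.

```python
import math
from collections import Counter


def tfidf(traces):
    """Return one {syscall: weight} dict per trace, using tf * log(N / df).

    Assumption: this is the textbook TF-IDF variant, not necessarily
    the exact formulation used in the paper.
    """
    n_docs = len(traces)
    # Document frequency: number of traces in which each syscall appears.
    df = Counter()
    for trace in traces:
        df.update(set(trace))
    weights = []
    for trace in traces:
        counts = Counter(trace)
        total = len(trace)
        weights.append({
            call: (count / total) * math.log(n_docs / df[call])
            for call, count in counts.items()
        })
    return weights


# Hypothetical syscall traces, one list per binary (illustrative only).
traces = [
    ["open", "read", "write", "close"],
    ["open", "connect", "send", "close"],
    ["open", "read", "read", "close"],
]
weights = tfidf(traces)
```

Under this variant, a syscall that occurs in every trace (here `open` and `close`) receives a weight of zero, while calls specific to a few traces (such as `connect`) are weighted up.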

Original language	English
Title of host publication	2019 7th International Workshop on Biometrics and Forensics, IWBF 2019
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	9781728106229
DOI	https://doi.org/10.1109/IWBF.2019.8739186
Publication status	Published - May 2019
Event	7th International Workshop on Biometrics and Forensics, IWBF 2019 - Cancun, Mexico. Duration: 2 May 2019 → 3 May 2019

Publication series

Name	2019 7th International Workshop on Biometrics and Forensics, IWBF 2019

Conference

Conference	7th International Workshop on Biometrics and Forensics, IWBF 2019
Country/Territory	Mexico
City	Cancun
Period	2/05/19 → 3/05/19

Access to document

10.1109/IWBF.2019.8739186

Other files and links

Link to publication in Scopus

Cite this

Duarte-Garcia, H. L., Cortez-Marquez, A., Sanchez-Perez, G., Perez-Meana, H., Toscano-Medina, K., & Hernandez-Suarez, A. (2019). Automatic malware clustering using word embeddings and unsupervised learning. In 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019 Article 8739186 (2019 7th International Workshop on Biometrics and Forensics, IWBF 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IWBF.2019.8739186

Duarte-Garcia, Hugo Leonardo ; Cortez-Marquez, Alberto ; Sanchez-Perez, Gabriel et al. / Automatic malware clustering using word embeddings and unsupervised learning. 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019. Institute of Electrical and Electronics Engineers Inc., 2019. (2019 7th International Workshop on Biometrics and Forensics, IWBF 2019).

@inproceedings{7443b03f8a8e414b87e9b7d89d028dbb,

title = "Automatic malware clustering using word embeddings and unsupervised learning",

abstract = "Malware has become one of the major threats in cyberspace. Current mitigation efforts focus on the disclosure of suspicious files and omit key aspects of detection, such as clustering by category. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. Malicious system calls (syscalls) are first identified in different binaries and then modelled as text, so that the most relevant features can be extracted with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. The weighted syscalls are passed to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.",

author = "Duarte-Garcia, {Hugo Leonardo} and Alberto Cortez-Marquez and Gabriel Sanchez-Perez and Hector Perez-Meana and Karina Toscano-Medina and Aldo Hernandez-Suarez",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 7th International Workshop on Biometrics and Forensics, IWBF 2019 ; Conference date: 02-05-2019 Through 03-05-2019",

year = "2019",

month = may,

doi = "10.1109/IWBF.2019.8739186",

language = "English",

series = "2019 7th International Workshop on Biometrics and Forensics, IWBF 2019",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "2019 7th International Workshop on Biometrics and Forensics, IWBF 2019",

address = "United States",

}

Duarte-Garcia, HL, Cortez-Marquez, A, Sanchez-Perez, G, Perez-Meana, H, Toscano-Medina, K & Hernandez-Suarez, A 2019, Automatic malware clustering using word embeddings and unsupervised learning. In 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019., 8739186, 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019, Institute of Electrical and Electronics Engineers Inc., 7th International Workshop on Biometrics and Forensics, IWBF 2019, Cancun, Mexico, 2/05/19. https://doi.org/10.1109/IWBF.2019.8739186

Automatic malware clustering using word embeddings and unsupervised learning. / Duarte-Garcia, Hugo Leonardo; Cortez-Marquez, Alberto; Sanchez-Perez, Gabriel et al.
2019 7th International Workshop on Biometrics and Forensics, IWBF 2019. Institute of Electrical and Electronics Engineers Inc., 2019. 8739186 (2019 7th International Workshop on Biometrics and Forensics, IWBF 2019).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - Automatic malware clustering using word embeddings and unsupervised learning

AU - Duarte-Garcia, Hugo Leonardo

AU - Cortez-Marquez, Alberto

AU - Sanchez-Perez, Gabriel

AU - Perez-Meana, Hector

AU - Toscano-Medina, Karina

AU - Hernandez-Suarez, Aldo

PY - 2019/5

Y1 - 2019/5

N2 - Malware has become one of the major threats in cyberspace. Current mitigation efforts focus on the disclosure of suspicious files and omit key aspects of detection, such as clustering by category. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. Malicious system calls (syscalls) are first identified in different binaries and then modelled as text, so that the most relevant features can be extracted with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. The weighted syscalls are passed to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.

AB - Malware has become one of the major threats in cyberspace. Current mitigation efforts focus on the disclosure of suspicious files and omit key aspects of detection, such as clustering by category. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. Malicious system calls (syscalls) are first identified in different binaries and then modelled as text, so that the most relevant features can be extracted with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. The weighted syscalls are passed to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.

UR - http://www.scopus.com/inward/record.url?scp=85068469467&partnerID=8YFLogxK

U2 - 10.1109/IWBF.2019.8739186

DO - 10.1109/IWBF.2019.8739186

M3 - Conference contribution

AN - SCOPUS:85068469467

T3 - 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019

BT - 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 7th International Workshop on Biometrics and Forensics, IWBF 2019

Y2 - 2 May 2019 through 3 May 2019

ER -

Duarte-Garcia HL, Cortez-Marquez A, Sanchez-Perez G, Perez-Meana H, Toscano-Medina K, Hernandez-Suarez A. Automatic malware clustering using word embeddings and unsupervised learning. In 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019. Institute of Electrical and Electronics Engineers Inc. 2019. 8739186. (2019 7th International Workshop on Biometrics and Forensics, IWBF 2019). doi: 10.1109/IWBF.2019.8739186
