Automatic malware clustering using word embeddings and unsupervised learning

Hugo Leonardo Duarte-Garcia, Alberto Cortez-Marquez, Gabriel Sanchez-Perez, Hector Perez-Meana, Karina Toscano-Medina, Aldo Hernandez-Suarez

Research output: Chapter in book/report/conference proceeding, Conference contribution, peer-reviewed

5 Citations (Scopus)

Abstract

Malware has become one of the major threats in cyberspace. Current mitigation efforts focus on disclosing suspicious files and omit key aspects of detection, such as clustering by category. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. The methodology identifies malicious system calls (syscalls) from different binaries and models them as text in order to extract the most relevant features with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed with Word2Vec, a well-known word embedding algorithm. The weighted syscalls are passed to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.
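The pipeline described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the TF-IDF variant, the embedding dimension, or how KNN is applied, so everything below is an assumption. The syscall traces are invented, the 2-dimensional `embeddings` are hand-made placeholders standing in for Word2Vec output, and the helper names (`tf_idf`, `weighted_vector`, `nearest_neighbor`) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical syscall traces, one "document" per binary (not from the paper).
traces = {
    "sample_a": ["open", "read", "write", "close"],
    "sample_b": ["open", "read", "read", "close"],
    "sample_c": ["socket", "connect", "send", "close"],
}

def tf_idf(traces):
    """Return {sample: {syscall: weight}} with weight = tf * log(N / df)."""
    n_docs = len(traces)
    df = Counter()
    for calls in traces.values():
        df.update(set(calls))  # count each syscall once per trace
    weights = {}
    for name, calls in traces.items():
        counts = Counter(calls)
        weights[name] = {
            call: (count / len(calls)) * math.log(n_docs / df[call])
            for call, count in counts.items()
        }
    return weights

# Placeholder embeddings (made up); a trained Word2Vec model would supply these.
embeddings = {
    "open": (1.0, 0.0), "read": (0.9, 0.1), "write": (0.8, 0.2),
    "close": (0.5, 0.5), "socket": (0.0, 1.0), "connect": (0.1, 0.9),
    "send": (0.2, 0.8),
}

def weighted_vector(call_weights, embeddings):
    """Sum the syscall embeddings of one trace, scaled by their TF-IDF weights."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for call, weight in call_weights.items():
        for i, x in enumerate(embeddings[call]):
            vec[i] += weight * x
    return vec

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbor(name, vectors):
    """Return the other sample whose vector is most similar to `name`'s."""
    others = (other for other in vectors if other != name)
    return max(others, key=lambda other: cosine(vectors[name], vectors[other]))

weights = tf_idf(traces)
vectors = {name: weighted_vector(w, embeddings) for name, w in weights.items()}
```

In this toy data, `close` appears in every trace and therefore receives a weight of zero, so it does not influence the sample vectors. The nearest-neighbour lookup only hints at how samples with similar syscall profiles could end up grouped together; a real experiment would train the embeddings on actual traces and use a library implementation of the neighbour search.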

Original language: English
Title of host publication: 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (electronic): 9781728106229
DOI
Status: Published - May 2019
Event: 7th International Workshop on Biometrics and Forensics, IWBF 2019 - Cancun, Mexico
Duration: 2 May 2019 to 3 May 2019

Publication series

Name: 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019

Conference

Conference: 7th International Workshop on Biometrics and Forensics, IWBF 2019
Country/Territory: Mexico
City: Cancun
Period: 2/05/19 to 3/05/19
