Automatic malware clustering using word embeddings and unsupervised learning

Hugo Leonardo Duarte-Garcia, Alberto Cortez-Marquez, Gabriel Sanchez-Perez, Hector Perez-Meana, Karina Toscano-Medina, Aldo Hernandez-Suarez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

Malware has become one of the major threats in cyberspace. Current mitigation efforts focus on disclosing suspicious files and omit key aspects of detection, such as category clustering. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. This is done by identifying malicious system calls (syscalls) from different binaries, which are then modelled as text so that the most relevant features can be extracted with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. The weighted syscalls are subjected to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.
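
The TF-IDF weighting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the textbook unsmoothed form, tf × log(N / df), since the abstract does not state which variant the authors use. The `tfidf` helper and the syscall traces are hypothetical and invented for illustration only. The Word2Vec and KNN stages are not shown.

```python
import math
from collections import Counter


def tfidf(traces):
    """Return one {syscall: weight} dict per trace, using tf * log(N / df).

    Assumption: textbook unsmoothed TF-IDF; the paper's exact variant
    is not given in the abstract.
    """
    n = len(traces)
    # Document frequency: the number of traces in which each syscall appears.
    df = Counter(call for trace in traces for call in set(trace))
    weights = []
    for trace in traces:
        counts = Counter(trace)
        total = len(trace)
        weights.append({
            call: (count / total) * math.log(n / df[call])
            for call, count in counts.items()
        })
    return weights


# Hypothetical syscall traces, one list per binary (not data from the paper).
traces = [
    ["open", "read", "write", "close"],
    ["open", "connect", "send", "close"],
    ["open", "read", "read", "close"],
]
weights = tfidf(traces)
```

Under this weighting, a syscall that occurs in every trace (here `open` and `close`) receives a weight of zero, while calls specific to a few traces, such as `connect`, are weighted higher. That is the sense in which TF-IDF selects the most relevant features.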

Original language: English
Title of host publication: 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728106229
State: Published - May 2019
Event: 7th International Workshop on Biometrics and Forensics, IWBF 2019 - Cancun, Mexico
Duration: 2 May 2019 to 3 May 2019

Publication series

Name: 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019

Conference

Conference: 7th International Workshop on Biometrics and Forensics, IWBF 2019
Country/Territory: Mexico
City: Cancun
Period: 2/05/19 to 3/05/19
