TY - GEN
T1 - Automatic malware clustering using word embeddings and unsupervised learning
AU - Duarte-Garcia, Hugo Leonardo
AU - Cortez-Marquez, Alberto
AU - Sanchez-Perez, Gabriel
AU - Perez-Meana, Hector
AU - Toscano-Medina, Karina
AU - Hernandez-Suarez, Aldo
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
N2 - Malware has become established as one of the major threats in cyberspace. Current mitigation efforts focus on disclosing suspicious files, omitting key aspects of detection such as category clustering. While the state of the art provides significant advances in machine learning-based malware classification, most works solve binary classification problems. In this article, a methodology for automatic clustering of malware using NLP and unsupervised learning techniques is proposed. This is done by identifying malicious system calls (syscalls) from different binaries, which are then modelled textually so that the most relevant features can be extracted with a statistical technique named TF-IDF. Then, a semantic and contextual representation of each syscall is computed by Word2Vec, a well-known word embedding algorithm. The weighted syscalls are subjected to the KNN algorithm to find latent malware categories. A case study shows that it is possible to cluster at least 60 new malware categories.
UR - http://www.scopus.com/inward/record.url?scp=85068469467&partnerID=8YFLogxK
U2 - 10.1109/IWBF.2019.8739186
DO - 10.1109/IWBF.2019.8739186
M3 - Conference contribution
AN - SCOPUS:85068469467
T3 - 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019
BT - 2019 7th International Workshop on Biometrics and Forensics, IWBF 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th International Workshop on Biometrics and Forensics, IWBF 2019
Y2 - 2 May 2019 through 3 May 2019
ER -