TY - GEN
T1 - A Semi-supervised learning methodology for malware categorization using weighted word embeddings
AU - Duarte-Garcia, Hugo Leonardo
AU - Morales-Medina, Carlos Domenick
AU - Hernandez-Suarez, Aldo
AU - Sanchez-Perez, Gabriel
AU - Toscano-Medina, Karina
AU - Perez-Meana, Hector
AU - Sanchez, Victor
AU - Sandoval Orozco, Ana Lucila
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/6
Y1 - 2019/6
N2 - Owing to the rapid growth in the number of malicious actors, malware is being crafted, distributed, and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are now being replaced with machine learning (ML)-based solutions. However, some challenges remain. First, supervised approaches rely on anti-virus tags to build hand-crafted datasets, which results in a lack of taxonomy and in uncertainty about whether a given observation carries a proper label. Second, off-line and feed-forward approaches may require complex and time-consuming feature extraction tasks. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns in an n-dimensional weighted word embedding vector (WEV) space. Results show that, by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with fewer resources.
KW - Clustering
KW - Machine-learning
KW - Malware
KW - Windows-Api
KW - Word2vec
UR - http://www.scopus.com/inward/record.url?scp=85071907670&partnerID=8YFLogxK
U2 - 10.1109/EuroSPW.2019.00033
DO - 10.1109/EuroSPW.2019.00033
M3 - Conference contribution
AN - SCOPUS:85071907670
T3 - Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EuroS&PW 2019
SP - 238
EP - 246
BT - Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EuroS&PW 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th IEEE European Symposium on Security and Privacy Workshops, EuroS&PW 2019
Y2 - 17 June 2019 through 19 June 2019
ER -