A Semi-supervised learning methodology for malware categorization using weighted word embeddings

Hugo Leonardo Duarte-Garcia, Carlos Domenick Morales-Medina, Aldo Hernandez-Suarez, Gabriel Sanchez-Perez, Karina Toscano-Medina, Hector Perez-Meana, Victor Sanchez, Ana Lucila Sandoval Orozco

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

9 Scopus citations

Abstract

Driven by the rapid growth in the number of malicious actors, malware is being crafted, distributed, and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are now being replaced with machine learning (ML)-based solutions. However, some challenges remain. First, supervised approaches use anti-virus tags to create hand-crafted datasets, which results in the lack of a consistent taxonomy and in uncertainty about whether a given observation carries a proper label. Second, off-line and feed-forward approaches may require complex and time-consuming feature extraction. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns in an n-dimensional weighted word embedding vector (WEV) space. Results show that by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with fewer resources.
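
The general idea of a weighted word embedding followed by clustering can be illustrated with a minimal sketch. The abstract does not state the weighting scheme, the clustering algorithm, or the input tokens, so everything below is an assumption made for illustration: tokens are Windows API call names, weights are smoothed TF-IDF values, clustering is plain k-means, and the two-dimensional `EMBEDDINGS` table is a hand-made stand-in for trained Word2vec vectors. It is not the authors' implementation.

```python
import math
from collections import Counter

# Hypothetical toy embeddings standing in for trained Word2vec vectors.
EMBEDDINGS = {
    "CreateFileW": [1.0, 0.0],
    "WriteFile": [0.9, 0.1],
    "RegSetValueExW": [0.0, 1.0],
    "RegOpenKeyExW": [0.1, 0.9],
    "Sleep": [0.5, 0.5],
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token (assumed weighting)."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in doc_freq.items()}


def weighted_embedding(doc, idf):
    """TF-IDF-weighted average of the token vectors of one sample."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(doc).items():
        if tok not in EMBEDDINGS:
            continue  # skip tokens without a vector
        w = count * idf[tok]
        for i, x in enumerate(EMBEDDINGS[tok]):
            total[i] += w * x
        weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total


def kmeans(points, k, iters=10):
    """Lloyd's k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical API-call traces: two file-oriented, two registry-oriented.
samples = [
    ["CreateFileW", "WriteFile", "WriteFile", "Sleep"],
    ["RegOpenKeyExW", "RegSetValueExW", "Sleep"],
    ["CreateFileW", "WriteFile", "Sleep"],
    ["RegOpenKeyExW", "RegSetValueExW", "RegSetValueExW"],
]
idf = idf_weights(samples)
vectors = [weighted_embedding(s, idf) for s in samples]
labels = kmeans(vectors, k=2)  # the abstract reports four families; k=2 fits the toy data
print(labels)
```

On this toy data the file-oriented and registry-oriented traces fall into separate clusters. With real data, the embedding table would come from a trained model and `k` would be set to the number of families sought.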

Original language: English
Title of host publication: Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 238-246
Number of pages: 9
ISBN (Electronic): 9781728130262
State: Published - Jun 2019
Event: 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019 - Stockholm, Sweden
Duration: 17 Jun 2019 to 19 Jun 2019

Publication series

Name: Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019

Conference

Conference: 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019
Country/Territory: Sweden
City: Stockholm
Period: 17/06/19 to 19/06/19

Keywords

  • Clustering
  • Machine-learning
  • Malware
  • Windows-Api
  • Word2vec
