A Semi-supervised learning methodology for malware categorization using weighted word embeddings

Hugo Leonardo Duarte-Garcia; Carlos Domenick Morales-Medina; Aldo Hernandez-Suarez; Gabriel Sanchez-Perez; Karina Toscano-Medina; Hector Perez-Meana; Victor Sanchez; Ana Lucila Sandoval Orozco

doi:10.1109/EuroSPW.2019.00033


Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Culhuacán

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Citations (Scopus)

Abstract

Due to the vertiginous growth of malicious actors, malware has been crafted, distributed and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are now being replaced with machine learning-based (ML) solutions. However, some challenges are still present. Firstly, supervised approaches use anti-virus tags to create hand-crafted datasets, resulting in a lack of taxonomy and uncertainty if a given observation is classified with a proper label. Secondly, off-line and feed-forward approaches may result in complex and time consuming feature extraction tasks. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns into an n-dimensional weighted word embedding vector (WEV) space. Results prove that by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with less resources.
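
The abstract does not spell out the weighting scheme or the clustering algorithm, so the following is only a minimal sketch of the general idea (weighted word embeddings per sample, then unsupervised clustering), not the authors' implementation. Everything in it is an assumption made for illustration: the toy API-call sequences, the hand-written 2-d vectors standing in for trained Word2vec embeddings, the TF-IDF-style weights, the plain k-means loop, and the use of two clusters instead of the four families reported in the paper.

```python
import math
from collections import Counter

# Hypothetical toy data: each sample is a sequence of Windows API call names.
samples = {
    "sample_a": ["CreateFileW", "WriteFile", "CloseHandle"],
    "sample_b": ["CreateFileW", "WriteFile", "WriteFile", "CloseHandle"],
    "sample_c": ["RegOpenKeyExW", "RegSetValueExW", "CloseHandle"],
    "sample_d": ["RegOpenKeyExW", "RegSetValueExW", "RegSetValueExW"],
}

# Hypothetical 2-d vectors standing in for trained Word2vec embeddings.
embeddings = {
    "CreateFileW": (1.0, 0.0),
    "WriteFile": (0.9, 0.1),
    "RegOpenKeyExW": (0.0, 1.0),
    "RegSetValueExW": (0.1, 0.9),
    "CloseHandle": (0.5, 0.5),
}


def idf_weights(docs):
    """Smoothed inverse document frequency for every token (assumed weighting)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in df}


def weighted_embedding(doc, idf):
    """Average the token vectors of one sample, weighted by count * IDF."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(doc).items():
        w = count * idf[tok]
        weight_sum += w
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
    return tuple(t / weight_sum for t in total)


def kmeans(points, centroids, iterations=10):
    """Plain k-means with fixed initial centroids; returns one label per point."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


docs = list(samples.values())
idf = idf_weights(docs)
points = [weighted_embedding(doc, idf) for doc in docs]
# Seed the two clusters with the first and third samples (arbitrary choice).
labels = kmeans(points, [points[0], points[2]])

for name, label in zip(samples, labels):
    print(name, "-> cluster", label)
```

In this toy setup the file-oriented samples and the registry-oriented samples end up in different clusters; a realistic pipeline would train the embeddings on real API-call traces and choose the number of clusters from the data.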

Original language	English
Title of host publication	Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	238-246
Number of pages	9
ISBN (Electronic)	9781728130262
DOI	https://doi.org/10.1109/EuroSPW.2019.00033
State	Published - Jun 2019
Event	4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019 - Stockholm, Sweden Duration: 17 Jun 2019 → 19 Jun 2019

Publication series

Name	Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019

Conference

Conference	4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019
Country/Territory	Sweden
City	Stockholm
Period	17/06/19 → 19/06/19

Access to document

10.1109/EuroSPW.2019.00033

Cite this

Duarte-Garcia, H. L., Morales-Medina, C. D., Hernandez-Suarez, A., Sanchez-Perez, G., Toscano-Medina, K., Perez-Meana, H., Sanchez, V., & Sandoval Orozco, A. L. (2019). A Semi-supervised learning methodology for malware categorization using weighted word embeddings. In Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019 (pp. 238-246). Article 8802412 (Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/EuroSPW.2019.00033

Duarte-Garcia, Hugo Leonardo ; Morales-Medina, Carlos Domenick ; Hernandez-Suarez, Aldo et al. / A Semi-supervised learning methodology for malware categorization using weighted word embeddings. Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 238-246 (Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019).

@inproceedings{947ae35a1236452e8389411ab342b241,

title = "A Semi-supervised learning methodology for malware categorization using weighted word embeddings",

abstract = "Due to the vertiginous growth of malicious actors, malware has been crafted, distributed and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are now being replaced with machine learning-based (ML) solutions. However, some challenges are still present. Firstly, supervised approaches use anti-virus tags to create hand-crafted datasets, resulting in a lack of taxonomy and uncertainty if a given observation is classified with a proper label. Secondly, off-line and feed-forward approaches may result in complex and time consuming feature extraction tasks. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns into an n-dimensional weighted word embedding vector (WEV) space. Results prove that by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with less resources.",

keywords = "Clustering, Machine-learning, Malware, Windows-Api, Word2vec",

author = "Duarte-Garcia, {Hugo Leonardo} and Morales-Medina, {Carlos Domenick} and Aldo Hernandez-Suarez and Gabriel Sanchez-Perez and Karina Toscano-Medina and Hector Perez-Meana and Victor Sanchez and {Sandoval Orozco}, {Ana Lucila}",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019 ; Conference date: 17-06-2019 Through 19-06-2019",

year = "2019",

month = jun,

doi = "10.1109/EuroSPW.2019.00033",

language = "English",

series = "Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "238--246",

booktitle = "Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019",

address = "United States",

}

Duarte-Garcia, HL, Morales-Medina, CD, Hernandez-Suarez, A, Sanchez-Perez, G, Toscano-Medina, K, Perez-Meana, H, Sanchez, V & Sandoval Orozco, AL 2019, A Semi-supervised learning methodology for malware categorization using weighted word embeddings. in Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019., 8802412, Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019, Institute of Electrical and Electronics Engineers Inc., pp. 238-246, 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019, Stockholm, Sweden, 17/06/19. https://doi.org/10.1109/EuroSPW.2019.00033

A Semi-supervised learning methodology for malware categorization using weighted word embeddings. / Duarte-Garcia, Hugo Leonardo; Morales-Medina, Carlos Domenick; Hernandez-Suarez, Aldo et al.
Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019. Institute of Electrical and Electronics Engineers Inc., 2019. p. 238-246 8802412 (Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019).

TY - GEN

T1 - A Semi-supervised learning methodology for malware categorization using weighted word embeddings

AU - Duarte-Garcia, Hugo Leonardo

AU - Morales-Medina, Carlos Domenick

AU - Hernandez-Suarez, Aldo

AU - Sanchez-Perez, Gabriel

AU - Toscano-Medina, Karina

AU - Perez-Meana, Hector

AU - Sanchez, Victor

AU - Sandoval Orozco, Ana Lucila

PY - 2019/6

Y1 - 2019/6

N2 - Due to the vertiginous growth of malicious actors, malware has been crafted, distributed and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are now being replaced with machine learning-based (ML) solutions. However, some challenges are still present. Firstly, supervised approaches use anti-virus tags to create hand-crafted datasets, resulting in a lack of taxonomy and uncertainty if a given observation is classified with a proper label. Secondly, off-line and feed-forward approaches may result in complex and time consuming feature extraction tasks. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns into an n-dimensional weighted word embedding vector (WEV) space. Results prove that by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with less resources.

AB - Due to the vertiginous growth of malicious actors, malware has been crafted, distributed and propagated around the world with new and sophisticated techniques. Classical malware detection procedures, mostly based on signatures and heuristic searches, are now being replaced with machine learning-based (ML) solutions. However, some challenges are still present. Firstly, supervised approaches use anti-virus tags to create hand-crafted datasets, resulting in a lack of taxonomy and uncertainty if a given observation is classified with a proper label. Secondly, off-line and feed-forward approaches may result in complex and time consuming feature extraction tasks. In this work, we propose a novel method that reinforces malware characterization by capturing rich relevance and contextual patterns into an n-dimensional weighted word embedding vector (WEV) space. Results prove that by clustering similar WEVs via unsupervised learning, malware can be categorized into four major families, improving detection with less resources.

KW - Clustering

KW - Machine-learning

KW - Malware

KW - Windows-Api

KW - Word2vec

UR - http://www.scopus.com/inward/record.url?scp=85071907670&partnerID=8YFLogxK

U2 - 10.1109/EuroSPW.2019.00033

DO - 10.1109/EuroSPW.2019.00033

M3 - Conference contribution

AN - SCOPUS:85071907670

T3 - Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019

SP - 238

EP - 246

BT - Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019

Y2 - 17 June 2019 through 19 June 2019

ER -

Duarte-Garcia HL, Morales-Medina CD, Hernandez-Suarez A, Sanchez-Perez G, Toscano-Medina K, Perez-Meana H et al. A Semi-supervised learning methodology for malware categorization using weighted word embeddings. In Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019. Institute of Electrical and Electronics Engineers Inc. 2019. p. 238-246. 8802412. (Proceedings - 4th IEEE European Symposium on Security and Privacy Workshops, EUROS and PW 2019). doi: 10.1109/EuroSPW.2019.00033
