Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets

David Gonzalez-Cuautle; Aldo Hernandez-Suarez; Gabriel Sanchez-Perez; Linda Karina Toscano-Medina; Jose Portillo-Portillo; Jesus Olivares-Mercado; Hector Manuel Perez-Meana; Ana Lucila Sandoval-Orozco

doi:10.3390/app10030794

Escuela Superior de Ingeniería Mecánica y Eléctrica (ESIME), Unidad Culhuacán

Research output: Contribution to journal › Article › peer-review

53 Scopus citations

Abstract

Security is currently an active research topic because of its impact on everyday information infrastructure. Machine-learning solutions have improved on classical detection practices, but detection tasks work with uneven amounts of data, since the number of instances representing one or several malicious samples can vary significantly. On highly imbalanced data, classification models regularly achieve high precision on the majority class, while minority classes are treated as noise because of how little information they provide. Well-known datasets used for malware-based analyses, such as botnet-attack and Intrusion Detection System (IDS) datasets, mainly comprise logs, records, or network-traffic captures that, being raw data, are not an ideal source of evidence. For example, the abnormal and constant connections generated by botnets or intruders within a network are far fewer than those generated by benign applications. In most cases, inadequate dataset design may degrade a learning algorithm, resulting in overfitting and poor classification rates. To address these problems, we propose a resampling method, the Synthetic Minority Oversampling Technique (SMOTE), combined with a grid-search optimization procedure. This work demonstrates improved classification results on botnet and IDS datasets by merging in synthetically generated data to balance the classes and by tuning several supervised-learning algorithms.
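
To make the resampling idea in the abstract concrete, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation, which this record does not describe. The function names `smote` and `balance`, the parameters `k` and `seed`, and the toy data are all illustrative assumptions, and the sketch assumes a binary problem with purely numeric features. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its `k` nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (one class only).
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_synthetic):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label, k=5, seed=0):
    """Oversample minority_label until it is as large as the other class.

    Assumes a binary problem (hypothetical helper, for illustration only).
    """
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(X) - len(minority)) - len(minority)
    new_points = smote(minority, max(n_needed, 0), k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Toy data: six "benign" samples (label 0) and two "malicious" ones (label 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.2, 0.2],
     [5.0, 5.0], [5.5, 5.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X, y, minority_label=1, k=1)
```

In a pipeline like the one the abstract outlines, the balanced data would then be passed to a classifier whose hyperparameters are tuned by grid search. Oversampling is usually applied to the training split only, so that synthetic points do not leak into the evaluation data.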

Original language	English
Article number	794
Journal	Applied Sciences (Switzerland)
Volume	10
Issue number	3
DOIs	https://doi.org/10.3390/app10030794
State	Published - 1 Feb 2020

Keywords

Botnet detection
Datasets
Imbalanced data
Machine learning
Predictive models
Synthetic minority oversampling technique

Access to Document

10.3390/app10030794

Cite this

Gonzalez-Cuautle, D., Hernandez-Suarez, A., Sanchez-Perez, G., Toscano-Medina, L. K., Portillo-Portillo, J., Olivares-Mercado, J., Perez-Meana, H. M., & Sandoval-Orozco, A. L. (2020). Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets. Applied Sciences (Switzerland), 10(3), Article 794. https://doi.org/10.3390/app10030794

@article{eb97ec9b214e4646a8f5bba80f1166f0,
title = "Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets",
abstract = "Presently, security is a hot research topic due to the impact in daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks employ irregular amounts of data since the number of instances that represent one or several malicious samples can significantly vary. In highly unbalanced data, classification models regularly have high precision with respect to the majority class, while minority classes are considered noise due to the lack of information that they provide. Well-known datasets used for malware-based analyses like botnet attacks and Intrusion Detection Systems (IDS) mainly comprise logs, records, or network-traffic captures that do not provide an ideal source of evidence as a result of obtaining raw data. As an example, the numbers of abnormal and constant connections generated by either botnets or intruders within a network are considerably smaller than those from benign applications. In most cases, inadequate dataset design may lead to the downgrade of a learning algorithm, resulting in overfitting and poor classification rates. To address these problems, we propose a resampling method, the Synthetic Minority Oversampling Technique (SMOTE) with a grid-search algorithm optimization procedure. This work demonstrates classification-result improvements for botnet and IDS datasets by merging synthetically generated balanced data and tuning different supervised-learning algorithms.",
keywords = "Botnet detection, Datasets, Imbalanced data, Machine learning, Predictive models, Synthetic minority oversampling technique",
author = "David Gonzalez-Cuautle and Aldo Hernandez-Suarez and Gabriel Sanchez-Perez and Toscano-Medina, {Linda Karina} and Jose Portillo-Portillo and Jesus Olivares-Mercado and Perez-Meana, {Hector Manuel} and Sandoval-Orozco, {Ana Lucila}",
note = "Publisher Copyright: {\textcopyright} 2020 by the authors.",
year = "2020",
month = feb,
day = "1",
doi = "10.3390/app10030794",
language = "English",
volume = "10",
journal = "Applied Sciences (Switzerland)",
issn = "2076-3417",
number = "3",
}

Gonzalez-Cuautle, D, Hernandez-Suarez, A, Sanchez-Perez, G, Toscano-Medina, LK, Portillo-Portillo, J, Olivares-Mercado, J, Perez-Meana, HM & Sandoval-Orozco, AL 2020, 'Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets', Applied Sciences (Switzerland), vol. 10, no. 3, 794. https://doi.org/10.3390/app10030794

TY - JOUR
T1 - Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets
AU - Gonzalez-Cuautle, David
AU - Hernandez-Suarez, Aldo
AU - Sanchez-Perez, Gabriel
AU - Toscano-Medina, Linda Karina
AU - Portillo-Portillo, Jose
AU - Olivares-Mercado, Jesus
AU - Perez-Meana, Hector Manuel
AU - Sandoval-Orozco, Ana Lucila
PY - 2020/2/1
Y1 - 2020/2/1
AB - Presently, security is a hot research topic due to the impact in daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks employ irregular amounts of data since the number of instances that represent one or several malicious samples can significantly vary. In highly unbalanced data, classification models regularly have high precision with respect to the majority class, while minority classes are considered noise due to the lack of information that they provide. Well-known datasets used for malware-based analyses like botnet attacks and Intrusion Detection Systems (IDS) mainly comprise logs, records, or network-traffic captures that do not provide an ideal source of evidence as a result of obtaining raw data. As an example, the numbers of abnormal and constant connections generated by either botnets or intruders within a network are considerably smaller than those from benign applications. In most cases, inadequate dataset design may lead to the downgrade of a learning algorithm, resulting in overfitting and poor classification rates. To address these problems, we propose a resampling method, the Synthetic Minority Oversampling Technique (SMOTE) with a grid-search algorithm optimization procedure. This work demonstrates classification-result improvements for botnet and IDS datasets by merging synthetically generated balanced data and tuning different supervised-learning algorithms.

KW - Botnet detection
KW - Datasets
KW - Imbalanced data
KW - Machine learning
KW - Predictive models
KW - Synthetic minority oversampling technique
UR - http://www.scopus.com/inward/record.url?scp=85081623589&partnerID=8YFLogxK
U2 - 10.3390/app10030794
DO - 10.3390/app10030794
M3 - Article
SN - 2076-3417
VL - 10
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 3
M1 - 794
ER -
