TY - JOUR
T1 - Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets
AU - Gonzalez-Cuautle, David
AU - Hernandez-Suarez, Aldo
AU - Sanchez-Perez, Gabriel
AU - Toscano-Medina, Linda Karina
AU - Portillo-Portillo, Jose
AU - Olivares-Mercado, Jesus
AU - Perez-Meana, Hector Manuel
AU - Sandoval-Orozco, Ana Lucila
N1 - Publisher Copyright: © 2020 by the authors.
PY - 2020/2/1
Y1 - 2020/2/1
AB - Presently, security is a hot research topic due to the impact in daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks employ irregular amounts of data since the number of instances that represent one or several malicious samples can significantly vary. In highly unbalanced data, classification models regularly have high precision with respect to the majority class, while minority classes are considered noise due to the lack of information that they provide. Well-known datasets used for malware-based analyses like botnet attacks and Intrusion Detection Systems (IDS) mainly comprise logs, records, or network-traffic captures that do not provide an ideal source of evidence as a result of obtaining raw data. As an example, the numbers of abnormal and constant connections generated by either botnets or intruders within a network are considerably smaller than those from benign applications. In most cases, inadequate dataset design may lead to the downgrade of a learning algorithm, resulting in overfitting and poor classification rates. To address these problems, we propose a resampling method, the Synthetic Minority Oversampling Technique (SMOTE) with a grid-search algorithm optimization procedure. This work demonstrates classification-result improvements for botnet and IDS datasets by merging synthetically generated balanced data and tuning different supervised-learning algorithms.
KW - Botnet detection
KW - Datasets
KW - Imbalanced data
KW - Machine learning
KW - Predictive models
KW - Synthetic minority oversampling technique
UR - http://www.scopus.com/inward/record.url?scp=85081623589&partnerID=8YFLogxK
U2 - 10.3390/app10030794
DO - 10.3390/app10030794
M3 - Article
SN - 2076-3417
VL - 10
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 3
M1 - 794
ER -