TY - JOUR
T1 - Statistical Entropy Measures in C4.5 Trees
AU - Arellano, Aldo Ramirez
AU - Bory-Reyes, Juan
AU - Hernandez-Simon, Luis Manuel
N1 - Publisher Copyright:
Copyright © 2018, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
PY - 2018/1/1
Y1 - 2018/1/1
N2 - The main goal of this article is to present a statistical study of decision tree learning algorithms based on the measures of different parametric entropies. Partial empirical evidence is presented to support the conjecture that adjusting the parameters of different entropy measures might bias the classification. Here, receiver operating characteristic (ROC) curve analysis, specifically the area under the ROC curve (AURC), gives the best criterion to evaluate decision trees based on parametric entropies. The authors emphasize that the improvement of the AURC depends on the type of each dataset. The results support the hypothesis that parametric algorithms are useful for datasets with numeric and nominal, but not mixed, attributes; thus, four hybrid approaches are proposed. The hybrid algorithm based on Rényi entropy is suitable for nominal, numeric, and mixed datasets. Moreover, it requires less time because the number of nodes is reduced while the AURC is maintained or increased, making it preferable for large datasets.
AB - The main goal of this article is to present a statistical study of decision tree learning algorithms based on the measures of different parametric entropies. Partial empirical evidence is presented to support the conjecture that adjusting the parameters of different entropy measures might bias the classification. Here, receiver operating characteristic (ROC) curve analysis, specifically the area under the ROC curve (AURC), gives the best criterion to evaluate decision trees based on parametric entropies. The authors emphasize that the improvement of the AURC depends on the type of each dataset. The results support the hypothesis that parametric algorithms are useful for datasets with numeric and nominal, but not mixed, attributes; thus, four hybrid approaches are proposed. The hybrid algorithm based on Rényi entropy is suitable for nominal, numeric, and mixed datasets. Moreover, it requires less time because the number of nodes is reduced while the AURC is maintained or increased, making it preferable for large datasets.
KW - Classification
KW - Data Mining
KW - Decision Trees
KW - Entropy Measures
KW - Information Theory
UR - http://www.scopus.com/inward/record.url?scp=85042133770&partnerID=8YFLogxK
U2 - 10.4018/IJDWM.2018010101
DO - 10.4018/IJDWM.2018010101
M3 - Article
SN - 1548-3924
VL - 14
SP - 1
EP - 14
JO - International Journal of Data Warehousing and Mining
JF - International Journal of Data Warehousing and Mining
IS - 1
ER -