TY - GEN
T1 - A Hybrid Methodology Based on CRISP-DM and TDSP for the Execution of Preprocessing Tasks in Mexican Environmental Laws
AU - Díaz Álvarez, Yessenia
AU - Hidalgo Reyes, Miguel Ángel
AU - Lagunes Barradas, Virginia
AU - Pichardo Lagunas, Obdulia
AU - Martínez Seis, Bella
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - This article has two aims. First, it presents techniques applied to the preprocessing of texts drawn from the environmental laws of Mexico. This kind of analysis is needed for several reasons: the large number of existing legislative documents (laws, programs, regulations, etc.), the modifications made to the legal system through reforms and decrees, and, above all, the contradictions that may arise within or between laws. Second, certain tasks of the CRISP-DM methodology were selected, specifically the generic tasks of selection, cleaning, transformation, and formatting in the data preparation phase. These tasks were carried out with the NLTK library using the text preprocessing techniques of tokenization, segmentation, denoising, and normalization. The main result is a hybrid methodology that combines CRISP-DM with Microsoft's Team Data Science Process (TDSP), oriented to the preprocessing of Mexican federal environmental laws. In addition, the article presents a detailed application of the hybrid methodology to a specialized task: extracting text from a PDF file using the PyPDF2 and pdfplumber libraries.
KW - Environmental laws
KW - Methodologies
KW - NLTK
KW - Preprocessing
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=85142801722&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-19496-2_6
DO - 10.1007/978-3-031-19496-2_6
M3 - Conference contribution
AN - SCOPUS:85142801722
SN - 9783031194955
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 68
EP - 82
BT - Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings
A2 - Pichardo Lagunas, Obdulia
A2 - Martínez Seis, Bella
A2 - Martínez-Miranda, Juan
PB - Springer Science and Business Media Deutschland GmbH
T2 - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022
Y2 - 24 October 2022 through 29 October 2022
ER -