Unified, Labeled, and Semi-Structured Database of Pre-Processed Mexican Laws

Bella Martinez-Seis; Obdulia Pichardo-Lagunas; Harlan Koff; Miguel Equihua; Octavio Perez-Maqueo; Arturo Hernández-Huerta

doi:10.3390/data7070091

Unidad Profesional Interdisciplinaria de Ingeniería y Tecnologías Avanzadas (UPIITA)

Research output: Contribution to journal › Article › peer-review

Abstract

This paper presents a corpus of pre-processed Mexican laws for computational tasks. The main contributions are the proposed JSON structure and the methodology, with the selected algorithms, used to build the semi-structured corpus. Law PDF documents were transformed into plain text, unified by a deconstruction of law–document structure, and labeled with natural language processing techniques, including part-of-speech (PoS) tagging; entity extraction was also performed. The corpus includes the Mexican constitution and the Mexican laws that were collected from the official site in PDF format repealed before 14 October 2021. The collection has 305 documents: the Mexican constitution, 289 laws, 8 federal codes, 3 regulations, 2 statutes, 1 decree, and 1 ordinance. The semi-structured database converts the set of laws from PDF format to a digital representation that facilitates computational analysis. The documents were migrated to JSON files to represent internal hierarchical relations. In addition, basic natural language processing techniques were applied to the laws to identify parts of speech and named entities. The presented data set is mainly useful for text analysis and data science. It could be used for various legislative analysis tasks, including comprehension, interpretation, translation, classification, accessibility, coherence, and search. Finally, we present some statistics on the identified entities and an example of the usefulness of the corpus for environmental laws.
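
To illustrate what representing a law's internal hierarchy in JSON can look like, the following is a minimal sketch, not the authors' schema or algorithm. The key names (`type`, `label`, `text`, `children`), the heading patterns, the function `parse_law`, and the sample text are all assumptions made up for this example; the abstract does not specify any of them.

```python
import json
import re

# Hypothetical heading patterns, ordered from outermost to innermost level.
# The real corpus may use different levels and detection rules.
LEVELS = [
    ("title", re.compile(r"^T[ÍI]TULO\s+(?P<label>\S+)\s*(?P<rest>.*)$")),
    ("chapter", re.compile(r"^CAP[ÍI]TULO\s+(?P<label>\S+)\s*(?P<rest>.*)$")),
    ("article", re.compile(r"^Art[íi]culo\s+(?P<label>\d+)\S*\s*(?P<rest>.*)$",
                           re.IGNORECASE)),
]


def parse_law(name, plain_text):
    """Turn the plain text of a law into a nested dict (illustrative only)."""
    root = {"type": "law", "label": name, "text": "", "children": []}
    stack = [(0, root)]  # (depth, node) pairs for the currently open path
    for raw in plain_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        for depth, (kind, pattern) in enumerate(LEVELS, start=1):
            match = pattern.match(line)
            if match:
                node = {"type": kind, "label": match.group("label"),
                        "text": match.group("rest").strip(), "children": []}
                # Close every open node at the same or a deeper level.
                while stack[-1][0] >= depth:
                    stack.pop()
                stack[-1][1]["children"].append(node)
                stack.append((depth, node))
                break
        else:
            # Not a heading: continuation text of the innermost open node.
            top = stack[-1][1]
            top["text"] = (top["text"] + " " + line).strip()
    return root


# Invented sample text, not taken from any actual law.
SAMPLE = """\
TÍTULO PRIMERO
CAPÍTULO I
Artículo 1. Esta ley es de orden público.
Artículo 2. Para los efectos de esta ley
se entiende por ambiente el conjunto de elementos naturales.
CAPÍTULO II
Artículo 3. La autoridad competente vigilará su cumplimiento.
"""

law = parse_law("Ley de ejemplo", SAMPLE)
print(json.dumps(law, ensure_ascii=False, indent=2))
```

A stack of open nodes keeps the sketch independent of how many levels a given law uses: a new heading closes everything at its own depth or deeper and attaches itself to whatever remains open.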

Original language	English
Article number	91
Journal	Data
Volume	7
Issue number	7
DOI	https://doi.org/10.3390/data7070091
Status	Published - Jul 2022

Cite this

@article{cff5723d492441b99fe49784c32bb66c,

title = "Unified, Labeled, and Semi-Structured Database of Pre-Processed Mexican Laws",

abstract = "This paper presents a corpus of pre-processed Mexican laws for computational tasks. The main contributions are the proposed JSON structure and the methodology, with the selected algorithms, used to build the semi-structured corpus. Law PDF documents were transformed into plain text, unified by a deconstruction of law–document structure, and labeled with natural language processing techniques, including part-of-speech (PoS) tagging; entity extraction was also performed. The corpus includes the Mexican constitution and the Mexican laws that were collected from the official site in PDF format repealed before 14 October 2021. The collection has 305 documents: the Mexican constitution, 289 laws, 8 federal codes, 3 regulations, 2 statutes, 1 decree, and 1 ordinance. The semi-structured database converts the set of laws from PDF format to a digital representation that facilitates computational analysis. The documents were migrated to JSON files to represent internal hierarchical relations. In addition, basic natural language processing techniques were applied to the laws to identify parts of speech and named entities. The presented data set is mainly useful for text analysis and data science. It could be used for various legislative analysis tasks, including comprehension, interpretation, translation, classification, accessibility, coherence, and search. Finally, we present some statistics on the identified entities and an example of the usefulness of the corpus for environmental laws.",

keywords = "Mexican legislation, laws, legislative documents, natural language processing",

author = "Bella Martinez-Seis and Obdulia Pichardo-Lagunas and Harlan Koff and Miguel Equihua and Octavio Perez-Maqueo and Arturo Hern{\'a}ndez-Huerta",

note = "Publisher Copyright: {\textcopyright} 2022 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2022",

month = jul,

doi = "10.3390/data7070091",

language = "English",

volume = "7",

journal = "Data",

issn = "2306-5729",

publisher = "MDPI AG",

number = "7",

}

TY - JOUR

T1 - Unified, Labeled, and Semi-Structured Database of Pre-Processed Mexican Laws

AU - Martinez-Seis, Bella

AU - Pichardo-Lagunas, Obdulia

AU - Koff, Harlan

AU - Equihua, Miguel

AU - Perez-Maqueo, Octavio

AU - Hernández-Huerta, Arturo

PY - 2022/7

Y1 - 2022/7

N2 - This paper presents a corpus of pre-processed Mexican laws for computational tasks. The main contributions are the proposed JSON structure and the methodology, with the selected algorithms, used to build the semi-structured corpus. Law PDF documents were transformed into plain text, unified by a deconstruction of law–document structure, and labeled with natural language processing techniques, including part-of-speech (PoS) tagging; entity extraction was also performed. The corpus includes the Mexican constitution and the Mexican laws that were collected from the official site in PDF format repealed before 14 October 2021. The collection has 305 documents: the Mexican constitution, 289 laws, 8 federal codes, 3 regulations, 2 statutes, 1 decree, and 1 ordinance. The semi-structured database converts the set of laws from PDF format to a digital representation that facilitates computational analysis. The documents were migrated to JSON files to represent internal hierarchical relations. In addition, basic natural language processing techniques were applied to the laws to identify parts of speech and named entities. The presented data set is mainly useful for text analysis and data science. It could be used for various legislative analysis tasks, including comprehension, interpretation, translation, classification, accessibility, coherence, and search. Finally, we present some statistics on the identified entities and an example of the usefulness of the corpus for environmental laws.

KW - Mexican legislation

KW - laws

KW - legislative documents

KW - natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85133856967&partnerID=8YFLogxK

U2 - 10.3390/data7070091

DO - 10.3390/data7070091

M3 - Article

AN - SCOPUS:85133856967

SN - 2306-5729

VL - 7

JO - Data

JF - Data

IS - 7

M1 - 91

ER -
