Evaluating the Impact of OCR Quality on Short Texts Classification Task

Oxana Vitman, Yevhen Kostiuk, Paul Plachinda, Alisa Zhila, Grigori Sidorov, Alexander Gelbukh

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

1 Citation (Scopus)

Abstract

The majority of text classification algorithms have been developed and evaluated on texts written by humans and originating in text form. However, in the contemporary world, with its abundance of smartphones and readily available cameras, an ever-increasing amount of textual information comes from text captured on photographed objects such as road and business signs, product labels and price tags, and random phrases on t-shirts, among countless others. One way to process such information is to pass an image containing text through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to the recognized text. However, OCR text is not quite equivalent to ‘natural’ language or human-written text, because its spelling errors are not the same as those usually committed by humans. Assuming that the distribution of OCR errors differs from the distribution of human errors, we examine how, and how much, this difference affects classifiers. We focus on deterministic classifiers such as fuzzy search as well as on popular neural-network-based classifiers, including CNN, BERT, and RoBERTa. We found that applying a spell corrector to OCRed text increases the F1 score by 4% for CNN and by 2% for BERT.
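As a rough illustration of the deterministic fuzzy-search idea mentioned in the abstract, the following minimal sketch assigns a label to a short OCRed text by approximate keyword matching. It is not the authors' implementation: the `LABEL_KEYWORDS` table, the `fuzzy_classify` function, and the 0.7 threshold are all hypothetical, and the similarity measure is the ratio from Python's standard `difflib.SequenceMatcher`, chosen here only for convenience.

```python
from difflib import SequenceMatcher

# Hypothetical label-to-keyword table, invented for illustration only.
LABEL_KEYWORDS = {
    "food": ["pizza", "coffee", "bakery"],
    "transport": ["parking", "taxi", "station"],
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_classify(text: str, threshold: float = 0.7):
    """Return the label of the keyword closest to any token, or None.

    The threshold is an assumed cut-off: tokens less similar than this
    to every keyword are treated as unrelated.
    """
    best_label, best_score = None, 0.0
    for token in text.split():
        for label, keywords in LABEL_KEYWORDS.items():
            for keyword in keywords:
                score = similarity(token, keyword)
                if score > best_score:
                    best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# An OCR-style confusion ("1" for "I") still matches the keyword "pizza".
print(fuzzy_classify("P1ZZA and drinks"))
```

Under these assumptions, a character-level OCR substitution lowers the similarity score without necessarily pushing it below the threshold, which is one reason a fuzzy matcher can tolerate some recognition noise that an exact-match lookup would not.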

Original language: English
Title of host publication: Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings
Editors: Obdulia Pichardo Lagunas, Bella Martínez Seis, Juan Martínez-Miranda
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 163-177
Number of pages: 15
ISBN (print): 9783031194955
DOI
Status: Published, 2022
Event: 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Monterrey, Mexico
Duration: 24 Oct 2022 to 29 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13613 LNAI
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 21st Mexican International Conference on Artificial Intelligence, MICAI 2022
Country/Territory: Mexico
City: Monterrey
Period: 24/10/22 to 29/10/22
