Evaluating the Impact of OCR Quality on Short Texts Classification Task

Oxana Vitman; Yevhen Kostiuk; Paul Plachinda; Alisa Zhila; Grigori Sidorov; Alexander Gelbukh

doi:10.1007/978-3-031-19496-2_13

Centro de Investigación en Computación (CIC)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

The majority of text classification algorithms have been developed and evaluated for texts written by humans and originated in text mode. However, in the contemporary world with an abundance of smartphones and readily available cameras, the ever-increasing amount of textual information comes from the text captured on photographed objects such as road and business signs, product labels and price tags, random phrases on t-shirts, the list can be infinite. One way to process such information is to pass an image with a text in it through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to that text. However, OCR text is not quite equivalent to the ‘natural’ language or human-written text because spelling errors are not the same as those usually committed by humans. Implying that the distribution of human errors is different from the distribution of OCR errors, we compare how much and how it affects the classifiers. We focus on deterministic classifiers such as fuzzy search as well as on the popular Neural Network based classifiers including CNN, BERT, and RoBERTa. We discovered that applying spell corrector on OCRed text increases F1 score by 4% for CNN and by 2% for BERT.

Original language	English
Title of host publication	Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings
Editors	Obdulia Pichardo Lagunas, Bella Martínez Seis, Juan Martínez-Miranda
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	163-177
Number of pages	15
ISBN (Print)	9783031194955
DOI	https://doi.org/10.1007/978-3-031-19496-2_13
Publication status	Published - 2022
Event	21st Mexican International Conference on Artificial Intelligence, MICAI 2022 - Monterrey, Mexico Duration: 24 Oct 2022 → 29 Oct 2022

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	13613 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	21st Mexican International Conference on Artificial Intelligence, MICAI 2022
Country/Territory	Mexico
City	Monterrey
Period	24/10/22 → 29/10/22

Access to document

10.1007/978-3-031-19496-2_13

Other files and links

Link to publication in Scopus

Cite this

Vitman, O., Kostiuk, Y., Plachinda, P., Zhila, A., Sidorov, G., & Gelbukh, A. (2022). Evaluating the Impact of OCR Quality on Short Texts Classification Task. In O. Pichardo Lagunas, B. Martínez Seis, & J. Martínez-Miranda (Eds.), Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings (pp. 163-177). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13613 LNAI). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-19496-2_13

Vitman, Oxana ; Kostiuk, Yevhen ; Plachinda, Paul et al. / Evaluating the Impact of OCR Quality on Short Texts Classification Task. Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings. editor / Obdulia Pichardo Lagunas ; Bella Martínez Seis ; Juan Martínez-Miranda. Springer Science and Business Media Deutschland GmbH, 2022. pp. 163-177 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{23f6318ff10d4909935d930855c8577b,

title = "Evaluating the Impact of OCR Quality on Short Texts Classification Task",

abstract = "The majority of text classification algorithms have been developed and evaluated for texts written by humans and originated in text mode. However, in the contemporary world with an abundance of smartphones and readily available cameras, the ever-increasing amount of textual information comes from the text captured on photographed objects such as road and business signs, product labels and price tags, random phrases on t-shirts, the list can be infinite. One way to process such information is to pass an image with a text in it through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to that text. However, OCR text is not quite equivalent to the {\textquoteleft}natural{\textquoteright} language or human-written text because spelling errors are not the same as those usually committed by humans. Implying that the distribution of human errors is different from the distribution of OCR errors, we compare how much and how it affects the classifiers. We focus on deterministic classifiers such as fuzzy search as well as on the popular Neural Network based classifiers including CNN, BERT, and RoBERTa. We discovered that applying spell corrector on OCRed text increases F1 score by 4% for CNN and by 2% for BERT.",

keywords = "BERT, CNN, Fuzzy search, Multi-class classification, NLP, OCR, RoBERTa, Short texts, Text classification",

author = "Oxana Vitman and Yevhen Kostiuk and Paul Plachinda and Alisa Zhila and Grigori Sidorov and Alexander Gelbukh",

note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.; 21st Mexican International Conference on Artificial Intelligence, MICAI 2022 ; Conference date: 24-10-2022 Through 29-10-2022",

year = "2022",

doi = "10.1007/978-3-031-19496-2_13",

language = "English",

isbn = "9783031194955",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Science and Business Media Deutschland GmbH",

pages = "163--177",

editor = "{Pichardo Lagunas}, Obdulia and {Mart{\'i}nez Seis}, Bella and Juan Mart{\'i}nez-Miranda",

booktitle = "Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings",

address = "Germany",

}

Vitman, O, Kostiuk, Y, Plachinda, P, Zhila, A, Sidorov, G & Gelbukh, A 2022, Evaluating the Impact of OCR Quality on Short Texts Classification Task. In O Pichardo Lagunas, B Martínez Seis & J Martínez-Miranda (eds.), Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13613 LNAI, Springer Science and Business Media Deutschland GmbH, pp. 163-177, 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Monterrey, Mexico, 24/10/22. https://doi.org/10.1007/978-3-031-19496-2_13

Evaluating the Impact of OCR Quality on Short Texts Classification Task. / Vitman, Oxana; Kostiuk, Yevhen; Plachinda, Paul et al.
Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings. ed. / Obdulia Pichardo Lagunas; Bella Martínez Seis; Juan Martínez-Miranda. Springer Science and Business Media Deutschland GmbH, 2022. p. 163-177 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13613 LNAI).

TY - GEN

T1 - Evaluating the Impact of OCR Quality on Short Texts Classification Task

AU - Vitman, Oxana

AU - Kostiuk, Yevhen

AU - Plachinda, Paul

AU - Zhila, Alisa

AU - Sidorov, Grigori

AU - Gelbukh, Alexander

PY - 2022

Y1 - 2022

N2 - The majority of text classification algorithms have been developed and evaluated for texts written by humans and originated in text mode. However, in the contemporary world with an abundance of smartphones and readily available cameras, the ever-increasing amount of textual information comes from the text captured on photographed objects such as road and business signs, product labels and price tags, random phrases on t-shirts, the list can be infinite. One way to process such information is to pass an image with a text in it through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to that text. However, OCR text is not quite equivalent to the ‘natural’ language or human-written text because spelling errors are not the same as those usually committed by humans. Implying that the distribution of human errors is different from the distribution of OCR errors, we compare how much and how it affects the classifiers. We focus on deterministic classifiers such as fuzzy search as well as on the popular Neural Network based classifiers including CNN, BERT, and RoBERTa. We discovered that applying spell corrector on OCRed text increases F1 score by 4% for CNN and by 2% for BERT.

AB - The majority of text classification algorithms have been developed and evaluated for texts written by humans and originated in text mode. However, in the contemporary world with an abundance of smartphones and readily available cameras, the ever-increasing amount of textual information comes from the text captured on photographed objects such as road and business signs, product labels and price tags, random phrases on t-shirts, the list can be infinite. One way to process such information is to pass an image with a text in it through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to that text. However, OCR text is not quite equivalent to the ‘natural’ language or human-written text because spelling errors are not the same as those usually committed by humans. Implying that the distribution of human errors is different from the distribution of OCR errors, we compare how much and how it affects the classifiers. We focus on deterministic classifiers such as fuzzy search as well as on the popular Neural Network based classifiers including CNN, BERT, and RoBERTa. We discovered that applying spell corrector on OCRed text increases F1 score by 4% for CNN and by 2% for BERT.

KW - BERT

KW - CNN

KW - Fuzzy search

KW - Multi-class classification

KW - NLP

KW - OCR

KW - RoBERTa

KW - Short texts

KW - Text classification

UR - http://www.scopus.com/inward/record.url?scp=85142778997&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-19496-2_13

DO - 10.1007/978-3-031-19496-2_13

M3 - Conference contribution

AN - SCOPUS:85142778997

SN - 9783031194955

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 163

EP - 177

BT - Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings

A2 - Pichardo Lagunas, Obdulia

A2 - Martínez Seis, Bella

A2 - Martínez-Miranda, Juan

PB - Springer Science and Business Media Deutschland GmbH

T2 - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022

Y2 - 24 October 2022 through 29 October 2022

ER -

Vitman O, Kostiuk Y, Plachinda P, Zhila A, Sidorov G, Gelbukh A. Evaluating the Impact of OCR Quality on Short Texts Classification Task. In Pichardo Lagunas O, Martínez Seis B, Martínez-Miranda J, editors, Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings. Springer Science and Business Media Deutschland GmbH. 2022. p. 163-177. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-031-19496-2_13