TY - GEN
T1 - Evaluating the Impact of OCR Quality on Short Texts Classification Task
AU - Vitman, Oxana
AU - Kostiuk, Yevhen
AU - Plachinda, Paul
AU - Zhila, Alisa
AU - Sidorov, Grigori
AU - Gelbukh, Alexander
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - The majority of text classification algorithms have been developed and evaluated for texts written by humans and originated in text mode. However, in the contemporary world with an abundance of smartphones and readily available cameras, the ever-increasing amount of textual information comes from the text captured on photographed objects such as road and business signs, product labels and price tags, random phrases on t-shirts, the list can be infinite. One way to process such information is to pass an image with a text in it through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to that text. However, OCR text is not quite equivalent to the ‘natural’ language or human-written text because spelling errors are not the same as those usually committed by humans. Implying that the distribution of human errors is different from the distribution of OCR errors, we compare how much and how it affects the classifiers. We focus on deterministic classifiers such as fuzzy search as well as on the popular Neural Network based classifiers including CNN, BERT, and RoBERTa. We discovered that applying spell corrector on OCRed text increases F1 score by 4% for CNN and by 2% for BERT.
AB - The majority of text classification algorithms have been developed and evaluated for texts written by humans and originated in text mode. However, in the contemporary world with an abundance of smartphones and readily available cameras, the ever-increasing amount of textual information comes from the text captured on photographed objects such as road and business signs, product labels and price tags, random phrases on t-shirts, the list can be infinite. One way to process such information is to pass an image with a text in it through an Optical Character Recognition (OCR) processor and then apply a natural language processing (NLP) system to that text. However, OCR text is not quite equivalent to the ‘natural’ language or human-written text because spelling errors are not the same as those usually committed by humans. Implying that the distribution of human errors is different from the distribution of OCR errors, we compare how much and how it affects the classifiers. We focus on deterministic classifiers such as fuzzy search as well as on the popular Neural Network based classifiers including CNN, BERT, and RoBERTa. We discovered that applying spell corrector on OCRed text increases F1 score by 4% for CNN and by 2% for BERT.
KW - BERT
KW - CNN
KW - Fuzzy search
KW - Multi-class classification
KW - NLP
KW - OCR
KW - RoBERTa
KW - Short texts
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85142778997&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-19496-2_13
DO - 10.1007/978-3-031-19496-2_13
M3 - Contribución a la conferencia
AN - SCOPUS:85142778997
SN - 9783031194955
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 163
EP - 177
BT - Advances in Computational Intelligence - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Proceedings
A2 - Pichardo Lagunas, Obdulia
A2 - Martínez Seis, Bella
A2 - Martínez-Miranda, Juan
PB - Springer Science and Business Media Deutschland GmbH
T2 - 21st Mexican International Conference on Artificial Intelligence, MICAI 2022
Y2 - 24 October 2022 through 29 October 2022
ER -