ParTNER: Paragraph Tuning for Named Entity Recognition on Clinical Cases in Spanish using mBERT + Rules

Antonio Tamayo, Diego Burgos, Alexander Gelbukh

Research output: Contribution to journal › Conference article › peer-review

Abstract

Named entity recognition (NER) and normalization are crucial tasks for information extraction in the medical domain. They have been tackled through approaches ranging from rule-based systems and classic machine learning with feature engineering to the most sophisticated deep learning models, most of them for English. In this work, we present a transfer learning approach based on multilingual BERT to tackle Spanish NER (species mentions) and normalization in clinical cases, using sentence tokenization for training and a paragraph tuning strategy at the inference phase. We propose that text lengths at the training and inference stages do not have to match and that this difference can improve the model's performance depending on the task. Our validation showed that using a context of three sentences during inference improves the F1 score by ≈1% compared to longer and shorter paragraphs and by ≈17% compared to the whole document. We also applied simple but effective post-processing rules to the model's output, which improved the micro F1 score by ≈28%. Our system achieved an F1 score of 0.8499 on the test set of the LivingNER shared task 2022.
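The paragraph tuning idea — training on sentences but running inference over three-sentence windows — can be sketched as follows. This is a minimal illustration, not the authors' code: the naive regex sentence splitter and the non-overlapping grouping are assumptions standing in for whatever tokenizer and windowing the system actually used.

```python
import re

def sentence_split(text):
    # Naive sentence splitter (assumption: a proper Spanish sentence
    # tokenizer would be used in practice; this regex stands in for it).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def paragraph_windows(sentences, size=3):
    # Group consecutive sentences into "paragraphs" of `size` sentences,
    # the inference context the abstract reports as best-performing.
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), size)]

doc = ("El paciente presenta fiebre. Se detecta Escherichia coli. "
       "Se inicia tratamiento. La evolución es favorable.")
windows = paragraph_windows(sentence_split(doc), size=3)
# Each window (up to three sentences) would then be fed to the
# fine-tuned mBERT NER model instead of the whole clinical case,
# and the model's output post-processed with the rule set.
```

The point of the sketch is only the length mismatch: the model sees single sentences at training time but three-sentence contexts at prediction time.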

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 3202
State: Published - 2022
Event: 2022 Iberian Languages Evaluation Forum, IberLEF 2022 - A Coruña, Spain
Duration: 20 Sep 2022 → …

Keywords

  • Named entity recognition
  • multilingual BERT
  • normalization
  • paragraph tuning
  • transfer learning

