Negation and Uncertainty Detection in Clinical Texts Written in Spanish: A Deep Learning-Based Approach

This repository contains a deep learning-based approach for uncertainty and negation detection from clinical texts written in Spanish.

Methods

The approach explores two deep learning methods for negation and uncertainty detection in clinical text written in Spanish, BiLSTM-CRF and BERT:
  • Bidirectional Long Short-Term Memory with a CRF layer (BiLSTM-CRF): This method consists of three layers: an embedding layer, a BiLSTM layer, and a CRF layer. The directory "BiLSTM" contains the implementation of this method.

    • Embeddings: We used two types of embeddings, biomedical and clinical. Biomedical embeddings for the Spanish language [2] can be downloaded from Zenodo. Clinical embeddings were trained on more than 1 million clinical notes from two public hospitals in Spain and Colombia. Clinical embeddings can be made available only after an evaluation by the hospital's ethics committee.

  • Bidirectional Encoder Representations from Transformers (BERT): We use a pre-trained BERT model fine-tuned with a classification layer on top. We use Multilingual BERT as contextualized embeddings. The directory "BERT" contains the implementation of this method.
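
To illustrate what the CRF layer in the BiLSTM-CRF method contributes, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the repository's implementation: the tag set, the scores, and the function name `viterbi_decode` are assumptions made for this example. The emission scores stand in for what a BiLSTM layer would produce per token, and the transition scores penalize a tag sequence that starts an entity with an "I-" tag.

```python
from typing import Dict, List

# Hypothetical tag set for illustration; the repository's label scheme may differ.
TAGS = ["O", "B-NEG", "I-NEG"]


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag] is the per-token score for `tag` at position t.
    transitions[prev][cur] is the score of moving from tag `prev` to tag `cur`.
    """
    if not emissions:
        return []
    # best[t][tag] holds the best score of any path ending in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    backpointers = []
    for t in range(1, len(emissions)):
        scores = {}
        pointers = {}
        for cur in TAGS:
            prev_best = max(
                TAGS, key=lambda prev: best[t - 1][prev] + transitions[prev][cur]
            )
            scores[cur] = (
                best[t - 1][prev_best] + transitions[prev_best][cur] + emissions[t][cur]
            )
            pointers[cur] = prev_best
        best.append(scores)
        backpointers.append(pointers)
    # Start from the best final tag and follow the back-pointers to the first token.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Made-up scores for a three-token sentence.
example_emissions = [
    {"O": 2.0, "B-NEG": 0.0, "I-NEG": 0.0},
    {"O": 0.0, "B-NEG": 1.0, "I-NEG": 1.5},
    {"O": 0.0, "B-NEG": 0.0, "I-NEG": 2.0},
]
# All transitions are neutral except O -> I-NEG, which is strongly penalized.
example_transitions = {prev: {cur: 0.0 for cur in TAGS} for prev in TAGS}
example_transitions["O"]["I-NEG"] = -10.0

decoded = viterbi_decode(example_emissions, example_transitions)
print(decoded)  # ['O', 'B-NEG', 'I-NEG']
```

Choosing the best tag for each token independently would give O, I-NEG, I-NEG for these scores. Decoding over the whole sequence with transition scores avoids the invalid O to I-NEG step.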

Datasets

We use three datasets to evaluate the proposed approach for negation and uncertainty detection. NUBES [3] and IULA [4] are two public corpora available for the Spanish language. The third dataset is an in-house annotated corpus with real-life data from cancer patients.

Trained Models

We provide models trained on the NUBES corpus. These models can be evaluated or applied in real-life case studies with clinical notes written in Spanish, and they can also be integrated into medical text mining applications. The directory "trained_models" contains instructions for using these models.



Pre-processing

The datasets described above are pre-processed before being used by the BiLSTM and BERT-based models. We provide scripts that pre-process the datasets (see the Pre-processing directory).
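
As an illustration of the kind of transformation such pre-processing usually involves, the sketch below converts character-level span annotations into one BIO label per token. It is an assumption about the general technique, not a description of the repository's scripts: the input format, the label names "NEG" and "NSCO", and the helpers `tokenize_with_offsets` and `spans_to_bio` are hypothetical.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start offset, end offset)
Span = Tuple[int, int, str]   # (start offset, end offset, label)


def tokenize_with_offsets(text: str) -> List[Token]:
    """Split text into words and punctuation marks, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def spans_to_bio(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Assign a BIO label to each token from non-overlapping annotated spans."""
    labels = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span when it lies fully within it.
            if tok_start >= span_start and tok_end <= span_end:
                labels[i] = ("I-" if inside else "B-") + label
                inside = True
    return labels


# Hypothetical annotation: "No" is a negation cue and the rest is its scope.
sentence = "No se observa derrame pleural."
annotations = [(0, 2, "NEG"), (3, 29, "NSCO")]

sentence_tokens = tokenize_with_offsets(sentence)
bio_labels = spans_to_bio(sentence_tokens, annotations)
for (word, _, _), label in zip(sentence_tokens, bio_labels):
    print(word, label)
```

The resulting token and label pairs are in the one-label-per-token form that sequence labeling models such as BiLSTM-CRF and BERT token classifiers typically consume.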

Validation

This directory contains scripts for loading models trained on the NUBES corpus and performing negation and uncertainty detection on a different dataset. This code can be used to validate sentence by sentence or to validate a complete dataset such as the Cancer dataset.
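
The sketch below shows one way validation over a complete dataset could be scored, using token-level precision, recall, and F1. It is only an illustration under stated assumptions: the repository's scripts may compute different metrics, and `evaluate`, `keyword_predict`, and the toy data are hypothetical stand-ins for a trained model and a real corpus.

```python
from typing import Callable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, gold labels)


def evaluate(
    dataset: List[Sentence],
    predict: Callable[[List[str]], List[str]],
) -> Tuple[float, float, float]:
    """Return token-level (precision, recall, F1) over all non-"O" labels."""
    tp = fp = fn = 0
    for tokens, gold in dataset:
        predicted = predict(tokens)
        for gold_label, pred_label in zip(gold, predicted):
            if pred_label != "O" and pred_label == gold_label:
                tp += 1
            else:
                if pred_label != "O":
                    fp += 1  # predicted a label that is not the gold label
                if gold_label != "O":
                    fn += 1  # missed or mislabeled a gold label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def keyword_predict(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for a trained model: tag two known negation cues."""
    return ["B-NEG" if tok.lower() in {"no", "sin"} else "O" for tok in tokens]


# Toy dataset with made-up gold labels.
toy_dataset = [
    (["No", "hay", "fiebre"], ["B-NEG", "O", "O"]),
    (["Paciente", "sin", "dolor"], ["O", "B-NEG", "O"]),
    (["Niega", "tos"], ["B-NEG", "O"]),
]

precision, recall, f1 = evaluate(toy_dataset, keyword_predict)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Passing a single-sentence list as `dataset` gives the sentence-by-sentence case, and passing a full corpus gives the dataset-level case.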

Contact

If you have any questions or suggestions, please contact us at the following email address: oswaldo.solartep@alumnos.upm.es



References:

  1. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016.

  2. Soares, F.; Villegas, M.; Gonzalez-Agirre, A.; Krallinger, M.; Armengol-Estapé, J. Medical Word Embeddings for Spanish: Development and Evaluation. Proceedings of the 2nd Clinical Natural Language Processing Workshop; Association for Computational Linguistics: Minneapolis, Minnesota, USA, 2019.

  3. Lima, S.; Perez, N.; Cuadros, M.; Rigau, G. NUBES: A Corpus of Negation and Uncertainty in Spanish Clinical Texts 2020. Proceedings of the Workshop Computational Semantics Beyond Events and Roles, Valencia, Spain.

  4. Marimon, M.; Vivaldi, J.; Bel, N. Annotation of negation in the IULA Spanish Clinical Record Corpus. Proceedings of the Workshop Computational Semantics Beyond Events and Roles; Association for Computational Linguistics: Valencia, Spain, 2017; pp. 43–52. doi:10.18653/v1/W17-1807

  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 2019.