NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts

This repository contains the NUBes corpus and other related material.

WARNING: Please note that the repository contains Git LFS files. If you have any problem cloning these files, please get in touch and we will provide alternative ways to obtain the data.

NUBes and IULA+
LREC2020
Citation
License
Contact

NUBes and IULA+

The NUBes corpus (from "Negation and Uncertainty annotations in Biomedical texts in Spanish") consists of sentences obtained from anonymised health records and annotated with negation and uncertainty phenomena. As far as we know, it is currently the largest publicly available corpus for negation in Spanish and the first that also incorporates the annotation of speculation cues, scopes, and events.

IULA+ is a new version of the IULA-SCRC corpus (accessible here). More specifically, it consists of the same texts but annotated with NUBes' guidelines.

A couple of interesting remarks:

The annotation guidelines can be consulted here (in Spanish).
NUBes and IULA+ are distributed in BRAT standoff format.

Size and Composition

NUBes is divided into 10 samples of approximately 3K sentences each.
The first sample (SAMPLE-001) has been annotated following a process involving 2 annotators and a referee, while the rest have been annotated by one person only. In this regard, the first sample can be said to be of higher quality.
The sentences in each sample have been in turn grouped by the medical specialty and report sections to which they belong. So, for instance, the file sample-001.traum.chico.txt contains sentences of the specialty "traumatology" and the section "chief complaint".

You can consult the sizes of NUBes and IULA+ in the following table:

	NUBes	IULA+
overall stats
sentences	29,682	3,363
tokens	518,068	38,208
vocabulary size	31,698	8,651
negation
sentences affected	7,567	1,022^*
average cues per affected sentence	1.25 +/- 0.66	1.20 +/- 0.59
discontinuous cues	0	0
average scope size in tokens	4.01 +/- 3.59	3.13 +/- 2.67
discontinuous scopes	219	24
uncertainty
sentences affected	2,219	178^*
average cues per affected sentence	1.12 +/- 0.38	1.12 +/- 0.38
discontinuous cues	95	20
average scope size in tokens	5.27 +/- 4.97	4.75 +/- 3.96
discontinuous scopes	123	7

^* These numbers do not match those in the paper's Table 1; the correct counts are shown here.

Data Protection

All sensitive information (e.g., people names, healthcare facilities, dates, and so on) in NUBes have been subsituted with fake similar data. Furthermore, the sentences have been shuffled in order to hinder de-anonymization efforts as much as possible. That is, subsequent sentences in the corpus provided most likely did not occur together in the original health records. What is more, sentences that belong to the same health record are scattered across different samples.

Related Publications

To know more about NUBes, read our article "NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts" [pdf].

To know more about IULA-SCRC, read the article "Annotation of negation in the IULA Spanish Clinical Record Corpus" by Montserrat Marimon, Jorge Vivaldi and Núria Bel [pdf].

Please see Section Citation to learn how to cite these works.

LREC2020

This directory contains material and scripts related to the experiments section in the paper "NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts" (see above).

Dataset

`data/`

Here you will find the dataset splits —train, development and test— used in the experiments. Specifically, the files provided contain the full set of features described in the paper. Use the script ablation.py, explained below, to obtain the files to conduct the ablation study.

WARNING: Please note that this dataset is stored with Git LFS. If you have any problem cloning these files, please get in touch and we will provide alternative ways to obtain the data.

`ablation.py`

Python version	Dependencies
>= Python3.5	pandas

This script generates the files necessary to perform the ablation study described in the paper. For each data split, it will generate 6 new files, each with a different group of features left out. For instance, from the file data/train.bio, it will create:

data/train.abl-form.bio
data/train.abl-morphsyn.bio
data/train.abl-brown.bio
data/train.abl-metadata.bio
data/train.abl-window.bio
data/train.token.bio

The last file does not contain any feature apart from the tokens themselves.

Optionally, you may indicate input and output paths, as well as the number of parallel processes to be launched:

usage: python3 ablation.py [-h] [-i I] [-o O] [-p P]

optional arguments:
  -h, --help           show this help message and exit
  -i I, --input I      path to directory containing the dataset (default: data)
  -o O, --output O     path to output directory (default: data)
  -p P, --processes P  number of parallel processes (default: cpu_count()//2)

The following command should work out of the box:

python3 ablation.py

Training

`train.config`

This is the NCRF++ configuration file we used for training the models described in the paper. This configuration file specifies the neural network's architecture and hyperparamters. Read about how to install NCRF++ and how to train models and use them for decoding at https://github.com/jiesutd/NCRFpp.

NOTE: you must change the I/O section in the file so that

it points to your dataset files (parameters train_dir, dev_dir and test_dir)
it writes the resulting models to the desired folder and with the desired name (parameter model_dir)

### I/O ###
train_dir=/PATH/TO/train.bio
dev_dir=/PATH/TO/dev.bio
test_dir=/PATH/TO/test.bio
model_dir=/PATH/TO/<MODEL NAME>

Evaluation

`eval.py`

Python version	Dependencies
>= Python3.5	sklearn, pandas

This is the script we used to obtain the results reported in the article.

It offers 3 evaluation scenarios:

"bin" or "binary": binary classification (i.e., "IN" or "OUT")
"class" or "type": category classification
"full": category and BIO-tag classification

The results reported in the paper correspond to the category classification scenario.

The usage of the script is as follows:

usage: python3 eval.py [-h] [--task {bin,class,full}] [--true T] pred [pred ...]

positional arguments:
  pred                  path to predictions file (accepts multiple paths)

optional arguments:
  -h, --help            show this help message and exit
  --task {bin,class,full}
                        evaluation mode (see above; default: "cat")
  --true T              path to gold standard file (default: data/test.bio)

Once you have trained your own model(s) and decoded the test set, you may evaluate the results simply by doing:

python3 eval.py /PATH/TO/PREDICTION-1 /PATH/TO/PREDICTION-2 /PATH/TO/PREDICTION-3

If you do:

python3 eval.py data/test.bio

you should obtain perfect results (because you will be evaluating the gold labels against themselves).

Citation

If you use NUBes, IULA+ or any of the provided material in your publications, please cite us appropriately:

@inproceedings{lima2020nubes,
  author      = {Salvador Lima Lopez and Naiara Perez and Montse Cuadros and German Rigau},
  title       = "{NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts}",
  booktitle   = {Proceedings of The 12th Language Resources and Evaluation Conference (LREC2020)},
  month       = {May},
  year        = {2020},
  address     = {Marseille, France},
  publisher   = {European Language Resources Association},
  pages       = {5772--5781}
}

If you use IULA+, please cite also the paper describing the original corpus, IULA-SCRC:

@inproceedings{marimon2017annotation,
  author      = {Montserrat Marimon and Jorge Vivaldi and N{\'u}ria Bel Rafecas},
  title       = "{Annotation of negation in the IULA Spanish Clinical Record Corpus}",
  booktitle   = {Proceedings of the Workshop Computational Semantics Beyond Events and Roles (SemBEaR)},
  month       = {Apr},
  year        = {2017},
  address     = {Valencia, Spain},
  publisher   = {Association for Computational Linguistics},
  pages       = {43--52}
}

License

The resources NUBes, IULA+, NUBes experiment splits and NUBes annotation guidelines are licensed under the Creative Commons Attribution-ShareAlike 3.0 Spain License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/es/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

The scripts ablation.py and eval.py are copyright of Vicomtech -- (c) 2020 Vicomtech. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Contact

If you have any question or suggestion, do not hesitate to contact us at the following addresses:

Naiara Perez: nperez@vicomtech.org
Salvador Lima: slima@vicomtech.org
Montse Cuadros: mcuadros@vicomtech.org
German Rigau: german.rigau@ehu.eus

Vicomtech/NUBes-negation-uncertainty-biomedical-corpus