/bionlp2017

Extract annotated misspellings from MIMIC-III.


This repository contains the source code for the paper 'Unsupervised Context-Sensitive Spelling Correction for Clinical Free-Text with Word and Character N-Gram Embeddings', which will be presented at the BioNLP Workshop at ACL 2017. It includes a script to extract our manually annotated MIMIC-III data.

Requirements

  • Python 2.7

To extract our manually annotated MIMIC-III test data, you need access to the MIMIC-III v1.3 database. The version matters: the extraction script does not work with later versions of the database.

Extracting the annotated test data

To extract the annotated test data, git clone this repository and run

python2.7 extract_test.py [path to NOTEEVENTS.csv file from the MIMIC-III v1.3 database]

from inside the directory. The script preprocesses the NOTEEVENTS.csv data and stores the result in the file mimic_preprocessed.txt. It then extracts the annotated test data and stores it in the file testcorpus.json as four lists: correct replacements, misspellings, misspelling contexts, and line indices.
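
Once testcorpus.json exists, the four lists can be read back and paired up per misspelling. The following is a minimal sketch, not part of the repository. It assumes that the file holds a top-level JSON list containing the four lists in the order given above, and that the lists are parallel (entry i of each list describes the same misspelling). The helper names and field names are hypothetical.

```python
import json

# Assumed field order, taken from the description above; the actual layout
# of testcorpus.json is not confirmed here. Field names are hypothetical.
FIELDS = ("correction", "misspelling", "context", "line_index")


def iter_annotations(corpus):
    """Yield one dict per annotated misspelling from four parallel lists."""
    if not isinstance(corpus, list) or len(corpus) != len(FIELDS):
        raise ValueError("expected a list of %d lists" % len(FIELDS))
    # Parallel lists must all have the same length.
    if len(set(len(column) for column in corpus)) > 1:
        raise ValueError("the four lists have different lengths")
    for row in zip(*corpus):
        yield dict(zip(FIELDS, row))


def load_annotations(path="testcorpus.json"):
    """Read the JSON file and return a list of per-misspelling dicts."""
    with open(path) as handle:
        return list(iter_annotations(json.load(handle)))
```

Calling load_annotations() from the repository directory would then return one record per annotated misspelling. If the file turns out to use a different layout, the checks raise a ValueError instead of silently mispairing entries.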