Development and Validation of a Model to Identify Critical Brain Injuries Using Natural Languege Processing of Text Computed Tomography Reports
This project was develop by Dr. Kim Lab https://medicine.yale.edu/lab/kim/
Paper source https://jamanetwork.com/journals/jamanetworkopen https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2795179
Clinical text reports from computed tomography(CT) represent rich, incompletely utilized information regarding acute brain injuries and neurologic outcomes. CT reports are unstructured; thus, extracting information at scare requires automated natural language processing (NLP). However, designing a new NLP algorithm for each individual category is an unwieldy proposition. An NLP tool that summarizes all injuries in head CT reports would facilitate the exploration of large data sets for clinical significance of neuroradiological findings.
Wide output by the model by report (You can batch scrip this process in order to have multiple reports in one dataframe or table)
Anaconda or Miniconda Environments https://docs.conda.io/en/latest/miniconda.html
Python 3.7 or later. https://www.python.org/ or https://www.python.org/downloads/release/python-379/
Spacy 3.0 https://spacy.io/ ( The model can work in newer versions of spacy, it would raise a warning, we would retrain the model in the newer versions)
How to Create a new environment:
In your Anaconda cmd enter:
conda create -n Bner python=3.7
conda activate Bner
conda install -c conda-forge spacy
conda install -c conda-forge cupy
Inside the Models folder you will find the link to download the models. In order to use a model you need to save the model in your project directory. Or you can save the model in another directory and call the model directory from the load function
Below is the link to download the model from our drive:
https://drive.google.com/drive/folders/1iO3qA9XppRrnOqZWrrDyWXuIVpr7FzEJ?usp=sharing
Inside the Examples folder you will find few examples on how to use the model using jupyter notebooks. More examples to come**
The link below will take you to a interactive tutorial on how to use the model. We will have many more tutorials soon.
https://colab.research.google.com/drive/1x7Kul4J4rYuIdlEuR8QRgtxzN6IXXHjg?usp=sharing
This project seeks to execute term tagging to represent CT reports as a mapping of important terms to their presence in the document.
Meaning | What to place in cell |
---|---|
Pertinent positive - Term is mentioned specifically as being present in patient | + |
Pertinent negative - Term is mentioned specifically as not present in patient | - |
Not mentioned - Term is not mentioned | 0 |
Possible - Term is mentioned with some uncertainty as to presence | P |
Reports are the long data output from our entity recognition model run on each document. They are saved in this repo under /Long_data_manuscript or you can update the directory in constants.py
These reports are CSV files with the following headers: Term number,Entities,val
This is the script which runs our term tagging algorithm and checks its performance against the provided validation file. Additionally, it will create data.sql with the sql database representing the output as well as executes the queries outlined in /queries folder.
This is the script which runs our term tagging algorithm as well as property extraction. Additionally, it creates a database file named properties.sql as well as executes the queries outlined in /queries folder.
When running you need the following files: BrainNERD.py constants.py db.py dbqueries.py metrics.py propertyExtraction.py terms validation/ validation_final.csv Long_data_manuscript/ training/ long_1.csv ... test/ long_101.csv ...
Install python3: https://www.python.org/downloads/
Then run the following commands from your terminal while in this repo. Setup your virtual environment (only need to do once)
python3 -m venv brain
source brain/bin/activate
pip install sklearn
pip install tabulate
Make sure you are in your virtual environment (you should see (brain) to the left of your terminal) If you are not in your virtual environment, start by running:
source brain/bin/activate
Before running, make sure you edit the values in constants.py to point to your data, validation, etc.
- TERM_FILE
- LONG_DIR
- VALIDATION_DIR
- START_NUM
- DOC_COUNT
python BrainNERD.py
We have built in a method for running queries against the created Database. Start by creating a file in the /queries folder lets say query1.txt
Input an injury term followed by (+, -, or p) to denote whether you want to query for that term as present, not present, or possible. Each of these query parameters will be joined by an AND in the code. The file may look like below, which would mean to return all documents which have:
Hemorrhage +
Stroke -
Hydrocephalus -
Then simply run either the BrainNERD.py for just summary info. This will output into folder /queries-output. This error can be disregarded "The error 'UNIQUE constraint failed: documents.id' occurred"
python BrainNERD.py
- query1-output.txt: contains the document summary and aggregate information of each injury.
Or run propertyExtraction.py for injury property info as well. This will output into folder /queries-output-properties. This error can be disregarded "The error 'UNIQUE constraint failed: documents.id' occurred"
python propertyExtraction.py
- query1-doc.csv: contains the document summary for the documents which match your query
- query1-injury.csv: contains all of the properties related to all injuries in the document which matches your query.
Install Sqlite: https://www.tutorialspoint.com/sqlite/sqlite_installation.htm
Run the following commands to enter the sqlite terminal, load the output from brainclass, and set some settings. The final command will show you how the table was created and all of the column names.
sqlite3 data.sql
.headers ON
.mode column
.schema documents
select * from documents;
Get document IDs according to some constraints. Notice that the last constraint is in parentheses to mean that stroke is either negative or not mentioned.
select id from documents WHERE Hemorrhage = "+" AND Edema = "p" AND (Stroke = "-" OR Stroke = "0");
select count(*) from documents WHERE Hemorrhage = "+" AND Edema = "p" AND (Stroke = "-" OR Stroke = "0");
If you use the model in your publications, please cite this Paper: https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2795179