philter-ucsf

Open source clinical text de-identification

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

If you use this software for any publication, please cite: Norgeot, B., Muenzen, K., Peterson, T.A. et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digit. Med. 3, 57 (2020). https://doi.org/10.1038/s41746-020-0258-y

Installing Philter

To install Philter from PyPI, run the following command:

pip3 install philter-ucsf

The main philter code will be executed by running:

python3 -m philter_ucsf [flags, see below]

However, we strongly suggest that you first download the project source code and run all sample commands below from the project's root directory before running the installed version of Philter.

Installing Requirements

To install the Python requirements, run the following command:

pip3 install -r requirements.txt

Running Philter: A Step-by-Step Guide

Philter is a command-line based clinical text de-identification software that removes protected health information (PHI) from any plain text file. Although the software has built-in evaluation capabilities and can compare Philter PHI-reduced notes with a corresponding set of ground truth annotations, annotations are not required to run Philter. The following steps may be used to 1) run Philter in the command line without ground truth annotations, or 2) generate Philter-compatible annotations and run Philter in evaluation mode using ground truth annotations. Although any set of notes and corresponding annotations may be used with Philter, the examples provided here will correspond to the I2B2 dataset, which Philter uses in its default configuration.

Before running Philter either with or without evaluation, make sure to familiarize yourself with the various options that may be used for any given Philter run:

Flags:

-i (input):        Path to the directory or the file that contains the clinical note(s). The default is ./data/i2b2_notes/
-a (anno):         Path to the directory or the file that contains the PHI annotation(s). The default is ./data/i2b2_anno/
-o (output):       Path to the directory to save the PHI-reduced notes in. The default is ./data/i2b2_results/
-f (filters):      Path to the config file. The default is ./configs/philter_delta.json
-x (xml):          Path to the json file that contains all xml data. The default is ./data/phi_notes.json
-c (coords):       Output path to the json file that will contain the coordinate map data. The default is ./data/coordinates.json
-v (verbose):      When verbose is true, will emit messages about script progress. The default is True
-e (run_eval):     When run_eval is true, will run our eval script and emit summarized results to the terminal
-t (freq_table):   When freq_table is true, will output a unigram/bigram frequency table of all note words and their PHI/non-PHI counts. The default is False
-n (initials):     When initials is true, will include annotated initials PHI in recall/precision calculations. The default is True
--eval_output:     Path to the directory that the detailed eval files will be written to. The default is ./data/phi/
--outputformat:    Defines the format of the annotation. Allowed values are "asterisk" and "i2b2". The default is "asterisk"
--ucsfformat:      When ucsfformat is true, will adjust the eval script for a slightly different xml format. The default is False
--prod:            When prod is true, will run the script with output in i2b2 xml format without running the eval script. The default is False
--cachepos:        Path to a directory to store/load the pos data for all notes. If no path is specified, memory caching will be used
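
When Philter is launched from another script, it can help to assemble the command from the flags listed above rather than hand-editing a long shell line. The following is a minimal sketch, not part of Philter itself: the helper name build_philter_command is hypothetical, and it assumes main.py is run from the project root, as in the sample commands in this guide. It only uses the -i, -o, -f, --prod and --outputformat flags documented above.

```python
import subprocess
import sys


def build_philter_command(input_dir, output_dir, config_path,
                          prod=True, output_format=None):
    """Assemble an argument list for main.py (hypothetical helper).

    Only flags documented in this guide are used.
    """
    cmd = [sys.executable, "main.py",
           "-i", input_dir,
           "-o", output_dir,
           "-f", config_path]
    if prod:
        # Same spelling as the sample commands in this guide.
        cmd.append("--prod=True")
    if output_format is not None:
        cmd.extend(["--outputformat", output_format])
    return cmd


if __name__ == "__main__":
    cmd = build_philter_command("./data/i2b2_notes/",
                                "./data/i2b2_results/",
                                "./configs/philter_delta.json",
                                output_format="asterisk")
    print(" ".join(cmd))
    # Uncomment to actually start the run from the project root:
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.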

0. Curating I2B2 XML Files

To remove non-HIPAA PHI annotations from the I2B2 XML files, run the following command:

-i Path to the directory that contains the original I2B2 xml files
-o Path to the directory where the curated files will be written

python improve_i2b2_notes.py -i data/i2b2_xml/ -o data/i2b2_xml_updated/

1. Running Philter WITHOUT evaluation (no ground-truth annotations required)

a. Make sure the input file(s) are in plain text format. If you are using the I2B2 dataset (or any other dataset in XML or other formats), the note text must be extracted from each original file and be saved in individual text files. Examples of properly formatted input files can be found in ./data/i2b2_notes/.
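
As an illustration of the extraction step described above, here is a minimal sketch that writes one plain text file per XML note. It assumes each XML file stores the note body in a single TEXT element directly under the root (the layout commonly used by the I2B2 de-identification files); if your files are structured differently, the element lookup must be adapted. The function names are hypothetical and are not part of Philter.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_note_text(xml_path):
    """Return the note body from one XML file.

    Assumption: the text lives in a <TEXT> child of the root element.
    """
    root = ET.parse(xml_path).getroot()
    text_node = root.find("TEXT")
    if text_node is None or text_node.text is None:
        raise ValueError(f"No TEXT element found in {xml_path}")
    return text_node.text


def xml_dir_to_text_files(xml_dir, out_dir):
    """Write one .txt file per .xml file and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # Only *.xml files are read, so stray files such as .DS_Store are skipped.
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        target = out / (xml_path.stem + ".txt")
        target.write_text(extract_note_text(xml_path), encoding="utf-8")
        written.append(target)
    return written
```

For example, xml_dir_to_text_files("./data/i2b2_xml/", "./data/i2b2_notes/") would produce one text file per note under these assumptions.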

b. Store all input file(s) in the same directory, and create an output directory (if you want the PHI-reduced notes to be stored somewhere other than the default location).

c. Create a configuration file with specified filters (if you do not want to use the default configuration file).

d. Run Philter in the command line using either default or custom parameters.

Use the following command to run a single job and output files in XML format:

python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True

IMPORTANT NOTE: XML-formatted files do NOT have PHI-reduced text. Instead, they contain the original note text with the PHI tags identified by Philter.

If you'd like to output ONLY the PHI-reduced text with asterisks obscuring Philter-identified PHI, add the --outputformat "asterisk" option:

python3 main.py -i ./data/i2b2_notes/ -o ./data/i2b2_results/ -f ./configs/philter_delta.json --prod=True --outputformat "asterisk"
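
To make the idea of asterisk output concrete, the sketch below masks character spans of a note while keeping the text length unchanged. It is only an illustration of the concept and is not Philter's implementation: the function mask_spans, the (start, end) span format, and the choice to leave whitespace untouched are all assumptions made for this example.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace non-whitespace characters inside each (start, end) span.

    Spans are half-open character offsets into `text` (an assumption of
    this sketch). The returned string has the same length as the input.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            if not chars[i].isspace():
                chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    note = "Seen by Dr. Smith on 01/02/2090."
    name_start = note.find("Smith")
    date_start = note.find("01/02/2090")
    spans = [(name_start, name_start + len("Smith")),
             (date_start, date_start + len("01/02/2090"))]
    print(mask_spans(note, spans))
```

Keeping the length unchanged means character offsets in the masked note still line up with the original note.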

To run multiple jobs simultaneously, the input notes for each job must be located in a separate directory to avoid cross-contamination between output files. For example, if you wanted to run Philter on 1000 notes simultaneously using two processes, the two input directories might look like:

  1. ./data/batch1/500_input_notes_batch1/
  2. ./data/batch2/500_input_notes_batch2/

In this example, the following two commands would be used to start running each job in the background:

nohup python3 main.py -i ./data/batch1/500_input_notes_batch1/ -o ./data/i2b2_results_test/ -f ./configs/philter_delta.json --prod=True > ./data/batch1/batch1_terminal_out.txt 2>&1 &
nohup python3 main.py -i ./data/batch2/500_input_notes_batch2/ -o ./data/i2b2_results_test/ -f ./configs/philter_delta.json --prod=True > ./data/batch2/batch2_terminal_out.txt 2>&1 &
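
Preparing the separate input directories by hand becomes tedious with many notes. The following minimal sketch copies the .txt notes from one directory into a chosen number of batch directories in round-robin order. The function name split_into_batches and the batch1, batch2, ... directory naming are assumptions for this example; Philter does not ship this helper.

```python
import shutil
from pathlib import Path


def split_into_batches(notes_dir, batches_root, n_batches):
    """Copy .txt notes into n_batches directories, round-robin.

    Returns the list of batch directories (batch1, batch2, ...).
    """
    notes = sorted(p for p in Path(notes_dir).iterdir()
                   if p.is_file() and p.suffix == ".txt")
    batch_dirs = []
    for b in range(n_batches):
        batch_dir = Path(batches_root) / f"batch{b + 1}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        batch_dirs.append(batch_dir)
    for idx, note in enumerate(notes):
        # Originals are copied, not moved, so the source directory stays intact.
        shutil.copy2(note, batch_dirs[idx % n_batches] / note.name)
    return batch_dirs
```

Each returned directory can then be passed as the -i argument of one background job.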

2. Running Philter WITH evaluation (ground truth annotations required)

a. Create Philter-compatible annotation files using the transformation script located in ./generate_dataset/. This script expects notes in xml format, and transforms each input file into two plain text files: 1) the original note text, and 2) the note text with asterisks obscuring PHI. A properly formatted xml input can be found in ./data/i2b2_xml, and examples of the two outputs can be found in ./data/i2b2_notes and ./data/i2b2_anno, respectively. Additionally, this script creates a .json file that contains the original text from each note, followed by the PHI annotations in json format. An example of this output file can be found at ./data/phi_notes_i2b2.json. This is the file that is passed to Philter with the -x option.

Flags:

-x Path to the directory that contains the note xml files
-o Path to the json file that will contain a summary of the phi in the xml files
-n Path to the directory where you would like to store the plain text notes
-a Path to the directory where you would like to store the plain text annotations

Use the following command to create these input files from notes in XML format:

python3 ./generate_dataset/main_ucsf_updated.py -x ./data/i2b2_xml/ -o ./data/phi_notes_i2b2.json -n ./data/i2b2_notes/ -a ./data/i2b2_anno/

Note: If this command produces an ElementTree.ParseError, you may need to remove .DS_Store from ./data/i2b2_xml.
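
Before running the transformation script, it can be useful to check which files in the input directory cannot be parsed as XML. The sketch below is a hypothetical helper (find_unparseable_files is not part of Philter) that lists such files using Python's standard xml.etree.ElementTree module, so that stray files like .DS_Store can be removed first.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unparseable_files(xml_dir):
    """Return the names of files in xml_dir that fail XML parsing."""
    problems = []
    for path in sorted(Path(xml_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            ET.parse(path)
        except ET.ParseError:
            # Binary or empty files such as .DS_Store end up here.
            problems.append(path.name)
    return problems


if __name__ == "__main__":
    for name in find_unparseable_files("./data/i2b2_xml/"):
        print("Not valid XML:", name)
```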

b-c. See Step 1b-c above

d. Run Philter in evaluation mode using the following command:

python3 main.py -i ./data/i2b2_notes/ -a ./data/i2b2_anno/ -o ./data/i2b2_results/ -x ./data/phi_notes_i2b2.json -f ./configs/philter_delta.json --outputformat "asterisk"

By default, this will output PHI-reduced notes (.txt format) in the specified output directory. If this command is used with the --outputformat i2b2 flag (or with no --outputformat specified, since i2b2 format is the default option), the evaluation script will not be run and the script will output notes with the original text and the Philter PHI tags (.xml format) in the specified output directory.
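
For readers unfamiliar with the recall and precision figures reported in evaluation mode, the sketch below shows how such metrics can be computed at the token level from an original note, its ground truth annotation, and a PHI-reduced output. It is a simplified illustration, not Philter's eval script: it assumes that all three texts split into the same number of whitespace-separated tokens and that a token counts as PHI whenever it differs from the original token. All function names are hypothetical.

```python
def phi_token_counts(original, gold_masked, philter_masked):
    """Count token-level TP, FP, FN, TN (simplified illustration).

    A token is treated as PHI if it differs from the original token,
    i.e. it was replaced by asterisks.
    """
    tp = fp = fn = tn = 0
    for orig_tok, gold_tok, pred_tok in zip(original.split(),
                                            gold_masked.split(),
                                            philter_masked.split()):
        gold_phi = gold_tok != orig_tok
        pred_phi = pred_tok != orig_tok
        if gold_phi and pred_phi:
            tp += 1
        elif pred_phi:
            fp += 1
        elif gold_phi:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

In de-identification, recall is usually the more critical figure, since a false negative means PHI was left in the output.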