Implementation of FE5 variables using the konsepy NLP framework.
Konsepy works around the idea of a concept. A concept is a semantic category which might have multiple
representations in text. For example, a concept might be SOCIAL_ISOLATION, which targets any text describing social
isolation (e.g., 'no friends', 'lacks social support', etc.). Another could be COUGHING, which might be described in
text as 'coughs', 'hacking', 'wheeze', etc. The selection of concepts will depend on the particular application. If you
only care about a single output category, it's sufficient to just have a single target concept.

Each concept is assigned a set of regular expressions which are used to assign the concept to a section of text. These
regular expressions each receive an individual label.
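
To make the idea concrete, here is a minimal sketch of a concept as a set of labelled regular expressions. It uses only the Python standard library and does not use konsepy's own API; the `Coughing` enum, the patterns, and the `find_labels` helper are all hypothetical and chosen only for illustration.

```python
import re
from enum import Enum


class Coughing(Enum):
    """Hypothetical categories for an illustrative COUGHING concept."""
    YES = 1
    NO = 2


# Each regular expression is paired with the label it assigns when it matches.
# More specific (negated) patterns are listed first so they claim their text first.
REGEXES = [
    (re.compile(r'\b(?:no|denies|without)\s+(?:cough\w*|wheez\w*)', re.I), Coughing.NO),
    (re.compile(r'\b(?:cough\w*|hacking|wheez\w*)', re.I), Coughing.YES),
]


def find_labels(text):
    """Yield (label, matched_text) pairs; a span already claimed by an earlier pattern is skipped."""
    claimed = []
    for pattern, label in REGEXES:
        for m in pattern.finditer(text):
            if any(m.start() < end and start < m.end() for start, end in claimed):
                continue
            claimed.append((m.start(), m.end()))
            yield label, m.group()


results = list(find_labels('Patient denies cough. Hacking at night, some wheeze.'))
for label, matched in results:
    print(label, repr(matched))
```

Counting how often each label occurs in a note gives the kind of per-category counts that the pipeline writes to its output files.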
- Python 3.9+
- Download/clone this project
  - The path to this location will be referred to as $PATH in the instructions below (this might be C:\code, etc.)
- (Optional, but recommended) Set up a virtual environment to isolate this particular installation
  cd $PATH\fe5_konsepy
  python -m venv .venv
  - The full path to python.exe might need to be specified in this command
- Activate:
  - Powershell: .venv/scripts/activate.ps1
  - Linux/Mac: source .venv/bin/activate
- Install required packages:
  - pip install -r requirements.txt
  - OR: pip install .
- A corpus file
  - Other data sources may be supported in the future; for now, it must be csv, jsonl, or sas7bdat
  - Columns (these can be configured to use different names, but it's easiest if you select these names)
    - studyid: (required) subject-level identifier; if not important/relevant, set all instances to 1
    - note_id: (required) note-level identifier; unique identifier for each note
    - note_text: (required) text associated with each note
    - note_date: (optional) date of note; not used by the algorithm, so probably easiest to ignore
    - note_line: (optional) if a note is broken into multiple segments (see example in sample/corpus_lined.csv), specify this to join them
      - If using note_line, all portions of the note are assumed to appear together in the dataset (i.e., order by note_id, note_line)
  - The corpus may contain other columns, which will be ignored
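
The ordering requirement for note_line can be illustrated with a short sketch. This is not the pipeline's own joining code: the `join_note_lines` helper, the space separator, and the sample rows are assumptions made for illustration. It shows why segments of one note must be adjacent, since `itertools.groupby` only groups consecutive rows.

```python
import csv
import io
from itertools import groupby

# Tiny in-memory stand-in for a lined corpus; the rows are invented for illustration.
LINED_CORPUS = """studyid,note_id,note_line,note_text
1,100,0,Patient reports
1,100,1,smoking daily.
1,101,0,No complaints.
"""


def join_note_lines(rows, sep=' '):
    """Yield one record per note, assuming rows are already ordered by note_id, note_line."""
    for note_id, segments in groupby(rows, key=lambda r: r['note_id']):
        segments = list(segments)
        yield {
            'studyid': segments[0]['studyid'],
            'note_id': note_id,
            # The separator is an assumption; the real pipeline may join differently.
            'note_text': sep.join(s['note_text'] for s in segments),
        }


notes = list(join_note_lines(csv.DictReader(io.StringIO(LINED_CORPUS))))
for note in notes:
    print(note)
```

If the rows for a note were scattered through the file, each scattered run would be treated as a separate note, which is why the dataset should be sorted first.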
- To run the pipeline on a CSV file:
  python src/run_all.py --input-files sample/corpus.csv --outdir out --id-label studyid
- To run just the smoking pipeline on a CSV file:
  python src/run_concept.py --input-files sample/corpus.csv --outdir out --id-label studyid --concept smoking
- To run just the suicide_attempt pipeline on a CSV file:
  python src/run_concept.py --input-files sample/corpus.csv --outdir out --id-label studyid --concept suicide_attempt
If your corpus file uses different headers/variables/labels, specify them with the following command line arguments (the defaults are shown):
--id-label studyid
--noteid-label note_id
--notedate-label note_date
--notetext-label note_text
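
If the pipeline is launched from another script, the label options can be assembled programmatically. The sketch below only builds an argument list from the flags documented above; the `build_command` helper and the example column names `patient_id` and `text` are hypothetical.

```python
import sys

# Defaults as listed above; override only the ones your corpus names differently.
DEFAULT_LABELS = {
    '--id-label': 'studyid',
    '--noteid-label': 'note_id',
    '--notedate-label': 'note_date',
    '--notetext-label': 'note_text',
}


def build_command(input_file, outdir, **overrides):
    """Return an argv list for src/run_all.py with any non-default column labels."""
    labels = dict(DEFAULT_LABELS)
    for name, value in overrides.items():
        flag = '--' + name.replace('_', '-')
        if flag not in labels:
            raise ValueError(f'unknown label option: {flag}')
        labels[flag] = value
    cmd = [sys.executable, 'src/run_all.py', '--input-files', input_file, '--outdir', outdir]
    for flag, value in labels.items():
        cmd += [flag, value]
    return cmd


# Hypothetical corpus whose subject and text columns use non-default names.
cmd = build_command('sample/corpus.csv', 'out', id_label='patient_id', notetext_label='text')
print(' '.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project root.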
Once the target concepts have been identified, they will be output to CSV files in the output directory, in a folder
whose name starts with run_all_{YYYYMMDD_HHMMSS}. Within this folder:
- For patient-level information, consider mrn_category_counts.csv
- For note-level information, consider notes_category_counts.csv
The postprocessing step attempts to summarize the disparate counts/values extracted into a single concluding value to be populated to the FE Table. In certain cases, a single note might produce multiple output values (i.e., 1 note will equate to multiple lines in the FE table).
While it is recommended to run postprocessing at the note level (notes_category_counts.csv), you can perform
summarization at a different level of analysis. E.g., to analyze by day/encounter, we might need to first group multiple
notes together. The steps might be:
- Merge notes_category_counts.csv with a table that maps note_id to encounter_id
- Group by mrn (or studyid) and encounter_id and aggregate the counts with sum.
  - NB: 0 is left empty rather than output, so you may need to transform these missing values to 0 first
- Run the postprocessing
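
The merge and group steps above can be sketched with the standard library alone. The column layout of notes_category_counts.csv shown here, the encounter table, and the `aggregate_by_encounter` helper are assumptions for illustration; a real file may carry extra metadata columns that would need to be added to `ID_COLUMNS`.

```python
import csv
import io
from collections import defaultdict

# Invented stand-ins for notes_category_counts.csv and a note-to-encounter map.
COUNTS = """studyid,note_id,SmokingCategory.CURRENT,SmokingCategory.HISTORY
1,100,2,
1,101,,1
1,102,1,
"""
ENCOUNTERS = """note_id,encounter_id
100,E1
101,E1
102,E2
"""

# Columns that identify a row rather than hold a count (assumed layout).
ID_COLUMNS = {'studyid', 'note_id'}


def aggregate_by_encounter(counts_rows, encounter_rows):
    """Sum category counts per (studyid, encounter_id), treating empty cells as 0."""
    note_to_encounter = {r['note_id']: r['encounter_id'] for r in encounter_rows}
    totals = defaultdict(lambda: defaultdict(int))
    for row in counts_rows:
        key = (row['studyid'], note_to_encounter[row['note_id']])
        for column, value in row.items():
            if column not in ID_COLUMNS:
                totals[key][column] += int(value or 0)  # empty string means 0
    return {key: dict(cols) for key, cols in totals.items()}


totals = aggregate_by_encounter(
    csv.DictReader(io.StringIO(COUNTS)),
    csv.DictReader(io.StringIO(ENCOUNTERS)),
)
for key, cols in totals.items():
    print(key, cols)
```

The aggregated rows would then be written back to a CSV file and passed to the postprocessing script in place of the note-level file.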
There are two different files for running post-processing, depending on whether the target is smoking status or history of suicide attempt. They accept the same arguments. These will create a new file with all of the same metadata, but replace the extracted concepts with the following headers/variables/columns:
- feature: the CUI corresponding to the concept
- fe_codetype: UC (for UMLS CUI)
- feature_status: any contextual conditions on the feature (e.g., negated)
  - A: affirmed
  - X: other subject
  - H: historical
  - N: negated
smoking
python src/postprocess_smoking.py --infile notes_category_counts.csv --outfile fe_table_smoking.csv
suicide_attempt
python src/postprocess_hx_attempted_suicide.py --infile notes_category_counts.csv --outfile fe_table_hxsa.csv
It is probably important to store the version information in a separate table, linking with a unique PipelineID (or
FeatureID as currently defined).
- Pipeline name: fe5_konsepy
- Source: https://github.com/kpwhri/fe5_konsepy
- Version: 0.0.3
  - You can get this information in any of the following ways:
    - Looking in src/fe5_konsepy/__init__.py
    - Looking at the most recent version at the top of CHANGELOG.md
    - Running git tag -l and selecting the largest (likely bottom) version number
- Concept name: The name of the concept, equivalent to the name of the script
- Concept label: The label given to the output of the concept; concepts will appear in the output CSV as CONCEPT_LABEL.CATEGORY
- Categories: The actual categories for each concept that are identified
  - Each category will appear in the output CSV file as CONCEPT_LABEL.CATEGORY as the variable, and a count of occurrences as the value
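
When reading the output programmatically, the CONCEPT_LABEL.CATEGORY column names can be split back into their two parts. The sketch below is illustrative only: the `split_category_column` helper and the example header are hypothetical, and the real output may contain other columns.

```python
def split_category_column(column):
    """Split 'CONCEPT_LABEL.CATEGORY' into (label, category); return None for other columns."""
    label, sep, category = column.rpartition('.')
    if not sep or not label or not category:
        return None
    return label, category


# Hypothetical header row from an output CSV file.
header = ['studyid', 'note_id', 'SmokingCategory.CURRENT', 'SuicideAttempt.YES']
parsed = {col: split_category_column(col) for col in header}
for col, parts in parsed.items():
    print(col, parts)
```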
- Concept name: suicide_attempt
- Concept label: SuicideAttempt
- Concept definition: in specified note, had a suicide or self-harm attempt anytime in the past
- Categories:
  - YES: evidence for past attempt
  - NO: denied past attempt
  - FAMILY: past attempt associated with someone else (not necessarily a family member)
  - CODE: attempt identified from ICD code
- Post-processing:
  - C0455507: History of attempted suicide
    - A: affirmative
    - N: negated
    - X: other subject (e.g., concept is relevant to a family member, not the patient)
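
As a rough illustration of how category counts for one note might become FE table rows, consider the sketch below. The mapping from category to feature_status is an assumption inferred from the lists above (in particular, treating CODE as affirmed is a guess), and the `to_fe_rows` helper is hypothetical; the actual rules in src/postprocess_hx_attempted_suicide.py may differ.

```python
CUI = 'C0455507'  # History of attempted suicide

# Assumed mapping from extracted category to feature_status; the real script may
# apply different or additional rules (e.g., precedence between categories).
STATUS_BY_CATEGORY = {
    'YES': 'A',     # affirmed
    'CODE': 'A',    # assumption: an ICD code counts as affirmed
    'NO': 'N',      # negated
    'FAMILY': 'X',  # other subject
}


def to_fe_rows(note_id, counts):
    """Return one FE-table row per distinct status with a non-zero count."""
    statuses = []
    for category, status in STATUS_BY_CATEGORY.items():
        if counts.get(category, 0) > 0 and status not in statuses:
            statuses.append(status)
    return [
        {'note_id': note_id, 'feature': CUI, 'fe_codetype': 'UC', 'feature_status': s}
        for s in statuses
    ]


# A note mentioning both the patient's own attempt and a relative's attempt.
rows = to_fe_rows('100', {'YES': 2, 'FAMILY': 1, 'NO': 0})
for row in rows:
    print(row)
```

Here one note yields two rows, which matches the point that a single note can equate to multiple lines in the FE table.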
- Concept name: smoking
- Concept label: SmokingCategory
- Concept definition: in specified note, evidence for current or historical smoking status
- Categories:
  - NO: no evidence for smoking
  - CURRENT: current evidence
  - HISTORY: historical evidence
  - NEVER: evidence for never smoking
  - YES: evidence for smoking
- Post-processing:
  - C0337664: Smoker - Persons with a history or habit of SMOKING
    - A: affirmed
    - N: negated
    - H: historical
  - C0337672: Non-smoker
    - A: affirmed