Python library for Trial Pathfinder, an AI framework to systematically evaluate clinical trial eligibility criteria. TrialPathfinder provides functions for encoding eligibility criteria, emulating existing trials under combinations of eligibility rules, evaluating individual eligibility rules with Shapley values, and suggesting data-driven criteria.
Reference paper: Evaluating Oncology Trial Eligibility Criteria using Real-World Data and AI.
The TrialPathfinder package is available on PyPI:
pip install TrialPathfinder
Manual installation is also possible: download this GitHub repository and run
cd TrialPathfinder/
python setup.py install --user
Here is a quick guide to using TrialPathfinder. For more details, see tutorial/tutorial.ipynb.
import TrialPathfinder as tp
###### Encode Eligibility Criteria #####
# Create cohort selection object
cohort = tp.cohort_selection(patientids, name_PatientID='PatientID')
# Add the data tables needed in the eligibility criterion
cohort.add_table(name_table1, table1)
# Add individual eligibility criterion
cohort.add_rule(rule1)
###### Emulate Existing Trials and Survival Analysis ######
# Emulate a trial under a combination of eligibility rules, names_rules
# (an empty list, names_rules=[], indicates fully relaxed criteria).
HR, CI, data_cox = tp.emulate_trials(cohort, features, drug_treatment, drug_control, names_rules)
###### Evaluate Individual Criterion ######
# Return the Shapley values for each rule in names_rules
shapley_values = tp.shapley_computation(cohort, features, drug_treatment, drug_control, names_rules)
###### Criteria Relaxation - Data-driven Criteria ######
# Select all the rules with Shapley value less than 0
# (boolean-mask indexing requires names_rules and shapley_values to be NumPy arrays).
names_rules_relax = names_rules[shapley_values<0.]
# Survival analysis on the data-driven criteria
HR, CI, data_cox = tp.emulate_trials(cohort, features, drug_treatment, drug_control, names_rules_relax)
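To make the Shapley step above concrete, here is a minimal, self-contained sketch of an exact Shapley computation on a toy value function. It does not call TrialPathfinder and is not a description of how `tp.shapley_computation` is implemented; `exact_shapley` and `toy_hr` are hypothetical names, and the hazard-ratio numbers are made up purely for illustration.

```python
from itertools import permutations


def exact_shapley(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every possible ordering of the players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        chosen = frozenset()
        for p in order:
            with_p = chosen | {p}
            totals[p] += value(with_p) - value(chosen)
            chosen = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_hr(rules):
    """Hypothetical hazard ratio for a set of rules (invented numbers)."""
    hr = 0.80  # value under fully relaxed criteria (no rules applied)
    if "Age" in rules:
        hr -= 0.05
    if "Platelets" in rules:
        hr += 0.02
    if "ECOG" in rules:
        hr += 0.03
    return hr


rules = ["Age", "Platelets", "ECOG"]
shapley = exact_shapley(rules, toy_hr)

# Mirror the selection step above: keep the rules with negative Shapley value.
kept = [r for r in rules if shapley[r] < 0]
print(shapley)
print(kept)
```

Enumerating every ordering grows factorially with the number of rules, so this exact form is only practical for a handful of rules; it is meant to show what the values represent, not how to compute them at scale.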
We highly recommend reading tutorial/tutorial.ipynb.
TrialPathfinder reads tables as Pandas dataframes (pd.DataFrame) by default. Date columns should be stored as datetime values (use pd.to_datetime to convert them if they are not).
1. Features:
- Patient ID
- Treatment information
  - Drug name.
  - Start date.
- Date of outcome. For example, if overall survival (OS) is used as the metric, the date of outcome is the date of death. If progression-free survival (PFS) is used as the metric, the date of outcome is the date of progression.
- Date of last visit. The date of the patient's last recorded visit, used for censoring.
- Covariates (optional): adjusted for to emulate blinded assignment, used by inverse probability of treatment weighting (IPTW) or propensity score matching (PSM). Some examples: age, gender, composite race/ethnicity, histology, smoking status, staging, ECOG, and biomarker status.
2. Tables used by eligibility criteria.
- Use the same Patient ID as the features table.
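As a rough illustration of the input layout described above, the sketch below builds a small features table and one criteria table with pandas and converts the date columns with `pd.to_datetime`. Only `PatientID` and `StartDate` appear elsewhere in this document; every other column name and all values are hypothetical placeholders, not names required by TrialPathfinder.

```python
import pandas as pd

# Features table: one row per patient (all values invented).
features = pd.DataFrame({
    "PatientID": ["P1", "P2", "P3"],
    "DrugName": ["drug A", "drug B", "drug A"],            # hypothetical column name
    "StartDate": ["2015-01-10", "2015-03-02", "2016-07-21"],
    "DateOfDeath": ["2016-02-01", None, None],             # outcome date when OS is the metric
    "LastVisitDate": ["2016-02-01", "2017-05-30", "2016-12-15"],
    "Age": [63, 71, 58],                                   # example covariate
})

# Convert date columns from strings to datetime; missing dates become NaT.
for col in ["StartDate", "DateOfDeath", "LastVisitDate"]:
    features[col] = pd.to_datetime(features[col])

# A table used by eligibility criteria, keyed by the same PatientID.
lab = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "LabName": ["Platelet count", "Platelet count"],
    "TestDate": pd.to_datetime(["2015-01-08", "2015-02-25"]),
    "LabValue": [150, 80],
})

print(features.dtypes)
```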
We built a computational workflow to encode the eligibility criteria described in trial protocols as standardized instructions that Trial Pathfinder can parse for cohort selection.
1. Basic logic.
- The name of the criterion is written in the first row.
- A new statement starts with “#inclusion” or “#exclusion” to indicate the criterion’s type. To specify how patients with missing entries for the criterion are handled, add “(missing include)” or “(missing exclude)”. By default, patients with missing entries are included.
- Data name format: “Table['featurename']”. For example, “demographics['birthdate']” denotes the date-of-birth column in the table demographics.
- Comparison operators: ==, !=, <, <=, >, >=.
- Logic: AND, OR.
- Other operations: MIN, MAX, ABS.
- Time is encoded as “DAYS(80)”: 80 days; “MONTHS(4)”: 4 months; “YEARS(3)”: 3 years. In the examples here, these are written with an “@” prefix, e.g. @DAYS(28).
Example: criterion "Age" - include patients who were at least 18 years old when they received the treatment.
Age
#Inclusion
features['StartDate'] >= demographics['BirthDate'] + @YEARS(18)
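For readers who think in pandas, the Age rule above corresponds roughly to the following filter. This is only an equivalent-logic sketch on invented data, not TrialPathfinder's parser. It assumes that 18 years means 18 calendar years and that the tables are joined on `PatientID`; the final mask keeps patients with a missing birth date, following the default of including missing entries.

```python
import pandas as pd

features = pd.DataFrame({
    "PatientID": ["P1", "P2", "P3"],
    "StartDate": pd.to_datetime(["2015-01-10", "2015-03-02", "2016-07-21"]),
})
demographics = pd.DataFrame({
    "PatientID": ["P1", "P2", "P3"],
    "BirthDate": pd.to_datetime(["1950-06-01", "2000-01-01", None]),
})

merged = features.merge(demographics, on="PatientID", how="left")

# Earliest treatment start date at which the patient is 18 (NaT if birth date is missing).
threshold = merged["BirthDate"] + pd.DateOffset(years=18)

# Inclusion rule, with missing entries included by default.
keep = (merged["StartDate"] >= threshold) | threshold.isna()
eligible = merged.loc[keep, "PatientID"].tolist()
print(eligible)
```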
2. Complex rule with hierarchy.
- Rows are executed in sequential order.
- The tables are prepared before the last row.
- The patients are selected at the last row.
Example: criterion "Platelets" - include patients whose platelet count is ≥ 100 x 10^3/μL.
To encode this criterion, we follow the procedure:
- Prepare the lab table:
  - Pick the lab tests for platelet count.
  - Keep lab tests dated within a -28 to +0 day window around the treatment start date.
  - Use the record closest to the treatment start date for selection.
- Select patients: lab value of at least 100 x 10^3/μL.
Platelets
#Inclusion
(lab['LabName'] == 'Platelet count')
(lab['TestDate'] >= features['StartDate'] - @DAYS(28) ) AND (lab['TestDate'] <= features['StartDate'])
MIN(ABS(lab['TestDate'] - features['StartDate']))
lab['LabValue'] >= 100
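The four rows of the Platelets rule can be read as the following pandas steps. This is a sketch of equivalent logic on invented data, not the library's implementation; the join on `PatientID` and the helper column `gap_days` are assumptions made for the illustration, and handling of patients with no lab record in the window is left out.

```python
import pandas as pd

features = pd.DataFrame({
    "PatientID": ["P1", "P2"],
    "StartDate": pd.to_datetime(["2015-01-10", "2015-03-02"]),
})
lab = pd.DataFrame({
    "PatientID": ["P1", "P1", "P1", "P2", "P2"],
    "LabName": ["Platelet count", "Platelet count", "Hemoglobin",
                "Platelet count", "Platelet count"],
    "TestDate": pd.to_datetime(["2014-12-20", "2015-01-08", "2015-01-09",
                                "2015-02-25", "2014-12-01"]),
    "LabValue": [90, 150, 13, 80, 200],
})

# Row 1: keep only platelet count tests.
rows = lab[lab["LabName"] == "Platelet count"]

# Row 2: keep tests within the 28 days up to and including the start date.
rows = rows.merge(features[["PatientID", "StartDate"]], on="PatientID")
in_window = (rows["TestDate"] >= rows["StartDate"] - pd.Timedelta(days=28)) & (
    rows["TestDate"] <= rows["StartDate"]
)
rows = rows[in_window]

# Row 3: per patient, keep the record closest to the start date.
rows = rows.assign(gap_days=(rows["TestDate"] - rows["StartDate"]).abs().dt.days)
closest = rows.loc[rows.groupby("PatientID")["gap_days"].idxmin()]

# Row 4: select patients whose closest value meets the threshold.
eligible = closest.loc[closest["LabValue"] >= 100, "PatientID"].tolist()
print(eligible)
```

In this toy data, P2 has an out-of-window value of 200, but the in-window record closest to the start date is 80, so only P1 is selected.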