
mlRECIST is a machine learning classification algorithm that estimates RECIST using radiology text reports for retrospective research


mlRECIST project


mlRECIST is a machine learning classification algorithm (deep natural language processing [NLP]) we developed that estimates Response Evaluation Criteria in Solid Tumors (RECIST) outcomes from radiology text reports. This model is NOT intended to replicate RECIST or be used in clinical practice or trials but rather is a tool for analysis of retrospective data.

This repository contains our open-source Python code for the model, an example of the output (reduced data), and selected statistical and plotting files. We are unable to share the input data because it is protected health information (PHI). For details, please see our manuscript:

Deep learning to estimate RECIST in patients with NSCLC treated with PD-1 blockade
Kathryn C. Arbour*1,2, Luu Anh Tuan*3, Jia Luo*1, Hira Rizvi1, Andrew J. Plodkowski4, Mustafa Sakhi5, Kevin Huang5, Subba R. Digumarthy6, Michelle S. Ginsberg4, Jeffrey Girshman4, Mark G. Kris1,2, Gregory J. Riely1,2, Adam Yala3, Justin F. Gainor^5, Regina Barzilay^3, and Matthew D. Hellmann^1,2 [accepted, in press] Cancer Discovery 2020. https://doi.org/10.1158/2159-8290.CD-20-0419

*Contributed equally, ^Contributed equally

Author Affiliations: 1 Thoracic Oncology Service, Memorial Sloan Kettering Cancer Center, New York, NY; 2 Department of Medicine, Weill Cornell Medical Center, New York, NY; 3 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA; 4 Department of Radiology, Memorial Sloan Kettering Cancer Center; 5 Department of Medicine, Massachusetts General Hospital, Boston, MA; 6 Department of Radiology, Massachusetts General Hospital, Boston, MA

Code Summary:

Code written for this project (the algorithm, figures, and statistics) includes the following:

  • mlRECIST: TensorFlow-based fully connected natural language processing neural network (implementation details found in the manuscript)
  • Receiver operating characteristic (ROC) curves with area under the curve (AUC) estimates
  • Survival curves using Kaplan-Meier estimates
  • Waterfall plot
  • Scatter plot
  • Vertical stacked bar plot
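For readers unfamiliar with the Kaplan-Meier estimate named in the list above, here is a minimal textbook sketch in plain Python. It is not the repository's statistics code; the function name `kaplan_meier` and its inputs (durations plus a flag marking whether the event was observed) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], observed: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) pairs at each observed event time.

    durations: follow-up time for each patient.
    observed:  True if the event was observed, False if the patient was censored.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")

    survival = 1.0
    curve = []
    # Only times at which at least one event was observed change the estimate.
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    for t in event_times:
        # Patients still under follow-up just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events occurring exactly at time t.
        events = sum(1 for d, o in zip(durations, observed) if o and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve
```

Censored patients contribute to the at-risk count until their last follow-up time but never lower the curve themselves.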

Data Summary:

Data includes the following:

  • Reduced dataset of output from mlRECIST for the training, internal validation, and external validation sets

Installation:

The scripts depend on the following packages and resources:

  • TensorFlow 1.15
  • Gensim
  • NumPy
  • Matplotlib
  • Pandas
  • Scikit-learn
  • GloVe embeddings (glove.840B.300d)

Usage:

Data format:

The data used (including input, ground truth, and accompanying information) was formatted as an Excel file (.xlsx) with these columns in this exact order:

  • Patient ID, anonymized
  • Treatment start date (MM/DD/YYYY)
  • Treatment setting (clinical trial, standard of care)
  • Objective Response per RECIST (CR, PR, SD, POD)
  • Date of radiologic progression-free survival (MM/DD/YYYY)
  • PFS censor (0, 1)
  • Scan timepoint (Baseline, ontx, progression), where ontx = on treatment: during treatment, or before progression if treatment was stopped
  • Scan include? (Y, N) (defined as: was the scan included in RECIST read?)
  • Date of scan (MM/DD/YYYY)
  • Type of scan (CT, PET, MR)
  • Scan type specified (CT CH/ABD/PEL W/ CON, etc.)
  • Scan report text (the entirety of the text report)
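Because the column order is fixed, one row of the spreadsheet can be mapped to named fields by position. The sketch below is illustrative only and is not part of the repository: the field names in `COLUMNS` and the helper `parse_row` are hypothetical, and it assumes every cell has already been read as a non-empty string (a spreadsheet reader may instead return date objects).

```python
from datetime import date, datetime
from typing import Dict, Sequence

# Hypothetical field names; the README fixes only the column order.
COLUMNS = (
    "patient_id", "treatment_start", "treatment_setting", "objective_response",
    "pfs_date", "pfs_censor", "scan_timepoint", "scan_included",
    "scan_date", "scan_type", "scan_type_detail", "report_text",
)
DATE_COLUMNS = ("treatment_start", "pfs_date", "scan_date")
VALID_RESPONSES = {"CR", "PR", "SD", "POD"}


def parse_date(value: str) -> date:
    """Parse a MM/DD/YYYY string into a date."""
    return datetime.strptime(value.strip(), "%m/%d/%Y").date()


def parse_row(cells: Sequence[str]) -> Dict[str, object]:
    """Map one row of 12 string cells to named, lightly validated fields."""
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(cells)}")
    row: Dict[str, object] = dict(zip(COLUMNS, cells))
    for name in DATE_COLUMNS:
        row[name] = parse_date(str(row[name]))
    if row["objective_response"] not in VALID_RESPONSES:
        raise ValueError(f"unknown response: {row['objective_response']!r}")
    row["pfs_censor"] = int(str(row["pfs_censor"]))
    if row["pfs_censor"] not in (0, 1):
        raise ValueError("PFS censor must be 0 or 1")
    return row
```

Checking the column count up front catches spreadsheets whose columns were reordered or dropped before any report text reaches the model.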

Ultimately, the only column used as input to the algorithm is:

  • Scan report text

The model estimates three RECIST outcomes of interest:

  • best overall response (BOR) (CR, PR, SD, POD)
  • progression (Y, N)
  • progression date (MM/DD/YYYY)

Specifically, the following columns served as ground truth:

  • Objective Response per RECIST
  • Date of radiologic progression-free survival
  • PFS censor
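As a rough illustration of how these ground-truth columns can be combined, the sketch below derives a progression-free survival duration in days from the treatment start date and the PFS date, and carries the censor flag along unchanged. The function name `pfs_time` is hypothetical, and since this README does not state which censor value (0 or 1) marks an observed progression, the flag is deliberately left uninterpreted.

```python
from datetime import date
from typing import Tuple


def pfs_time(treatment_start: date, pfs_date: date, pfs_censor: int) -> Tuple[int, int]:
    """Return (duration in days, censor flag) for one patient."""
    if pfs_censor not in (0, 1):
        raise ValueError("PFS censor must be 0 or 1")
    days = (pfs_date - treatment_start).days
    if days < 0:
        raise ValueError("PFS date precedes treatment start date")
    # The censor flag is passed through as-is; its meaning is not assumed here.
    return days, pfs_censor
```

For example, a treatment start of 01/15/2018 and a PFS date of 06/30/2018 give a duration of 166 days.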

Running code:

Predicting BOR:

To predict BOR, run the following command: (see the list of arguments below)

python ./model/src/test_predict_objective.py [optional arguments]

The prediction file is in the folder: log_test/predict_objective

Predicting progression (Y, N):

To predict progression (Y, N), run the following command:

python ./model/src/test_predict_progression.py [optional arguments]

The prediction file is in the folder: log_test/predict_progression

Predicting progression date:

To predict the progression date, run the following command:

python ./model/src/test_predict_date.py [optional arguments]

The prediction file is in the folder: log_test/predict_date

List of arguments and their format:

Path files:

  • Training data: --data_source path_to_training_data
  • Testing data: --data_test path_to_testing_data
  • Embedding file: --embedding path_to_embedding_file

Parameters of models:

  • Embedding size: --embedding_size value
  • Hidden dimension size: --hidden_dim value
  • Dropout (keep probability): --dropout_keep_prob value
  • Learning rate: --learning_rate value
  • Batch size: --batch_size value
  • Number of training epochs: --num_epochs value
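To show how the three scripts and the flags listed above fit together, the following sketch assembles a command line as a list of strings. The helper `build_command`, the task keys in `SCRIPTS`, and any file paths passed to it are hypothetical; only the script paths and flag names come from this README.

```python
from typing import Dict, List

# Script paths as given in this README, keyed by hypothetical task names.
SCRIPTS = {
    "objective": "./model/src/test_predict_objective.py",
    "progression": "./model/src/test_predict_progression.py",
    "date": "./model/src/test_predict_date.py",
}

# Flag names as listed in this README (without the leading dashes).
KNOWN_FLAGS = {
    "data_source", "data_test", "embedding", "embedding_size", "hidden_dim",
    "dropout_keep_prob", "learning_rate", "batch_size", "num_epochs",
}


def build_command(task: str, options: Dict[str, object], python: str = "python") -> List[str]:
    """Return the argument list for one prediction task."""
    if task not in SCRIPTS:
        raise ValueError(f"unknown task: {task!r}")
    command = [python, SCRIPTS[task]]
    for flag, value in options.items():
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"unknown flag: --{flag}")
        command += [f"--{flag}", str(value)]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root; building it as a list rather than a single string avoids shell quoting problems with paths that contain spaces.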

The parameters can be tuned over the following values: hidden dimension size (200, 300, 500), dropout keep probability (0.8, 0.9, 1.0), learning rate (0.001, 0.005, 0.01), and batch size (1, 2, 4, 8).
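One simple way to explore these values is an exhaustive grid. The sketch below only enumerates the combinations; it does not train anything, and it is not necessarily the search procedure used for the manuscript. The names `GRID` and `iter_configs` are hypothetical, and mapping "dropout" to the `--dropout_keep_prob` flag is an assumption based on the argument list.

```python
from itertools import product
from typing import Dict, Iterator, Sequence

# Candidate values taken from this README, keyed by flag name.
GRID: Dict[str, Sequence[float]] = {
    "hidden_dim": (200, 300, 500),
    "dropout_keep_prob": (0.8, 0.9, 1.0),
    "learning_rate": (0.001, 0.005, 0.01),
    "batch_size": (1, 2, 4, 8),
}


def iter_configs(grid: Dict[str, Sequence[float]] = GRID) -> Iterator[Dict[str, float]]:
    """Yield one dict of flag values per combination in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))
```

The full grid contains 3 × 3 × 3 × 4 = 108 configurations, so each added candidate value multiplies the number of training runs.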

Please contact the corresponding authors Regina Barzilay or Matthew D. Hellmann for any questions or comments regarding the paper.

Authors: Jia Luo (@luoj2) and Anh Tuan Luu (@tuanluu)