tnm-stage-classifier

Extract TNM cancer staging from pathology notes.

Primary language: Jupyter Notebook. License: MIT.

Generalizable and Automated Classification of TNM Stage from Pathology Reports with External Validation

This repository holds the implementation of Big Bird TNM Extraction from Notes (BBTEN).

A preprint of the study is available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10327265/

Cancer staging is an essential clinical attribute informing patient prognosis and clinical trial eligibility. However, it is not routinely recorded in structured electronic health records. Here, we present a generalizable method for the automated classification of TNM stage directly from pathology report text. We train a BERT-based model using publicly available pathology reports across approximately 7,000 patients and 23 cancer types. We explore the use of different model types, with differing input sizes, parameters, and model architectures. Our final model goes beyond term extraction, inferring TNM stage from context when it is not stated explicitly in the report text. As external validation, we test our model on almost 8,000 pathology reports from Columbia University Medical Center, finding that our trained model achieved an AU-ROC of 0.815–0.942. This suggests that our model can be applied broadly to other institutions without additional institution-specific fine-tuning.

Models

Models generated by this study can be found on Hugging Face:

https://huggingface.co/jkefeli/CancerStage_Classifier_T
https://huggingface.co/jkefeli/CancerStage_Classifier_N
https://huggingface.co/jkefeli/CancerStage_Classifier_M
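
As a minimal sketch of how these checkpoints might be loaded, the snippet below maps each TNM target to its Hugging Face repository and loads it with the `transformers` Auto classes. The helper name `load_classifier` and its `tokenizer_id` parameter are hypothetical. Whether each model repository ships its own tokenizer files is an assumption, so pass a separate tokenizer identifier if it does not. The Demo notebook remains the reference for the authors' actual usage.

```python
# Hugging Face repositories listed above, keyed by TNM target.
MODEL_IDS = {
    "T": "jkefeli/CancerStage_Classifier_T",
    "N": "jkefeli/CancerStage_Classifier_N",
    "M": "jkefeli/CancerStage_Classifier_M",
}


def load_classifier(target, tokenizer_id=None):
    """Return (tokenizer, model) for target 'T', 'N', or 'M'.

    Hypothetical helper. Assumption: the tokenizer can be loaded from
    the model repository unless tokenizer_id says otherwise.
    """
    if target not in MODEL_IDS:
        raise ValueError(f"target must be one of {sorted(MODEL_IDS)}")
    # Imported lazily so the mapping above is usable without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = MODEL_IDS[target]
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id or model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling `load_classifier("T")` downloads the T-stage checkpoint on first use, so it needs network access and the pinned `transformers` and `torch` versions.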

We have included a small dataset, the T14 TCGA pathology report held-out test set, to demonstrate the utility and ease of use of the trained models. See the Demo folder for data and code.

Requirements

The following Python package versions were used in model training and testing:

numpy==1.26.4
pandas==2.2.2
scikit-learn==1.4.2
scipy==1.13.0
seaborn==0.11.2
transformers==4.40.2
torch==2.3.0
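
One way to confirm that a local environment matches these pins is sketched below. It uses only the standard library (`importlib.metadata`). The function names `parse_pins` and `find_mismatches` are hypothetical and are not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the training/testing list above.
PINNED = """
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.4.2
scipy==1.13.0
seaborn==0.11.2
transformers==4.40.2
torch==2.3.0
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, _, wanted = line.partition("==")
            pins[name] = wanted
    return pins


def find_mismatches(pins, lookup=version):
    """Return {name: (wanted, installed)} for every pin that is not met.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(parse_pins(PINNED)).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

The comparison is an exact string match, so a build with a local version suffix (for example a CUDA-specific `torch` wheel) would be reported as a mismatch even when the base version agrees.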

The following Python package versions were used in Llama 3 fine-tuning:

accelerate==0.30.0
bitsandbytes==0.43.1
evaluate==0.4.2
huggingface-hub==0.23.0
peft==0.10.0

Use

To apply one of the TNM models to an external dataset, use the code provided in the Demo Jupyter notebook. Replace the dataset, and ensure that the target labels in the new dataset match those of the trained models (T14, N03, M01). Ensure that the locally installed Python package versions match those listed above.
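
The label check described above can be sketched as follows. The snippet assumes that T14, N03, and M01 denote the stage ranges T1 to T4, N0 to N3, and M0 to M1, and that the new dataset stores labels as integers in those ranges. Both points are assumptions to confirm against the Demo data before relying on the check. The names `EXPECTED_LABELS` and `check_labels` are hypothetical.

```python
# Assumed meaning of the target names (not confirmed by the README):
# T14 -> T1..T4, N03 -> N0..N3, M01 -> M0..M1, stored as integers.
EXPECTED_LABELS = {
    "T14": {1, 2, 3, 4},
    "N03": {0, 1, 2, 3},
    "M01": {0, 1},
}


def check_labels(target, labels, expected=EXPECTED_LABELS):
    """Return the sorted label values that are not valid for target.

    An empty list means every label in the new dataset is one the
    trained model for that target can predict.
    """
    if target not in expected:
        raise KeyError(f"unknown target {target!r}; use one of {sorted(expected)}")
    return sorted(set(labels) - expected[target])
```

For example, `check_labels("N03", [0, 1, 3, 4])` returns `[4]`, flagging a value outside the assumed N0 to N3 range. If the Demo data uses a different encoding, pass a corrected mapping through the `expected` argument.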

How to cite

Kefeli J, Tatonetti N. Generalizable and Automated Classification of TNM Stage from Pathology Reports with External Validation. medRxiv [Preprint]. 2023 Jun 27:2023.06.26.23291912. doi: 10.1101/2023.06.26.23291912. PMID: 37425701; PMCID: PMC10327265.

@article{kefeli2023tnmstage,
  title={Generalizable and Automated Classification of TNM Stage from Pathology Reports with External Validation},
  author={Jenna Kefeli and Nicholas Tatonetti},
  journal={medRxiv},
  doi={10.1101/2023.06.26.23291912},
  year={2023}
}