This is a repository of the code used for the paper "Effective Ways to Build and Evaluate Individual Survival Distributions", published in the Journal of Machine Learning Research (JMLR). This code can be used to run 6 different survival prediction models (Kaplan-Meier, Cox proportional-hazards, Accelerated Failure Time, Random Survival Forests, Cox Elastic-Net, and Multi-task Logistic Regression) and to evaluate those models across 5 different metrics (Concordance, Single/Integrated Brier score, L1-loss, 1-Calibration, and D-Calibration).
If you are interested in using the datasets from the paper, you can access the NACD (and subsequently the NACD-Col) data from the Patient Specific Survival Prediction (PSSP) website under "Public Predictors", or use this direct download link. Note that here the censored bit is flipped from the notation in the paper (CENSORED = 1 implies a censored patient); a small recoding sketch is given after the table below. The data from TCGA can be found by accessing TCGA's firebrowse website, selecting a cancer cohort via the dropdown, and downloading the clinical features by selecting the blue bar and choosing the Clinical_Pick_Tier1.md5 download link. All details for the High-Dimensional datasets can be found in the paper "A Multi-Task Learning Formulation for Survival Analysis" (left column of page 6) by Li et al., and the dataset download links can be found on the authors' website. Supplemental dataset details and results can be found on RPubs, and the basic features and download links are given in the table below.
| | ACC | BLCA | CESC | CHOL | COAD | COADREAD | ESCA | FPPP | HNSC | KICH | KIPAN | KIRC | KIRP | LGG | LIHC | LUAD | LUSC | OV | PAAD | PRAD | SARC | SKCM | STAD | STES | THCA | UCEC | UCS | UVM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N | 92 | 409 | 307 | 45 | 456 | 626 | 185 | 38 | 526 | 112 | 939 | 537 | 290 | 513 | 376 | 513 | 498 | 576 | 185 | 499 | 261 | 460 | 436 | 621 | 503 | 546 | 57 | 80 |
| % Censored | 63.04 | 55.99 | 76.55 | 51.11 | 77.63 | 79.39 | 58.38 | 81.58 | 57.6 | 89.29 | 75.19 | 67.04 | 84.83 | 75.63 | 64.89 | 64.13 | 56.83 | 40.45 | 45.95 | 98 | 62.07 | 51.96 | 61.01 | 60.23 | 96.82 | 83.33 | 38.6 | 71.25 |
| Number of Features | 9 | 13 | 15 | 5 | 12 | 13 | 13 | 3 | 20 | 7 | 20 | 14 | 12 | 4 | 12 | 14 | 12 | 2 | 10 | 6 | 5 | 18 | 17 | 18 | 14 | 6 | 4 | 6 |
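As noted above, the PSSP download of the NACD data uses CENSORED = 1 for a censored patient, which is the reverse of the event indicator `delta` = 1 (event observed) used by the example code below. A minimal recoding sketch, where the file name `NACD.csv` is an assumption about your particular download:

```R
# Sketch only: "NACD.csv" is an assumed file name for the PSSP download.
nacd = read.csv("NACD.csv")

# PSSP marks censored patients with CENSORED = 1; flip it so that delta = 1 means an observed event.
nacd$delta = 1 - nacd$CENSORED
nacd$CENSORED = NULL
```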
A full tutorial of the code in this repository can be found on Humza Haider's RPubs site; however, we also give a brief example below. Before attempting to run any of the code, please make sure you have installed all the required packages: `caret`, `dataPreparation`, `ggplot2`, `reshape2`, `randomForestSRC`, `Rcpp`, `RcppArmadillo`, `prodlim`, `survival`, `fastcox`, `plyr`, and `dplyr` (an install snippet is sketched below).
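If any of these packages are missing, they can be installed from CRAN in a single call; a minimal sketch, assuming all of the packages above are available on CRAN for your version of R:

```R
# Install whichever of the required packages are not already present.
requiredPackages = c("caret", "dataPreparation", "ggplot2", "reshape2", "randomForestSRC",
                     "Rcpp", "RcppArmadillo", "prodlim", "survival", "fastcox", "plyr", "dplyr")
install.packages(setdiff(requiredPackages, rownames(installed.packages())))
```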
Once your working directory is set to `ISDEvaluation`, you can run the following:
```R
# First load all the code.
source('analysisMaster.R')

# We will use some example data from the "survival" library. For details on this dataset see 'help(lung)'.
survivalDataset = survival::lung

# If you run 'head(survivalDataset)' you will see that the time-to-event feature is already named 'time',
# but the censor bit is named 'status' and has values of 2 for dead and 1 for censored. We change this
# name to 'delta' and recode the values to 0 (censored) and 1 (dead) below.
names(survivalDataset)[3] = "delta"
survivalDataset$delta = survivalDataset$delta - 1

# Then we can run analysisMaster() to get the evaluation results for all the models.
ISD = analysisMaster(survivalDataset, numberOfFolds = 5)
```
`analysisMaster()` should print its progress as it runs (set `verbose = FALSE` to suppress this output). Once the model has finished running (a few minutes, to allow the different models to select their hyperparameters), `ISD` will be a list containing three items: (1) `datasetUsed` -- the survival dataset we passed in, post validation and feature selection; (2) `survivalCurves` -- a list of matrices where each column represents the survival probabilities of a patient (one matrix for every model); and (3) `results` -- the evaluation results for each model.
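For example, assuming the components are named exactly as above, you can inspect the returned object with standard R tools (a sketch, not part of the repository's tutorial):

```R
# Top-level structure of the returned list: datasetUsed, survivalCurves, results.
str(ISD, max.level = 1)

# Survival-curve matrix for one model (here MTLR); each column holds one patient's survival probabilities.
dim(ISD$survivalCurves$MTLR)

# Per-fold evaluation results for every model.
head(ISD$results)
```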
You can plot these survival curves using `plotSurvivalCurves()`; for example, to see the first 10 survival curves (for the first 10 rows of `ISD$datasetUsed`) of the Multi-task Logistic Regression (MTLR) model you can use the following:
```R
plotSurvivalCurves(ISD$survivalCurves$MTLR, 1:10)
```
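Since the repository depends on `ggplot2`, the returned plot can presumably be customized or saved like any other ggplot object; the sketch below assumes `plotSurvivalCurves()` does in fact return a ggplot:

```R
library(ggplot2)

# Assumption: plotSurvivalCurves() returns a ggplot object that ggsave() can write to disk.
mtlrCurves = plotSurvivalCurves(ISD$survivalCurves$MTLR, 1:10)
ggsave("mtlr_survival_curves.png", mtlrCurves, width = 7, height = 5)
```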
The `results` component is a dataframe giving each model's performance across the number of folds specified (here 5). Under the default settings these evaluations will be Concordance, Integrated Brier score, single-time Brier score evaluated at the median event time, Margin-L1-loss, D-Calibration, and 1-Calibration evaluated at the 10th, 25th, 50th, 75th, and 90th percentiles of event times (for 1-Calibration these correspond to the column names OneCalibration_1, OneCalibration_2, ..., OneCalibration_5, respectively). Note that while Kaplan-Meier reports values for 1-Calibration, these values are meaningless since Kaplan-Meier assigns equal probabilities to all patients.
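To compare models at a glance you can average each metric over the folds, for example with `dplyr`; this is a sketch, and the model-identifier column name (`Model` here) is an assumption about how `ISD$results` is labelled:

```R
library(dplyr)

# Average every numeric evaluation metric over the folds for each model.
# "Model" is an assumed column name; check names(ISD$results) for the actual labels.
ISD$results %>%
  group_by(Model) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```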
Please email Humza Haider at hshaider@ualberta.ca with any comments/questions you may have.