Overall Survival of Clinical Trials: Classification models to predict early trial termination
Clinical trials can end early for a variety of reasons, such as low accrual, interim analysis suggesting the intervention has low efficacy, adverse events, and loss of funding or interest. I wanted to see whether I could build a classification model to predict whether a trial would be terminated early or run to completion.
Clinical trials in the United States are required to be reported to clinicaltrials.gov; however, many still go unreported or are missing information. I used the clinicaltrials.gov API to collect data on study design, outcome measures, eligibility, investigators/sponsor and study locations for over 18,000 interventional cancer trials designated as 'Terminated' or 'Completed'. This data was stored in a PostgreSQL database.
I one-hot encoded the categorical data fields and engineered several new features using regex and text extraction from the free-text fields, resulting in 400 total features.
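As a rough illustration of the two feature-engineering steps described above, here is a minimal sketch using only the Python standard library. The records, field names (`phase`, `eligibility`) and regex patterns are hypothetical and are not the ones used in the project; in practice `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would likely handle the encoding.

```python
import re

# Hypothetical trial records; field names and values are illustrative only.
trials = [
    {"phase": "Phase 2",
     "eligibility": "Inclusion: age >= 18 years. ECOG 0-1. Exclusion: pregnant women."},
    {"phase": "Phase 3",
     "eligibility": "Patients aged 21 to 75 years with measurable disease."},
    {"phase": "Phase 2",
     "eligibility": "No prior chemotherapy."},
]


def one_hot(records, field):
    """Build one 0/1 column per distinct value of a categorical field."""
    categories = sorted({r[field] for r in records})
    return [
        {f"{field}={c}": int(r[field] == c) for c in categories}
        for r in records
    ]


def text_features(text):
    """Derive simple flags and counts from a free-text field with regex."""
    return {
        "mentions_age": int(bool(re.search(r"\b\d{1,3}\s*years\b", text))),
        "mentions_ecog": int(bool(re.search(r"\bECOG\b", text, re.IGNORECASE))),
        "mentions_pregnancy": int(bool(re.search(r"pregnan", text, re.IGNORECASE))),
        "n_words": len(text.split()),
    }


# Combine the encoded categorical columns with the text-derived columns.
encoded = one_hot(trials, "phase")
rows = [
    {**enc, **text_features(t["eligibility"])}
    for enc, t in zip(encoded, trials)
]

for row in rows:
    print(row)
```

Each resulting row is a flat dictionary of numeric features, which is the shape a model needs once all fields have been processed this way.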
Model optimization was performed using:
- scikit-learn
- imblearn
- xgboost
Models tested:
- kNN
- Logistic Regression
- SVC
- Naive Bayes
- Random Forest
- XGBoost
- Ensembled models
I used scikit-learn's StandardScaler to standardize the data and kNN imputation to fill in missing values for some features. Only about 1/3 of the trials in the dataset were 'Terminated', causing a class imbalance, so I used either ADASYN oversampling or balanced class weights where the model supported them. Models were optimized with grid search, and most reached similar scores for calling the "Terminated" class: F1 of ~0.4 and AUC of ~0.65. I achieved mild class separation and a recall of 60-70% for "Terminated" trials. Ensembling only improved scores for a combination of kNN and Logistic Regression. Overall, XGBoost performed the best.
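The preprocessing and tuning steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch under stated assumptions, not the project's actual code: the data is synthetic (generated with `make_classification` at roughly a 2:1 class ratio), the step order (scale, then impute) is my assumption, the hyperparameter grid is arbitrary, and logistic regression with balanced class weights stands in for the full set of models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the trial data: about 1/3 positives ("Terminated").
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.67, 0.33], random_state=0,
)

# Blank out about 5% of the values to mimic missing fields.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed order: scale first (StandardScaler ignores NaNs when fitting and
# passes them through), so kNN imputation measures distances on comparable scales.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer(n_neighbors=5)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Arbitrary illustrative grid, scored on F1 for the positive class.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
proba = search.predict_proba(X_test)[:, 1]
scores = {
    "f1": f1_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "auc": roc_auc_score(y_test, proba),
}
print(search.best_params_, scores)
```

Keeping the scaler and imputer inside the pipeline means they are refit on each cross-validation fold, so no information from held-out data leaks into preprocessing. For models without a `class_weight` option, imblearn's own `Pipeline` can hold an oversampling step such as ADASYN in the same position.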
I made a Streamlit app that lets users interact directly with the logistic regression model and see how almost all of the features affect the predictions for trial termination.
The final project presentation is below: