This repository contains the code and the report for the master's thesis:
Predicting venous thromboembolic events in patients with cancer using a new machine learning paradigm
written by Pablo Álvarez, under the supervision of Oriol Pujol Vila (UB) and José Manuel Soria (Hospital de Sant Pau), submitted to the Facultat de Matemàtiques i Informàtica of the Universitat de Barcelona.
The rise of machine learning over the last decade has enabled major advances in fields such as medicine, where powerful models can now predict certain medical conditions with unprecedented accuracy.
This work focuses on predicting one of the leading causes of death among patients with cancer: venous thromboembolic events (VTE). Over the years, several statistical models based on clinical and genetic data have been developed, leading to risk assessment tools such as the Khorana score. However, none of them is based on machine learning.
We therefore propose a new model that uses advanced machine learning techniques and outperforms all currently available models. Furthermore, the model is based on a very recent and promising learning paradigm that has barely been tested, which gives us a valuable opportunity to explore and evaluate it.
This advance ultimately affects patients' quality of life by making it easier to identify those at high risk of developing a VTE, who would benefit from preventive treatment.
- The notebook `data_preprocessing.ipynb` is used to create the dataset we used for the experiments, which is not included here as it contains private data. It also includes a simple analysis of the data.
Learning Using Statistical Invariants (LUSI)
This work explores a new learning paradigm based on statistical invariants that act as a teacher during learning.
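As a rough illustration of what a statistical invariant looks like in the LUSI literature: for a chosen predicate ψ, the invariant asks that the training-set average of ψ(x)·f(x) match the average of ψ(x)·y, where f(x) is the predicted probability and y the label. The sketch below only measures how far a set of predictions is from satisfying such invariants. It is not the SVM&I algorithm from `lusi.py`; the function name `invariant_residuals`, the toy data, and the two predicates are assumptions made for illustration.

```python
import numpy as np


def invariant_residuals(predicates, probs, labels):
    """Return mean(psi * probs) - mean(psi * labels) for each predicate.

    predicates: array of shape (m, n), row s holds psi_s evaluated on the n samples.
    probs:      array of shape (n,), predicted probabilities f(x_i).
    labels:     array of shape (n,), binary labels y_i.
    A residual of 0 means the corresponding invariant holds exactly.
    """
    predicates = np.asarray(predicates, dtype=float)
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return predicates @ (probs - labels) / labels.size


# Toy data (hypothetical, not taken from the thesis dataset).
labels = np.array([1, 0, 1, 0])
probs = np.array([0.9, 0.2, 0.6, 0.3])
x = np.array([0.5, -1.0, 2.0, 0.0])

# Two simple predicates: psi_0(x) = 1 and psi_1(x) = x.
predicates = np.stack([np.ones_like(x), x])

residuals = invariant_residuals(predicates, probs, labels)
print(residuals)
```

With the constant predicate, the residual compares the mean predicted probability with the observed event rate; with ψ(x) = x, it compares the feature-weighted averages.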
- The notebook `SVM_I.ipynb` was designed for the experiments performed with the LUSI approach, in order to test the SVM&I algorithm implemented in `lusi.py`.
- The notebook `validating_results.ipynb` contains the validation of the results reported in the reference paper for the Khorana and TiC-Onco risk scores, using our own methodology.
- The notebook `improving_TiC_Onco_score.ipynb` collects all the experiments performed with machine learning models (including SVM_I), along with all the results obtained, so they can be reviewed if necessary.
We developed a model based on the LUSI paradigm that improves the TiC-Onco risk score results:
Metric | Baseline | Ours |
---|---|---|
AUC | 0.68 | 0.71 |
Accuracy | 0.71 | 0.74 |
Sensitivity | 0.34 | 0.49 |
Specificity | 0.80 | 0.80 |
PPV | 0.27 | 0.34 |
NPV | 0.84 | 0.87 |
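
For readers unfamiliar with the threshold-dependent metrics in the table, the following minimal sketch shows how accuracy, sensitivity, specificity, PPV, and NPV are derived from a binary confusion matrix. The function name `binary_metrics` and the counts are hypothetical and are not the thesis data. AUC is omitted because it is computed from the ranking of predicted scores, not from a single confusion matrix.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts chosen only to illustrate the formulas.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and NPV are the most relevant figures for identifying high-risk patients, since they reflect how many true VTE cases are caught and how reliable a low-risk classification is.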