deep-multipit

This repository proposes end-to-end strategies for multimodal binary classification with attention weights. It gathers several PyTorch models as well as the scripts to reproduce the experiments from Vanguri et al, 2022 and from our study:

"Integration of clinical, pathological, radiological, and transcriptomic data improves the prediction of first-line immunotherapy outcome in metastatic non-small cell lung cancer"

Preprint: https://doi.org/10.1101/2024.06.27.24309583

Deep-multipit project

Installation

Dependencies

joblib (>= 1.2.0)
lifelines (>= 0.27.4)
numpy (>= 1.21.5)
pandas (= 1.5.3)
pyyaml (>= 6.0)
PyTorch (>= 2.0.1)
scikit-learn (>= 1.2.0)
scikit-survival (>= 0.21.0)
statsmodels (>= 0.13.5)
Tensorboard (>= 2.8.0, optional)
tqdm (>= 4.63.0)

Install from source

Clone the repository:

git clone https://github.com/sysbio-curie/deep-multipit

Multipit repository

This repository is paired with the multipit repository which provides a set of Python tools, compatible with scikit-learn, to perform multimodal late and early fusion on tabular data as well as scripts to reproduce the experiments from our study.

Run scripts

The code in this repository is run through Python scripts in scripts directory. You can either run existing scripts (e.g. train.py, test.py, or cross_validation.py) or create your own scripts, using all the tools available in dmultipit package (you can use and update the auxiliary functions available in [_utils.py]).

Using config files

Scripts are run with command lines, with .yaml configuration files. Configuration are divided into two files: a config file to define the multimodal model to use, located in config_architecture folder, and a config file for all the settings of the experiment we want to run (e.g., training, test, or cross-validation), located in config_experiment folder.

Run train script

python train.py -c config_architecture/config_model.yaml -e config_experiment/config_train.yaml

You can also resume training from a specific checkpoint using:

python train.py -r path_to_checkpoint -e config_experiment/config_train.yaml

Run test script

python test.py -r path_to_checkpoint -e config_experiment/config_test.yaml

Run cross-validation script

python cross_validation.py -c config_architecture/config_model.yaml -e config_experiment/config_cv.yaml

Run MSKCC scripts

We provide all the code and data to implement the DyAM model and reproduce the experiments from Vanguri et al, 2022.

You first need to unzip the compressed folder MSKCC.zip in the data directory. These raw data were extracted from synapse.

You can then run the scripts with the following command line:

python cross_validation.py -c config_architecture\config_late_MSKCC.yaml -e config_experiment\config_cv_MSKCC.yaml

Customization

The simplest and safest way to customize this project is by creating new Python scripts in the scripts directory and/or changing the settings in configurtion files.

For more complex customization you can also update the code in dmultipit package.

Load your own data

Define your own loading function If your data differ from the MSKCC dataset or the TIPIT dataset described in our study you can create your own loading function in dmultipit/dataset/loader.py which should take as input the pathes to the different unimodal data files and return the loaded data as well as the target to predict (see load_TIPIT_multimoda or load_MSKCC_multimoda).

Define your own multimodal dataset You can also create your own Dataset in dmultipit/dataset/dataset.py:

Inheritating from base.base_dataset.MultiModalDataset
Implementing abstract methods fit_process and fit_multimodal_process
If needed, implementing new transformers for pre-processing in dmultipit/dataset/transformers.py , inheritating from base.base_transformers.UnimodalTransformer or base.base_transformers.MultimodalTransformer

Customize multimodal models

You can implement new multimodal models in dmultipit/model/model.py, inheritating from base.base_model.BaseModel, and using elements (potentialy new ones) from dmultipit/model/attentions.py and dmultipit/model/embeddings.py.

You can also implement our own loss or perfomance metrics in dmultipit/model/loss.py and dmultipit/model/metric.py respectively.

Customize training and testing

Trainer and Testing classes can be updated in dmultipit/trainer/trainer.py and dmultipit/testing/testing.py respectively.

Results folder architecture

By default, the results of the diffent experiments will be saved with the following folder architecture:

  results_folder/
  │
  ├── train/ - results from train.py script
  │   ├── log/ - logdir for tensorboard and logging output
  │   │    └── model_name/
  │   │             └── run_id/ - training run id
  │   └── models/ - trained models are saved here
  │        └── model_name/
  │                 └── run_id/ - training run id
  │
  ├── test/ - results from test.py script
  │   ├── log/ - logdir for logging output
  │   │    └── model_name/
  │   │             └── run_id/ - training run id associated with the model to test
  │   │                     └── exp_id/ - testing experiment id
  │   └── saved/ - prediction results, embeddings, and attention weights are saved here
  │        └── model_name/
  │                 └── run_id/ - training run id associated with the model to test
  │                         └── exp_id/ - testing experiment id
  │
  └── cross-val/ - results from cross_validation.py script
      ├── log/ - logdir for tensorboard and logging output
      │    └── model_name/
      │             └── exp_id/ - cv experiment id
      └── models/ - trained models are saved here
      │    └── model_name/
      │             └── exp_id/ - cv experiment id
      └── saved/ - cross-validation results (i.e., predictions for the different folds and repeats) are saved here
           └── model_name/
                    └── exp_id/ - cv experiment id

The name and location of the results_folder needs to be specified in the config_experiment.yaml file with the save_dir parameter.
The model name needs to be specified in the config_architecture.yaml file with the name parameter.
The training run id needs to be specified in the config_experiment.yaml file with the run_ind parameter (timestamp will be used if no id is specified).
The experiment id needs to be specified in the config_experiment.yaml file with the exp_id parameter (timestamp will be used if no id is specified).

Note: For modifying this default architecture you can update the initialization of the ConfigParser class in dmultipit/parse_config.py

Tensorboard visualization

Tensorboard visulaization is available with this project:

Make sure to install Tensorboard first (either using pip install tensorboard or TensorboardX).
Make sure that tensorboard setting is turned on in your config_experiment.yaml file before running your training script (i.e., tensorboard: true).
Run tensorboard --logdir path_to_logdir and server will open at http://localhost:6006.

Note: Tensorboard visualization is also available for cross-validation experiments, but we do not recommend to turn it on in case of multiple repeats of the cv scheme (i.e., n_repeats > 1 in config_experiment.yaml file).

Acknowledgements

This repository was created as part of the PhD project of Nicolas Captier in the Computational Systems Biology of Cancer group and the Laboratory of Translational Imaging in Oncology (LITO) of Institut Curie.

This repository was inspired by the Pytorch Template Project by Victor Huang.