DeepRC

DeepRC: Immune repertoire classification with attention-based deep massive multiple instance learning


Modern Hopfield Networks and Attention for Immune Repertoire Classification

Michael Widrich (1), Bernhard Schäfl (1), Milena Pavlović (3,4), Hubert Ramsauer (1), Lukas Gruber (1), Markus Holzleitner (1), Johannes Brandstetter (1), Geir Kjetil Sandve (4), Victor Greiff (3), Sepp Hochreiter (1,2), Günter Klambauer (1)

(1) ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
(2) Institute of Advanced Research in Artificial Intelligence (IARAI)
(3) Department of Immunology, University of Oslo, Oslo, Norway
(4) Department of Informatics, University of Oslo, Oslo, Norway

Paper: https://arxiv.org/abs/2007.13505

Quickstart

conda

Conda setup:

conda env create -f conda_install.yml --name deeprc_env
conda activate deeprc_env

pip

Alternatively, you can install via pip:

pip install --no-deps git+https://github.com/widmi/widis-lstm-tools
pip install git+https://github.com/ml-jku/DeepRC

To update your installation with dependencies, you can use:

pip install --no-deps git+https://github.com/widmi/widis-lstm-tools
pip install --upgrade git+https://github.com/ml-jku/DeepRC

To update your installation without dependencies, you can use:

pip install --no-deps git+https://github.com/widmi/widis-lstm-tools
pip install --no-deps --upgrade git+https://github.com/ml-jku/DeepRC

Usage

Training DeepRC on pre-defined datasets

You can train a DeepRC model on the pre-defined datasets of the DeepRC paper using one of the Python files in folder deeprc/examples. The datasets will be downloaded automatically.

You can use tensorboard --logdir [results_directory] --port=6060 and open http://localhost:6060/ in your web-browser to view the performance.

Real-world data with implanted signals

This category has the smallest dataset files and is a good starting point. Training a binary DeepRC classifier on dataset "0" of category "real-world data with implanted signals":

python3 -m deeprc.examples.simple_cmv_with_implanted_signals 0 --n_updates 10000 --evaluate_at 2000

To get more information, you can use the help function:

python3 -m deeprc.examples.simple_cmv_with_implanted_signals -h
LSTM-generated data

Training a binary DeepRC classifier on dataset "0" of category "LSTM-generated data":

python3 -m deeprc.examples.simple_lstm_generated 0
Simulated immunosequencing data

Training a binary DeepRC classifier on dataset "0" of category "simulated immunosequencing data":

python3 -m deeprc.examples.simple_simulated 0

Warning: Filesize to download is ~20GB per dataset!

Real-world data

Training a binary DeepRC classifier on the dataset of category "real-world data":

python3 -m deeprc.examples.simple_cmv

Training DeepRC on a custom dataset

You can train DeepRC on custom text-based datasets, which will be automatically converted to hdf5 containers. Specifications of the supported formats are given here: deeprc/datasets/README.md
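For illustration, a minimal custom dataset might look like the following sketch. The file names and sequence values are hypothetical; the column names ('ID', 'status', 'amino_acid', 'templates') are the ones passed to make_dataloaders() in this README, and the authoritative format is specified in deeprc/datasets/README.md.

```python
# Hypothetical sketch of the text-based input files (tab-separated).
# Column names match the make_dataloaders() arguments used in this README;
# see deeprc/datasets/README.md for the authoritative specification.
import csv
import io

# Metadata file: one row per repertoire, with the 'status' target label
metadata_tsv = "ID\tstatus\nrepertoire_001\t+\nrepertoire_002\t-\n"

# One repertoire file: amino acid sequences and their counts ('templates')
repertoire_tsv = "amino_acid\ttemplates\nCASSLGTDTQYF\t5\nCASSIRSSYEQYF\t2\n"

# Sanity-check the tab-separated layout with Python's csv module
metadata_rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
sequence_rows = list(csv.DictReader(io.StringIO(repertoire_tsv), delimiter="\t"))
print(metadata_rows[0]["status"], sequence_rows[0]["amino_acid"])
```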

from deeprc.deeprc_binary.dataset_readers import make_dataloaders
from deeprc.deeprc_binary.architectures import DeepRC
from deeprc.deeprc_binary.training import train, evaluate

# Let's assume this is your dataset metadata file
metadatafile = 'custom_dataset/metadata.tsv'

# Get data loaders from text-based dataset (see `deeprc/datasets/README.md` for format)
trainingset, trainingset_eval, validationset_eval, testset_eval = make_dataloaders(
    metadatafile, target_label='status', true_class_label_value='+', id_column='ID', 
    single_class_label_columns=('status',), sequence_column='amino_acid',
    sequence_counts_column='templates', column_sep='\t', filename_extension='.tsv')

# Train a DeepRC model
model = DeepRC(n_input_features=23, n_output_features=1, max_seq_len=30)
train(model, trainingset_dataloader=trainingset, trainingset_eval_dataloader=trainingset_eval,
      validationset_eval_dataloader=validationset_eval, results_directory='results')

# Evaluate on test set
roc_auc, bacc, f1, scoring_loss = evaluate(model=model, dataloader=testset_eval)

print(f"Test scores:\nroc_auc: {roc_auc:6.4f}; bacc: {bacc:6.4f}; f1: {f1:6.4f}; scoring_loss: {scoring_loss:6.4f}")
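For reference, the bacc (balanced accuracy) and f1 values returned by evaluate() can be reproduced from binary predictions as in the following sketch. The helper below is illustrative only and not part of the DeepRC API.

```python
# Illustrative computation of balanced accuracy (bacc) and F1 score from
# binary labels and predictions -- not part of DeepRC, just to clarify the
# metrics returned by evaluate(). Assumes both classes appear in y_true and
# at least one positive prediction exists (no zero-division handling).
def bacc_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)      # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    bacc = (sensitivity + specificity) / 2
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return bacc, f1

bacc, f1 = bacc_f1([1, 1, 0, 0], [1, 0, 0, 0])
print(bacc, f1)  # 0.75 0.666...
```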

Note that make_dataloaders() will automatically create an hdf5 container of your dataset. Later, you can simply load this hdf5 container instead of the text-based dataset:

from deeprc.deeprc_binary.dataset_readers import make_dataloaders
# Get data loaders from hdf5 container
trainingset, trainingset_eval, validationset_eval, testset_eval = make_dataloaders('dataset.hdf5')

You can use tensorboard --logdir [results_directory] --port=6060 and open http://localhost:6060/ in your web-browser to view the performance.

Structure

deeprc
      |--datasets : stores datasets
      |   |--README.md : Information on supported dataset formats
      |--deeprc_binary : DeepRC implementation for binary classification
      |   |--architectures.py : DeepRC network architecture
      |   |--dataset_converters.py : Converter for text-based datasets
      |   |--dataset_readers.py : Tools for reading datasets
      |   |--predefined_datasets.py : Pre-defined datasets from paper
      |   |--training.py : Tools for training DeepRC model
      |--examples : DeepRC examples

Note

We are currently cleaning up and uploading the code for the paper. Baseline methods, contribution analysis, LSTM embedding, and other features will follow soon.

Requirements