Michael Widrich (1), Bernhard Schäfl (1), Milena Pavlović (3, 4), Hubert Ramsauer (1), Lukas Gruber (1), Markus Holzleitner (1), Johannes Brandstetter (1), Geir Kjetil Sandve (4), Victor Greiff (3), Sepp Hochreiter (1, 2), Günter Klambauer (1)
(1) ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
(2) Institute of Advanced Research in Artificial Intelligence (IARAI)
(3) Department of Immunology, University of Oslo, Oslo, Norway
(4) Department of Informatics, University of Oslo, Oslo, Norway
Paper: https://arxiv.org/abs/2007.13505
Conda setup:
conda env create -f condal_install.yml --name deeprc_env
conda activate deeprc_env
Alternatively, you can install DeepRC via pip:
pip install --no-dependencies git+https://github.com/widmi/widis-lstm-tools
pip install git+https://github.com/ml-jku/DeepRC
To update your installation with dependencies, you can use:
pip install --no-dependencies git+https://github.com/widmi/widis-lstm-tools
pip install --upgrade git+https://github.com/ml-jku/DeepRC
To update your installation without dependencies, you can use:
pip install --no-dependencies git+https://github.com/widmi/widis-lstm-tools
pip install --no-dependencies --upgrade git+https://github.com/ml-jku/DeepRC
You can train a DeepRC model on the pre-defined datasets of the DeepRC paper
using one of the Python files in the folder deeprc/examples.
The datasets will be downloaded automatically.
To view the performance, run tensorboard --logdir [results_directory] --port=6060
and open http://localhost:6060/ in your web browser.
Training a binary DeepRC classifier on dataset "0" of category "real-world data with implanted signals". This category has the smallest dataset files and is a good starting point:
python3 -m deeprc.examples.simple_cmv_with_implanted_signals 0 --n_updates 10000 --evaluate_at 2000
To get more information, you can use the help function:
python3 -m deeprc.examples.simple_cmv_with_implanted_signals -h
Training a binary DeepRC classifier on dataset "0" of category "LSTM-generated data":
python3 -m deeprc.examples.simple_lstm_generated 0
Training a binary DeepRC classifier on dataset "0" of category "simulated immunosequencing data":
python3 -m deeprc.examples.simple_simulated 0
Warning: Filesize to download is ~20GB per dataset!
Training a binary DeepRC classifier on dataset "real-world data":
python3 -m deeprc.examples.simple_cmv
You can train DeepRC on custom text-based datasets,
which will be automatically converted to hdf5 containers.
Specifications of the supported formats are given here: deeprc/datasets/README.md
from deeprc.deeprc_binary.dataset_readers import make_dataloaders
from deeprc.deeprc_binary.architectures import DeepRC
from deeprc.deeprc_binary.training import train, evaluate
# Let's assume this is your dataset metadata file
metadatafile = 'custom_dataset/metadata.tsv'
# Get data loaders from text-based dataset (see `deeprc/datasets/README.md` for format)
trainingset, trainingset_eval, validationset_eval, testset_eval = make_dataloaders(
    metadatafile, target_label='status', true_class_label_value='+', id_column='ID',
    single_class_label_columns=('status',), sequence_column='amino_acid',
    sequence_counts_column='templates', column_sep='\t', filename_extension='.tsv')
# Train a DeepRC model
model = DeepRC(n_input_features=23, n_output_features=1, max_seq_len=30)
train(model, trainingset_dataloader=trainingset, trainingset_eval_dataloader=trainingset_eval,
      validationset_eval_dataloader=validationset_eval, results_directory='results')
# Evaluate on test set
roc_auc, bacc, f1, scoring_loss = evaluate(model=model, dataloader=testset_eval)
print(f"Test scores:\nroc_auc: {roc_auc:6.4f}; bacc: {bacc:6.4f}; f1:{f1:6.4f}; scoring_loss: {scoring_loss:6.4f}")
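For readers unfamiliar with the scores printed above, the following sketch shows how balanced accuracy (bacc) and F1 are conventionally defined for binary labels. It is an illustration only and not DeepRC's own evaluation code; the helper names `confusion_counts`, `balanced_accuracy`, and `f1_score` are hypothetical, and the predictions are assumed to be already thresholded to 0 or 1.

```python
# Illustrative definitions of two of the reported scores.
# Assumption: labels and predictions are 0/1 integers (already thresholded).

def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity (robust to class imbalance)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return 0.5 * (sensitivity + specificity)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy example: 3 positive and 5 negative repertoires
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(f"bacc: {balanced_accuracy(y_true, y_pred):6.4f}; f1: {f1_score(y_true, y_pred):6.4f}")
```

Balanced accuracy averages the per-class recall, so a classifier that always predicts the majority class scores 0.5 regardless of how imbalanced the dataset is.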
Note that make_dataloaders() will automatically create an hdf5 container of your dataset.
Later, you can simply load this hdf5 container instead of the text-based dataset:
from deeprc.deeprc_binary.dataset_readers import make_dataloaders
# Get data loaders from hdf5 container
trainingset, trainingset_eval, validationset_eval, testset_eval = make_dataloaders('dataset.hdf5')
To view the performance, run tensorboard --logdir [results_directory] --port=6060
and open http://localhost:6060/ in your web browser.
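To experiment with the custom-dataset workflow before preparing real data, a small synthetic dataset can be generated. The sketch below writes a metadata file with the columns `ID` and `status` and one tab-separated repertoire file per ID with the columns `amino_acid` and `templates`, matching the column names passed to make_dataloaders() above. It assumes that repertoire files are named `<ID>.tsv` and sit in the same directory as the metadata file; the authoritative layout is described in deeprc/datasets/README.md. The function `write_toy_dataset` is a hypothetical helper, and the random sequences carry no biological signal.

```python
import csv
import random
import tempfile
from pathlib import Path

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def write_toy_dataset(root, n_repertoires=4, n_sequences=5, seed=0):
    """Write a toy metadata file plus one repertoire file per ID (assumed layout)."""
    rng = random.Random(seed)
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    rows = []
    for i in range(n_repertoires):
        rep_id = f"rep_{i}"
        # Alternate class labels; '+' is the positive class in the example above
        rows.append({"ID": rep_id, "status": "+" if i % 2 == 0 else "-"})
        with open(root / f"{rep_id}.tsv", "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            writer.writerow(["amino_acid", "templates"])
            for _ in range(n_sequences):
                length = rng.randint(8, 20)  # stays below max_seq_len=30
                sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
                writer.writerow([sequence, rng.randint(1, 10)])
    metadata_path = root / "metadata.tsv"
    with open(metadata_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ID", "status"], delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return metadata_path


# Write the toy dataset to a temporary directory
demo_dir = Path(tempfile.mkdtemp())
metadata_file = write_toy_dataset(demo_dir)
print(f"Toy metadata file written to {metadata_file}")
```

The returned path could then be passed as `metadatafile` in the example above. A dataset this small is only useful for checking that the pipeline runs, not for obtaining meaningful scores.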
deeprc
|--datasets : stores datasets
| |--README.md : Information on supported dataset formats
|--deeprc_binary : DeepRC implementation for binary classification
| |--architectures.py : DeepRC network architecture
| |--dataset_converters.py : Converter for text-based datasets
| |--dataset_readers.py : Tools for reading datasets
| |--predefined_datasets.py : Pre-defined datasets from paper
| |--training.py : Tools for training DeepRC model
|--examples : DeepRC examples
We are currently cleaning up and uploading the code for the paper. Baseline methods, contribution analysis, LSTM embedding, and other features will follow soon.
- Python 3.6.9 or higher
- Python packages:
  - PyTorch (tested with version 1.3.1)
  - numpy (tested with version 1.18.2)
  - h5py (tested with version 2.9.0)
  - dill (tested with version 0.3.0)
  - pandas (tested with version 0.24.2)
  - tqdm (tested with version 0.24.2)
  - scikit-learn (tested with version 0.22.2.post1)
  - requests (tested with version 2.21.0)
  - tensorboard (tested with version 1.14.0)
  - widis-lstm-tools (tested with version 0.4)
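To compare an existing environment against the tested versions listed above, a short check like the following could be used. This is a sketch under assumptions: it relies on `importlib.metadata`, which needs Python 3.8 or newer (later than the minimum stated above), and the distribution names in `TESTED_VERSIONS` (for example `torch` for PyTorch and `widis-lstm-tools`) are assumed to match what pip reports. The helper names are hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8

# Versions copied from the requirements list; distribution names are assumptions
TESTED_VERSIONS = {
    "torch": "1.3.1",
    "numpy": "1.18.2",
    "h5py": "2.9.0",
    "dill": "0.3.0",
    "pandas": "0.24.2",
    "tqdm": "0.24.2",
    "scikit-learn": "0.22.2.post1",
    "requests": "2.21.0",
    "tensorboard": "1.14.0",
    "widis-lstm-tools": "0.4",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(tested, lookup=installed_version):
    """Map each package to (found_version, status).

    status is 'missing', 'tested' (exact match), or 'untested' (other version).
    """
    report = {}
    for name, tested_version in tested.items():
        found = lookup(name)
        if found is None:
            status = "missing"
        elif found == tested_version:
            status = "tested"
        else:
            status = "untested"
        report[name] = (found, status)
    return report


for name, (found, status) in compare_versions(TESTED_VERSIONS).items():
    print(f"{name:18s} installed: {str(found):14s} "
          f"tested: {TESTED_VERSIONS[name]:14s} [{status}]")
```

A status of "untested" only means the installed version differs from the one the authors tested with; newer versions may still work.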