Cleavage Prediction Benchmark

This repository contains the code and dataset for the paper "Proteasomal cleavage prediction: state-of-the-art and future directions".

Abstract

Epitope vaccines are a promising approach for precision treatment of pathogens, cancer, autoimmune diseases, and allergies. Effectively designing such vaccines requires accurate proteasomal cleavage prediction to ensure that the epitopes included in the vaccine trigger an immune response. The performance of proteasomal cleavage predictors has been steadily improving over the past decades owing to increasing data availability and methodological advances. In this review, we summarize the current proteasomal cleavage prediction landscape and, in light of recent progress in the field of deep learning, develop and compare a wide range of recent architectures and techniques, including long short-term memory (LSTM), transformers, and convolutional neural networks (CNN), as well as four different denoising techniques. All open-source cleavage predictors re-trained on our dataset performed within two AUC percentage points. Our comprehensive deep learning architecture benchmark improved performance by 1.7 AUC percentage points, while closed-source predictors performed considerably worse. We found that a wide range of architectures and training regimes all result in very similar performance, suggesting that the specific modeling approach employed has a limited impact on predictive performance compared to the specifics of the dataset employed. We speculate that the noise and implicit nature of data acquisition techniques used for training proteasomal cleavage prediction models and the complexity of biological processes of the antigen processing pathway are the major limiting factors. While biological complexity can be tackled by more data and, to a lesser extent, better models, noise and randomness inherently limit the maximum achievable predictive performance.

Repository Structure

  • data/ holds the .csv and .tsv train, evaluation, and test files, as well as a vocabulary file
  • params/ holds vocabulary and tokenization merges files
  • code/ contains all of the preparation, training, evaluation, and runtime configuration files
    • run_configs/ contains all possible argparse configs to execute training
    • args.py holds all argparse options
    • denoise.py implements the tested denoising methods
    • loaders.py defines the dataloaders for all subsequent training architectures
    • models.py implements all tested model architectures
    • prep_dataset.py shows how we split and prepared the raw data
    • processors.py implements the training loops for all architecture and denoising variants
    • run_train.py is the overall training script that takes the argparse options and executes training and evaluation
    • train_tokenizers.py is used to create the vocab and merges files under params/
    • utils.py features utility functions such as masking

Naming structure of runtime configs

  • All config files are named as follows: the applicable terminal, i.e. c or n, followed by the model architecture, e.g. bilstm, followed by the denoising method, e.g. coteaching
  • Example: c_bilstm_coteaching.cfg
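Because some architecture and denoising names contain underscores themselves (for example bilstm_att or coteaching_plus), a config name cannot be split on underscores alone. The following is a minimal sketch of how such a name could be taken apart, using a subset of the names listed in this README. The function parse_config_name is hypothetical and not part of the repository, and whether configs without a denoising suffix exist is an assumption.

```python
# Names copied from the lists in this README (not exhaustive); the parsing
# logic itself is an illustrative assumption, not code from the repository.
TERMINALS = ("c", "n")
ARCHITECTURES = ("bilstm", "bilstm_att", "bilstm_prot2vec", "cnn",
                 "bilstm_esm2", "esm2", "bilstm_t5", "bilstm_fwbw")
DENOISERS = ("coteaching", "coteaching_plus", "jocor", "nad", "dividemix")


def parse_config_name(filename):
    """Split a run config file name into (terminal, architecture, denoiser)."""
    stem = filename[:-len(".cfg")] if filename.endswith(".cfg") else filename
    terminal, _, rest = stem.partition("_")
    if terminal not in TERMINALS:
        raise ValueError(f"unknown terminal in {filename!r}")
    # Check longer architecture names first so that "bilstm_att" is not
    # mistaken for "bilstm" followed by a denoiser called "att".
    for arch in sorted(ARCHITECTURES, key=len, reverse=True):
        if rest == arch:
            return terminal, arch, None  # assumption: no denoising suffix
        if rest.startswith(arch + "_") and rest[len(arch) + 1:] in DENOISERS:
            return terminal, arch, rest[len(arch) + 1:]
    raise ValueError(f"cannot parse {filename!r}")


print(parse_config_name("c_bilstm_coteaching.cfg"))
# Hypothetical file name, used only to show the underscore handling:
print(parse_config_name("n_bilstm_att_coteaching_plus.cfg"))
```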

Available model architectures

  • BiLSTM, called bilstm
  • BiLSTM with Attention, called bilstm_att
  • BiLSTM with pre-trained Prot2Vec embeddings, called bilstm_prot2vec
  • Attention enhanced CNN, called cnn
  • BiLSTM with ESM2 representations as embeddings, called bilstm_esm2
  • Fine-tuning of ESM2, called esm2
  • BiLSTM with T5 representations as embeddings, called bilstm_t5
  • Base BiLSTM with various trained tokenizers
    • Byte-level byte-pair encoder with vocabulary sizes 1000 and 50000, called bilstm_bbpe1 and bilstm_bbpe50
    • WordPiece tokenizer with vocabulary size 50000, called bilstm_wp50
  • BiLSTM with forward-backward representations as embeddings, called bilstm_fwbw

Available denoising architectures

  • Co-Teaching, called coteaching
  • Co-Teaching+, called coteaching_plus
  • JoCoR, called jocor
  • Noise Adaptation Layer, called nad
  • DivideMix, called dividemix

Achieved performances

Results of our new architectures benchmarked against each other, including denoising methods

[Figure: Performance comparison of all models and denoising architectures for C- and N-terminal]

Ablation analysis results of our best method, the BiLSTM

[Figure: Ablation study results]

Comparison of our best method, the BiLSTM, to other published methods (in % AUC)

Method                  C-Terminal  N-Terminal
PCPS                    51.3        50.0
PCM                     64.5        52.4
NetChop 3.1 (20S)       66.1        52.7
NetChop 3.1 (C-term)    81.5        51.0
SVM*                    84.8        73.2
PCM*                    85.3        75.5
Logistic Regression*    86.2        76.2
NetCleave*              86.9        76.4
PUUPL*                  87.2        78.0
Pepsickle*              88.1        78.9
Our BiLSTM (6+4)        89.8        80.6
Our BiLSTM (28+28)      92.8        89.4

* Method has been re-trained from scratch on our dataset.

For the other methods, we used published pre-trained models (NetChop, PCM) or web-server functionality (PCPS).
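For readers unfamiliar with the metric, the AUC values in the table can be read as the probability (in %) that a randomly chosen cleavage site receives a higher score than a randomly chosen non-cleavage site. The sketch below computes this pairwise definition on made-up toy data. It is not the evaluation code of this repository, and the function name roc_auc is chosen here for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count how often a positive sample is ranked above a negative one.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): 1 = cleavage site, 0 = no cleavage site.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"{100 * toy_auc:.1f} % AUC")
```

The quadratic pairwise loop is only meant to make the definition explicit; for large test sets a rank-based implementation from an established library would be the usual choice.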

Sources for model architectures and denoising approaches

LSTM Architecture

LSTM Attention Architecture

CNN Architecture

Prot2Vec Embeddings

FwBw Architecture

MLP Architecture

T5 Architecture

ESM2 Architecture

Noise Adaptation Layer

Co-teaching

  • Co-teaching loss function and training process adaptations are based on Han et al., 2018, and the official implementation on GitHub
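The core idea of Co-teaching (Han et al., 2018) is that two networks each pick the small-loss samples of a mini-batch and hand them to their peer for the parameter update. The sketch below shows only this selection and exchange step on plain lists of per-sample losses. The function names and the toy losses are assumptions for illustration; the actual implementation lives in denoise.py and processors.py.

```python
def small_loss_indices(losses, forget_rate):
    """Indices of the (1 - forget_rate) fraction of samples with the smallest loss."""
    num_keep = int((1.0 - forget_rate) * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:num_keep]


def coteaching_exchange(losses_a, losses_b, forget_rate):
    """Return (indices to update network A on, indices to update network B on).

    Each network is updated on the samples its peer considers clean,
    i.e. the peer's small-loss samples.
    """
    keep_a = small_loss_indices(losses_a, forget_rate)
    keep_b = small_loss_indices(losses_b, forget_rate)
    return keep_b, keep_a


# Toy per-sample losses of two networks on the same mini-batch (invented).
losses_a = [0.1, 2.3, 0.4, 0.2]
losses_b = [0.3, 0.2, 1.9, 0.1]
update_a, update_b = coteaching_exchange(losses_a, losses_b, forget_rate=0.5)
print(update_a, update_b)
```

In the original method the forget rate is increased gradually over the first epochs; that schedule is omitted here.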

Co-teaching+

  • Co-teaching+ loss function and training process adaptations are based on Yu et al., 2019, and the official implementation on GitHub

JoCoR

  • JoCoR loss function and training process adaptations are based on Wei et al., 2020, and the official implementation on GitHub
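JoCoR (Wei et al., 2020) trains two networks with one joint loss: the sum of both supervised losses plus a symmetric KL term that rewards agreement, weighted by a factor lambda. Small-loss samples under this joint loss are then used to update both networks. The following is a minimal sketch for the binary case with scalar probabilities strictly between 0 and 1. All function names and toy values are assumptions, not the repository's code.

```python
import math


def bce(p, y):
    """Binary cross-entropy for a predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def kl_bernoulli(p, q):
    """KL divergence between two Bernoulli distributions with parameters p and q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def jocor_sample_loss(p1, p2, y, co_lambda):
    """Joint loss: (1 - lambda) * supervised part + lambda * agreement part."""
    supervised = bce(p1, y) + bce(p2, y)
    agreement = kl_bernoulli(p1, p2) + kl_bernoulli(p2, p1)
    return (1 - co_lambda) * supervised + co_lambda * agreement


def jocor_select(probs1, probs2, labels, co_lambda, forget_rate):
    """Indices of the small-joint-loss samples that both networks are updated on."""
    losses = [jocor_sample_loss(p1, p2, y, co_lambda)
              for p1, p2, y in zip(probs1, probs2, labels)]
    num_keep = int((1.0 - forget_rate) * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:num_keep]


# Toy predictions of two networks for four samples labeled 1 (invented).
selected = jocor_select([0.9, 0.2, 0.8, 0.6], [0.8, 0.9, 0.7, 0.4],
                        [1, 1, 1, 1], co_lambda=0.5, forget_rate=0.5)
print(selected)
```

Sample 1, where the two networks disagree strongly, is dropped even though one network fits it well, which is the behavior the agreement term is meant to produce.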

DivideMix

  • DivideMix structure is based on Li et al., 2020 (GitHub)
  • As DivideMix was originally implemented for image data, we adjusted the MixMatch and Mixup parts for sequential data, based on Guo et al., 2019
    • This adaptation is implemented directly in the respective forward passes in the notebooks, so it does not appear in the DivideMix section
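To illustrate what adapting Mixup to sequences can look like, the sketch below interpolates two embedded sequences of equal shape together with their targets, using a Beta-distributed mixing weight as in Mixup and the max(lambda, 1 - lambda) step used by DivideMix. At which representation level the repository mixes (token embeddings or a pooled sequence representation) is not stated here, so mixing token embeddings is an assumption, as are the function name and the toy inputs.

```python
import random


def mixup_embeddings(emb_a, emb_b, target_a, target_b, alpha, rng=random):
    """Interpolate two embedded sequences (lists of vectors) and their targets."""
    lam = rng.betavariate(alpha, alpha)
    # DivideMix keeps the mixed sample closer to the first input.
    lam = max(lam, 1.0 - lam)
    mixed = [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(emb_a, emb_b)]
    mixed_target = lam * target_a + (1.0 - lam) * target_b
    return mixed, mixed_target, lam


# Two toy sequences of three tokens with two embedding dimensions each (invented).
seq_a = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
seq_b = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
mixed, mixed_target, lam = mixup_embeddings(
    seq_a, seq_b, target_a=1.0, target_b=0.0, alpha=4.0, rng=random.Random(0))
print(lam, mixed_target)
```

Mixing after the embedding layer sidesteps the problem that discrete amino acid tokens cannot be averaged directly, which is presumably why this step sits in the forward pass.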

Sources for other published methods included in the benchmark

PCPS

PCM

NetChop 3.1 (20S and C-term)

SVM

Logistic Regression

NetCleave

PUUPL

Pepsickle