To develop a codebase for graph deep learning models that take protein structures as input, with a focus on residue-level prediction tasks and paired inputs (i.e. protein-protein, protein-peptide, pocket-ligand, and protein-ligand complexes).
The dataset consists of protein complexes with two sets of interacting chains, extracted from the MaSIF paper. Interface residues are defined as residues with at least one heavy atom within a distance threshold (6 Å) of the other chain-set. The train-test split is based on structural comparison of interfaces, so it should be robust.
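For reference, the 6 Å heavy-atom criterion could be computed along these lines with Biopython (a minimal sketch, not the code used to build the dataset; file names are illustrative):

```python
from Bio.PDB import NeighborSearch, PDBParser

parser = PDBParser(QUIET=True)
heavy_a = [a for a in parser.get_structure("A", "data/raw/1abc_A.pdb").get_atoms() if a.element != "H"]
heavy_b = [a for a in parser.get_structure("B", "data/raw/1abc_B.pdb").get_atoms() if a.element != "H"]

# A residue of chain-set A is an interface residue if any of its heavy atoms
# lies within 6 Å of a heavy atom of chain-set B
search = NeighborSearch(heavy_b)
interface_a = {
    atom.get_parent().get_full_id()
    for atom in heavy_a
    if search.search(atom.coord, 6.0, level="A")
}
```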
```
data/
├── raw/                   # PDB files of individual chain-sets (a chain-set can contain more than one chain)
├── training.txt           # list of interacting chain-set pairs: <pdbid>_<chainA>_<chainB>
├── testing.txt            # list of interacting chain-set pairs: <pdbid>_<chainA>_<chainB>
└── interface_labels.txt   # tab-separated: <pdbid_chainA> <pdbid_chainB> <chainA interface chain and residue numbers> <chainB interface chain and residue numbers>
```
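Reading the labels file could look like this (a hedged sketch; the delimiter inside the two residue columns is an assumption, not confirmed by the format description above):

```python
# Parse interface_labels.txt into {(id_a, id_b): (residues_a, residues_b)},
# assuming the residue columns are comma-separated tokens
labels = {}
with open("data/interface_labels.txt") as f:
    for line in f:
        id_a, id_b, res_a, res_b = line.rstrip("\n").split("\t")
        labels[(id_a, id_b)] = (res_a.split(","), res_b.split(","))
```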
Download mamba (Mambaforge) and install it:

```bash
chmod +x Mambaforge.sh
./Mambaforge.sh
```

To use with pascal or rtx8000 GPU nodes, start an interactive session and create the environment:

```bash
srun --nodes=1 --cpus-per-task=8 --mem=16G --gres=gpu:1 --partition=rtx8000,pascal --pty bash
mamba create -n hackathon pytorch torchvision torchaudio pytorch-cuda=11.7 pyg -c pytorch -c nvidia -c pyg
```

Exit the interactive node. On the worker node:

```bash
conda activate hackathon
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
pip install pytorch-lightning tensorboard 'jsonargparse[signatures]'
pip install "graphein[extras] @ git+https://github.com/AGoetzee/graphein.git@fix_esm_embeddings"
pip install git+https://github.com/Ninjani/egnn-pytorch.git
```
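A quick sanity check that the environment sees the GPU and that the main libraries import cleanly (an assumed verification step, not part of the repo):

```python
import torch
import torch_geometric
import pytorch_lightning as pl

# On a GPU node this should report CUDA 11.7 and True
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(torch_geometric.__version__, pl.__version__)
```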
Used to produce graphs of protein structures (and small molecules), with different methods for creating edges and different node and edge features. The documentation website is sparse in some areas, so look at the code instead. See `utils/load_protein_as_graph` for an example, and the sketch below.
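A minimal sketch of building a residue-level graph and converting it to PyTorch Geometric, assuming Graphein's `construct_graph` / `GraphFormatConvertor` API; the distance threshold and file path are illustrative:

```python
from functools import partial

from graphein.ml.conversion import GraphFormatConvertor
from graphein.protein.config import ProteinGraphConfig
from graphein.protein.edges.distance import add_distance_threshold
from graphein.protein.graphs import construct_graph

# Residue-level graph with edges between residues closer than 10 Å
config = ProteinGraphConfig(
    edge_construction_functions=[
        partial(add_distance_threshold, threshold=10.0, long_interaction_threshold=0)
    ]
)
g = construct_graph(config=config, pdb_path="data/raw/1abc_A.pdb")  # illustrative path

# Convert the NetworkX graph to a PyG object
pyg_graph = GraphFormatConvertor(src_format="nx", dst_format="pyg")(g)
```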
Graph deep learning library. Provides:

- ready-to-use models, e.g. the GAT in `models/single_models/GATModel`
- transforms to add edges, node features, and edge features to your graphs - see `dataloader/single_loader/ProteinDataModule`
- data and dataset objects for individual graphs (see `dataloader/single_loader`) and pairs of graphs with HeteroData (see `dataloader/pair_loader`)
- Heterogeneous graph learning for paired data - see `models/pair_models/CrossGAT` and the HeteroData sketch after this list
- LightningDataModule - see `dataloader/single_loader/ProteinDataModule` and `dataloader/pair_loader/ProteinPairDataModule`
- LightningModule - see `models/single_models/GATModel`
- logging to Tensorboard, checkpoints, early stopping, and other callbacks - see `config.yaml` and `callbacks.py`
- Trainer - see `config.yaml` and `main.py`
- Configuration - see `config.yaml` and `sbatch_train.sh`
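How a protein pair might be laid out as a PyG HeteroData object (a sketch under assumed node-type and relation names, not necessarily the schema in `dataloader/pair_loader`):

```python
import torch
from torch_geometric.data import HeteroData

pair = HeteroData()
pair["protein_1"].x = torch.randn(10, 32)  # node features, chain-set A (sizes illustrative)
pair["protein_2"].x = torch.randn(12, 32)  # node features, chain-set B

# Intra-chain edges within each graph
pair["protein_1", "intra", "protein_1"].edge_index = torch.randint(0, 10, (2, 40))
pair["protein_2", "intra", "protein_2"].edge_index = torch.randint(0, 12, (2, 48))

# Cross-edges connecting residues across the pair (e.g. k-NN across chain-sets)
pair["protein_1", "inter", "protein_2"].edge_index = torch.stack(
    [torch.randint(0, 10, (20,)), torch.randint(0, 12, (20,))]
)
```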
E(3)-Equivariant Graph Neural Networks. Usage:

```python
import torch
from egnn_pytorch import EGNN_Sparse

# Assume a PyG Data object with data.pos (3D coordinates) and data.x (node features)
egnn_layer = EGNN_Sparse(feats_dim=data.x.size(-1))

# EGNN_Sparse expects the coordinates concatenated in front of the features
new_x = torch.cat([data.pos, data.x], dim=-1)
new_x = egnn_layer(new_x, data.edge_index)
# The output keeps the same layout: the first 3 columns are the updated
# coordinates, the rest are the updated node features
new_x = new_x[:, 3:]
```
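Because each layer returns the same `[pos | features]` layout with updated values, several layers can be stacked directly. A sketch with a hypothetical helper module (not part of the repo):

```python
import torch
from torch import nn
from egnn_pytorch import EGNN_Sparse

class EGNNStack(nn.Module):
    """Hypothetical helper: a stack of EGNN_Sparse layers."""

    def __init__(self, feats_dim: int, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            EGNN_Sparse(feats_dim=feats_dim) for _ in range(n_layers)
        )

    def forward(self, pos, x, edge_index):
        h = torch.cat([pos, x], dim=-1)
        for layer in self.layers:
            h = layer(h, edge_index)
        return h[:, 3:]  # updated node features (coordinate columns dropped)
```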
- Data Preparation
    - Construct graphs from protein structures
    - Explore different featurisations
    - Explore different ways of making cross-edges for paired data
    - Load and batch data
- Architecture
    - Explore off-the-shelf graph-based models
    - Incorporate pair architectures (e.g. cross-attention)
    - Incorporate EGNN layers
    - Implement evaluation metrics (see the metrics sketch after this list)
- Training and Tracking
    - Train and validate the models
    - Save and load model checkpoints
    - Log, visualise, and track model performance
    - Optimise and tune model hyperparameters
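Residue-level interface prediction is a binary classification task per node, so evaluation could use torchmetrics (installed alongside pytorch-lightning). The metric choices below are an assumption, not the repo's fixed set:

```python
import torch
from torchmetrics.classification import BinaryAUROC, BinaryMatthewsCorrCoef

auroc = BinaryAUROC()
mcc = BinaryMatthewsCorrCoef()

preds = torch.sigmoid(torch.randn(100))  # per-residue interface probabilities
target = torch.randint(0, 2, (100,))     # 1 = interface residue
print(auroc(preds, target), mcc(preds, target))
```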
| Model | Nodes | Edges | EGNN layers | Aux | Who |
|---|---|---|---|---|---|
| Baseline | ESM | Distance | 0 | 0 | Jay |
| Complex | Everything + Surface | Everything | 3 | 0 | Jay |
| Baseline-Aux | ESM | Distance | 0 | 1 | Jay |
| Complex-Aux | Everything + Surface | Everything | 3 | 1 | Jay |
| Surface | Everything + Surface | Everything | 0 | 0 | Peter |
| No LLM | Everything - ESM | Everything | 3 | 0 | Lorenzo |
| Complex: LR + dropout | Everything + Surface | Everything | 3 | 0 | Elias |
| Baseline: LR + dropout | ESM | Distance | 0 | 0 | Elias |
| No embeddings | Everything - Meiler, Expasy, ESM | Everything | 0 | 0 | Arthur |
| Normalized embeddings | Everything + normalized ESM (124) | Everything | 0 | 0 | Lorenzo |