Structure-Graph-Pair-Hackathon

Objective

To develop a codebase for graph deep learning models that take protein structures as input, with a focus on residue-level prediction tasks and paired inputs (e.g. protein-protein, protein-peptide, pocket-ligand, protein-ligand).

Dataset

The dataset consists of protein complexes, each with two sets of interacting chains, taken from the MaSIF paper. Interface residues are defined as residues that have at least one heavy atom within a distance threshold (6 Å) of the other chain set. The train-test split is based on structural comparison of interfaces, so it should be robust against leakage of similar interfaces between the two sets.
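
The interface definition above can be sketched in plain Python. This is only an illustration, not the code used to produce interface_labels.txt: the function name `interface_residues` and the atom layout `(chain, residue_number, x, y, z)` are assumptions made for the example, the input is assumed to contain heavy atoms only, and "within" is read here as "less than or equal to" the threshold.

```python
import math

def interface_residues(atoms_a, atoms_b, threshold=6.0):
    """Residues of chain-set A with a heavy atom within `threshold` (in Å) of chain-set B.

    Each atom is a hypothetical (chain, residue_number, x, y, z) tuple.
    """
    interface = set()
    for chain, resnum, *xyz in atoms_a:
        if (chain, resnum) in interface:
            continue  # this residue is already labelled as interface
        for _, _, *other_xyz in atoms_b:
            if math.dist(xyz, other_xyz) <= threshold:
                interface.add((chain, resnum))
                break
    return sorted(interface)

# Toy coordinates: residue A1 is 4 Å from B7, residue A2 is 16 Å away
atoms_a = [("A", 1, 0.0, 0.0, 0.0), ("A", 2, 20.0, 0.0, 0.0)]
atoms_b = [("B", 7, 4.0, 0.0, 0.0)]
print(interface_residues(atoms_a, atoms_b))
```

The all-against-all loop is quadratic in the number of atoms; for real structures a spatial index (for example a k-d tree) would likely be a better choice.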

data/
    raw/
        PDB files of individual chain-sets (can be more than one chain in a chain-set)
    training.txt
        List of interacting chain-set pairs <pdbid>_<chainA>_<chainB>
    testing.txt
        List of interacting chain-set pairs <pdbid>_<chainA>_<chainB>
    interface_labels.txt
        Tab-separated list of <pdbid_chainA> <pdbid_chainB> <chainA interface chain and residue numbers> <chainB interface chain and residue numbers>
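
A minimal sketch of reading the pair lists, assuming that PDB IDs contain no underscore and that the chain-set IDs used in interface_labels.txt are the PDB ID joined to the chain letters with an underscore. The function names and the example IDs are made up for illustration.

```python
def split_pair_id(pair_id):
    """Split '<pdbid>_<chainA>_<chainB>' into the two chain-set IDs."""
    pdbid, chains_a, chains_b = pair_id.strip().split("_")
    return f"{pdbid}_{chains_a}", f"{pdbid}_{chains_b}"

def read_pairs(lines):
    """Parse an iterable of lines (e.g. an open training.txt), skipping blank lines."""
    return [split_pair_id(line) for line in lines if line.strip()]

# Hypothetical entries; the second one has a two-chain chain-set
example = ["1ABC_A_B\n", "2XYZ_AB_C\n", "\n"]
print(read_pairs(example))
```

With the real files, `read_pairs(open("data/training.txt"))` would be the equivalent call.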

Environment setup

Download mamba

chmod +x Mambaforge.sh
./Mambaforge.sh

To use with pascal or rtx8000 GPU nodes:

srun --nodes=1 --cpus-per-task=8 --mem=16G --gres=gpu:1 --partition=rtx8000,pascal --pty bash
mamba create -n hackathon pytorch torchvision torchaudio pytorch-cuda=11.7 pyg -c pytorch -c nvidia -c pyg

Exit the interactive node

On the worker node:

conda activate hackathon
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
pip install pytorch-lightning tensorboard 'jsonargparse[signatures]'
pip install "graphein[extras] @ git+https://github.com/AGoetzee/graphein.git@fix_esm_embeddings"
pip install git+https://github.com/Ninjani/egnn-pytorch.git

Libraries

Graphein

Used to build graphs from protein structures (and small molecules), with different methods for creating edges and different node and edge features. The documentation website is sparse in some areas; look at the code instead. See utils/load_protein_as_graph for an example.

Pytorch-geometric

Graph deep learning library

Provides:

- ready-to-use models, e.g. the GAT in models/single_models/GATModel
- transforms to add edges, node features, and edge features to your graphs (see dataloader/single_loader/ProteinDataModule)
- data and dataset objects for individual graphs (see dataloader/single_loader) and for pairs of graphs with HeteroData (see dataloader/pair_loader)
- heterogeneous graph learning for paired data (see models/pair_models/CrossGAT)

Pytorch-lightning

- LightningDataModule (see dataloader/single_loader/ProteinDataModule and dataloader/pair_loader/ProteinPairDataModule)
- LightningModule (see models/single_models/GATModel)
- Logging to Tensorboard, checkpoints, early stopping, and other callbacks (see config.yaml and callbacks.py)
- Trainer (see config.yaml and main.py)
- Configuration (see config.yaml and sbatch_train.sh)

egnn-pytorch

E(3)-Equivariant Graph Neural Networks

Usage:

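import torch
from egnn_pytorch import EGNN_Sparse

egnn_layer = EGNN_Sparse(feats_dim=in_channels)
# Inside the `forward` function, given a Data object with data.pos and data.x attributes:
pos, x = data.pos, data.x
# The layer expects the 3D coordinates in the first 3 columns, followed by the node features
new_x = torch.cat([pos, x], dim=-1)
new_x = egnn_layer(new_x, edge_index)
# Drop the coordinate columns of the output and keep the updated node features
new_x = new_x[:, 3:]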

Tasks

Data Preparation
- Construct graphs from protein structures
- Explore different featurisations
- Explore different ways of making cross-edges for paired data
- Load and batch data
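
One possible way of making cross-edges is a plain distance cutoff between the nodes of the two graphs, sketched below in pure Python. The function name `distance_cross_edges` and the 8 Å cutoff are assumptions for illustration, not settings taken from this repository.

```python
import math

def distance_cross_edges(pos_a, pos_b, cutoff=8.0):
    """Return [sources, targets] for node pairs (i in A, j in B) within `cutoff`."""
    sources, targets = [], []
    for i, a in enumerate(pos_a):
        for j, b in enumerate(pos_b):
            if math.dist(a, b) <= cutoff:
                sources.append(i)
                targets.append(j)
    return [sources, targets]

# Toy node coordinates for the two graphs
pos_a = [(0.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
pos_b = [(5.0, 0.0, 0.0), (3.0, 4.0, 0.0), (50.0, 0.0, 0.0)]
cross_edge_index = distance_cross_edges(pos_a, pos_b)
print(cross_edge_index)
```

The two-row layout mirrors the `edge_index` convention of PyTorch Geometric, so the result could be converted to a tensor and stored as a cross edge type in a HeteroData object.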
Architecture
- Explore off-the-shelf graph-based models
- Incorporate pair architectures (e.g. cross-attention)
- Incorporate EGNN layers
- Implement evaluation metrics
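
For residue-level interface prediction, the evaluation metrics could start from the binary confusion counts. The sketch below is one possible choice (precision, recall, F1 and Matthews correlation coefficient), not the set of metrics prescribed by this repository; the function name `binary_metrics` is hypothetical, and undefined ratios are reported as 0.0 by convention.

```python
import math

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC for 0/1 labels (1 = interface residue)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Toy labels for six residues
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Interface residues are usually a minority of all residues, so MCC or precision and recall tend to be more informative than plain accuracy.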
Training and Tracking
- Train and validate the models
- Save and load model checkpoints
- Log, visualise and track model(s) performance
- Optimise and tune model hyperparameters

Different training sets

Model	Nodes	Edges	EGNN	Aux	Who
Baseline	ESM	Distance	0	0	Jay
Complex	Everything + Surface	Everything	3	0	Jay
Baseline-Aux	ESM	Distance	0	1	Jay
Complex-Aux	Everything + Surface	Everything	3	1	Jay
Surface	Everything + Surface	Everything	0	0	Peter
no LLM	Everything - ESM	Everything	3	0	Lorenzo
Complex: LR + dropout	Everything + Surface	Everything	3	0	Elias
Baseline: LR + dropout	ESM	Distance	0	0	Elias
No embeddings	Everything - Meiler, Expasy, ESM	Everything	0	0	Arthur
Normalized embeddings	Everything + normalized ESM (124)	Everything	0	0	Lorenzo
