Protein_affinity


Pre-training-based protein affinity prediction

1. Data preprocessing

To generate a structured protein-interaction dataset, run preprocess.py.

For example, if your target protein is 9606.ENSP00000217109:

python preprocess.py --protein_idx 9606.ENSP00000217109 --out_dir data

You will get a structured dataset in the data folder with the following columns (a loading sketch follows the list):

  • item_id_a: the target protein A; in this example, 9606.ENSP00000217109.
  • item_id_b: the protein B that interacts (or not) with protein A.
  • sequence_a: Amino acid sequence of protein A.
  • sequence_b: Amino acid sequence of protein B.
  • label: 0 or 1, where 1 means the pair interacts (has affinity) and 0 means it does not.
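
Once preprocess.py has finished, you can inspect the dataset with pandas. This is a minimal sketch; the output file name used here (data/9606.ENSP00000217109.csv) is an assumption and should be adjusted to whatever preprocess.py actually writes into --out_dir.

import pandas as pd

# Load the structured dataset produced by preprocess.py.
# NOTE: the file name below is an assumption; adjust it to the actual
# file written into the --out_dir folder.
df = pd.read_csv("data/9606.ENSP00000217109.csv")

# Columns: item_id_a, item_id_b, sequence_a, sequence_b, label
positives = df[df["label"] == 1]
print(f"{len(positives)} interacting pairs out of {len(df)} total")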

2. Get embeddings of amino acids

To get an embedding vector for each protein, load and use a pretrained ESM model as follows:

import torch
import esm

# Load ESM-1b model
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()  # disable dropout for deterministic results
batch_converter = alphabet.get_batch_converter()

# Prepare data (first 2 sequences from ESMStructuralSplitDataset superfamily / 4)
data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein2 with mask","KALTARQQEVFDLIRD<mask>ISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE"),
    ("protein3",  "K A <mask> I S Q"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract per-residue representations (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)
token_representations = results["representations"][33]

# Generate per-sequence representations via averaging
# NOTE: token 0 is always a beginning-of-sequence token, so the first residue is token 1.
sequence_representations = []
for i, (_, seq) in enumerate(data):
    sequence_representations.append(token_representations[i, 1 : len(seq) + 1].mean(0))
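
Each entry of sequence_representations is a 1280-dimensional vector (the ESM-1b embedding size). These per-protein embeddings can then be fed to a downstream affinity classifier. The following is a minimal sketch, assuming a simple feed-forward network over the concatenated embeddings of protein A and protein B; the architecture and hyperparameters are illustrative and not the repository's actual model.

import torch
import torch.nn as nn

# Illustrative affinity classifier (not the repository's actual model):
# concatenate the two per-sequence ESM-1b embeddings (1280 dims each)
# and predict the interaction label.
class AffinityClassifier(nn.Module):
    def __init__(self, embed_dim=1280, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, emb_a, emb_b):
        # emb_a, emb_b: (batch, embed_dim) per-sequence representations
        return self.net(torch.cat([emb_a, emb_b], dim=-1)).squeeze(-1)

# Example: score one pair using the representations computed above
classifier = AffinityClassifier()
emb_a = sequence_representations[0].unsqueeze(0)  # embedding of protein1
emb_b = sequence_representations[1].unsqueeze(0)  # embedding of protein2
prob = torch.sigmoid(classifier(emb_a, emb_b))  # probability of label 1 (affinity)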