How can I use INTREPPPID with a multi-FASTA file?

Question

How can I use INTREPPPID with a multi-FASTA file?

Closed this issue 4 months ago · 6 comments

I am following up on our previous discussion thread, where you suggested I could use INTREPPPID for Plasmodium. I installed it yesterday using the following steps:

mamba create -n interppid python=3.10
mamba activate interppid
pip install intrepppid

# When I run intrepppid, I only see the train command. There is no test command, contrary to what the documentation says.
NAME
    intrepppid - The INTREPPPID CLI

SYNOPSIS
    intrepppid COMMAND

DESCRIPTION
    The INTREPPPID CLI

COMMANDS
    COMMAND is one of the following:

     train

I can't see the test sub-command that is described in the help section of your documentation.

The website https://ppi.bio/infer/proteome/submit doesn't accept any sequence: when I hit the submit button, the page reloads with an empty box asking me to input the amino acid sequence again.

Is there a way to supply a multi-FASTA file and a .tsv file that specifies the pairs of proteins to test for protein-protein interactions, as D-SCRIPT does?

Answer 1 · 2024-05-01T17:44:02.000Z

Hi @Rohit-Satyam ,

Thanks for your continued interest, and sorry to hear you've been having a hard time getting INTREPPPID to work for your use-case.

You're correct, there is no CLI for inference in INTREPPPID. Your two options are:

1. Use PPI.bio
2. Write a short Python script with the INTREPPPID library.

I think the second option is likely your best bet for your specific use-case.

The 'test' sub-command snuck into the documentation (it exists in an incomplete state; I'll try to complete it if I find the time). I'll have to remove the mention of it.

1. Using PPI.bio

PPI.bio only accepts a single amino acid sequence at a time (no FASTA headers). Inputting anything else will result in an "Invalid Amino Acid Sequence" error.

I'll have to add a more descriptive error here, apologies.

Just a heads-up, the PPI.bio website currently computes the interactions between your provided amino acid sequence and all the proteins of one of the supported organisms found in the UniProt database. These are currently H. sapiens, M. musculus, D. rerio, D. melanogaster, C. elegans, and A. thaliana. More information can be found on the Help page. If you're looking to infer interactions between a Plasmodium protein and the proteins of one of the organisms above, then PPI.bio will fit your use-case. Otherwise, you'll need to use option 2.
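
Since PPI.bio expects a bare amino acid sequence, a FASTA record has to be reduced to its residues before it is pasted into the form. The following is a minimal sketch of that step, not part of INTREPPPID or PPI.bio: the function name `fasta_to_plain_sequence` is hypothetical, and restricting input to the 20 standard residues is an assumption, since this thread does not state which characters PPI.bio accepts.

```python
# Assumption: only the 20 standard amino acid letters are treated as valid.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def fasta_to_plain_sequence(record: str) -> str:
    """Strip the header and line breaks from a single FASTA record."""
    lines = [ln.strip() for ln in record.splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    if len(headers) > 1:
        raise ValueError("expected a single FASTA record, found several")

    # Join the residue lines into one uppercase string.
    sequence = "".join(ln for ln in lines if not ln.startswith(">")).upper()
    if not sequence:
        raise ValueError("no sequence found")

    unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"unexpected characters: {''.join(unexpected)}")
    return sequence


record = """>example_protein hypothetical header
MANQ
RLS
"""
print(fasta_to_plain_sequence(record))  # prints MANQRLS
```

A multi-record input raises a `ValueError` instead of silently taking the first record, because the form only handles one sequence at a time.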

2. Write a short python script

The "Inference" section of the User Guide outlines, step by step, how to use a Python script to infer pairwise protein interactions using INTREPPPID:

https://emad-combine-lab.github.io/intrepppid/guide.html#inference

I'm realizing that I don't currently have pretrained weights available. Please allow me a couple of days to organize and upload them. Alternatively, you can follow the instructions to train INTREPPPID on the datasets I've made available.
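
For the multi-FASTA plus .tsv use case raised in the question, the list of sequence pairs can be assembled with the Python standard library before any encoding or inference takes place. This is a minimal sketch under stated assumptions: the helpers `read_fasta` and `read_pairs` are hypothetical and not part of INTREPPPID, the .tsv is assumed to hold two tab-separated FASTA identifiers per row, and the identifier is assumed to be the first whitespace-delimited token of each header.

```python
import csv
import io


def read_fasta(handle):
    """Parse a multi-FASTA stream into a dict of {identifier: sequence}."""
    sequences = {}
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current_id is not None:
                sequences[current_id] = "".join(chunks)
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without an identifier")
            # Assumption: the identifier is the first token after ">".
            current_id = tokens[0]
            chunks = []
        elif current_id is not None:
            chunks.append(line)
    if current_id is not None:
        sequences[current_id] = "".join(chunks)
    return sequences


def read_pairs(handle, sequences):
    """Read tab-separated identifier pairs and look up their sequences."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank rows, short rows, and comment rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        id1, id2 = row[0].strip(), row[1].strip()
        # A KeyError here means the .tsv names a protein missing from the FASTA.
        pairs.append((id1, id2, sequences[id1], sequences[id2]))
    return pairs


# In-memory stand-ins for the two files; the identifiers are made up.
fasta_text = """>protA example description
MANQ
RLS
>protB
MGPLSS
>protC
MQQNLSS
>protD
MPWNLS
"""
pairs_text = "protA\tprotB\nprotC\tprotD\n"

sequences = read_fasta(io.StringIO(fasta_text))
pairs = read_pairs(io.StringIO(pairs_text), sequences)
sequence_pairs = [(s1, s2) for _, _, s1, s2 in pairs]
print(sequence_pairs)
```

With real files, the `io.StringIO` objects would be replaced by `open(...)` handles. The resulting `sequence_pairs` list has the same shape as the one used in the User Guide's inference example.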

Conclusion

I'll be making a few TODO issues here to:

Upload the pre-trained weights
Update the PPI.bio error message to be more descriptive
Remove mention of the 'test' sub-command in the documentation.

Thanks for your patience!

Answer 2 · 2024-07-25T08:40:04.000Z

Hi @jszym

I tried writing a short Python script for inference using the example in your documentation, but I am stuck at the inference step:

sequence_pairs = [
   ("MANQRLS","MGPLSS"),
   ("MQQNLSS","MPWNLS"),
]

from intrepppid.data.ppi_oma import IntrepppidDataset
import sentencepiece as sp

trunc_len = 1500
spp = sp.SentencePieceProcessor(model_file="weights/spm.model")

encoded_sequence_pairs = []

for p1, p2 in sequence_pairs:
    x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
    x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)

from intrepppid import intrepppid_network

# steps_per_epoch is 0 here because it is not used for inference
net = intrepppid_network(0)

net.eval()
    y_hat_logits = net(x1, x2)
    # The forward pass returns logits, so you need to activate with sigmoid
    y_hat = torch.sigmoid(y_hat_logits)

At the step y_hat_logits = net(x1, x2), I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 y_hat_logits = net(x1, x2)

File ~/miniconda3/envs/interppid/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /data/phdProjects/project1/dl/annoSeqDl/tools/intrepppid/intrepppid/e2e/e2e_triplet.py:106, in TripletE2ENet.forward(self, x1, x2)
    105 def forward(self, x1, x2):
--> 106     z1 = self.encoder(x1)
    107     z2 = self.encoder(x2)
    109     y_hat = self.head(z1, z2)

File ~/miniconda3/envs/interppid/lib/python3.10/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /data/phdProjects/project1/dl/annoSeqDl/tools/intrepppid/intrepppid/encoders/awd_lstm.py:149, in AWDLSTMEncoder.forward(self, x)
    147 def forward(self, x):
    148     # Truncate to the longest sequence in batch
--> 149     max_len = torch.max(torch.sum(x != 0, axis=1))
    150     x = x[:, :max_len]
    152     x = self.embedding_dropout(self.embedder, x, p=self.embedding_droprate)

TypeError: sum() received an invalid combination of arguments - got (numpy.ndarray, axis=int), but expected one of:
 * (Tensor input, *, torch.dtype dtype)
 * (Tensor input, tuple of ints dim, bool keepdim, *, torch.dtype dtype, Tensor out)
 * (Tensor input, tuple of names dim, bool keepdim, *, torch.dtype dtype, Tensor out)

IntrepppidDataset.static_encode returns a NumPy array, but net expects a tensor. How do I convert it? Also, please add import torch to the documentation, as the example fails when run as is. Finally, what is the purpose of encoded_sequence_pairs = []?

Answer 3 · 2024-07-27T13:44:06.000Z

Hi, this code works for me.

import torch
from intrepppid.data.ppi_oma import IntrepppidDataset
from intrepppid import intrepppid_network
import sentencepiece as sp

# Example protein sequence pairs
sequence_pairs = [("MANQRLS", "MGPLSS"), ("MQQNLSS", "MPWNLS")]
trunc_len = 1500
spp = sp.SentencePieceProcessor(model_file="weights/spm.model")

encoded_sequence_pairs = []
for p1, p2 in sequence_pairs:
    x1 = torch.tensor(IntrepppidDataset.static_encode(trunc_len, spp, p1))
    x2 = torch.tensor(IntrepppidDataset.static_encode(trunc_len, spp, p2))
    encoded_sequence_pairs.append((x1, x2))

# Initialize network and perform inference
net = intrepppid_network(0)
net.eval()

output_file = "inference_results.txt"

with open(output_file, "w") as f:
    for (p1, p2), (x1, x2) in zip(sequence_pairs, encoded_sequence_pairs):
        x1 = x1.unsqueeze(0)  # Adding batch dimension
        x2 = x2.unsqueeze(0)  # Adding batch dimension
        y_hat_logits = net(x1, x2)
        y_hat = torch.sigmoid(y_hat_logits)
        result = f"Protein1: {p1}, Protein2: {p2} -> Interaction Probability: {y_hat.item()}\n"
        print(result)
        f.write(result)

print(f"Results saved to {output_file}")

I hope this helps. Please confirm the prediction scores I get for the example sequences:

Protein1: MANQRLS, Protein2: MGPLSS -> Interaction Probability: 0.46791645884513855
Protein1: MQQNLSS, Protein2: MPWNLS -> Interaction Probability: 0.46769070625305176
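
As a reading aid for these numbers: the network returns logits and the sigmoid maps them to probabilities, so a score can be mapped back to its logit with the inverse function. The sketch below uses only the standard library and the two scores printed above. The function names are illustrative and nothing here calls INTREPPPID.

```python
import math


def sigmoid(logit_value):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit_value))


def logit(probability):
    """Inverse of the sigmoid: map a probability back to a logit."""
    return math.log(probability / (1.0 - probability))


# The two scores reported above.
scores = [0.46791645884513855, 0.46769070625305176]
for p in scores:
    print(f"probability {p:.4f} -> logit {logit(p):.4f}")

# A logit of exactly zero corresponds to a probability of 0.5.
print(sigmoid(0.0))
```

Both scores correspond to small negative logits that are very close to each other, which means the network's raw outputs for the two pairs are nearly identical.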

Answer 4 · 2024-07-28T02:50:07.000Z

@Rohit-Satyam thanks so much for your patience.

While I haven't had the time to verify @sheikhkayenat's numbers, the code generally looks fine, except that it omits loading the model weights (as best I can tell, the model here has random weights). To load the weights, you must add:

import torch
from intrepppid import intrepppid_network

# steps_per_epoch is 0 here because it is not used for inference
net = intrepppid_network(0)

net.eval()

# CHECKPOINT_PATH is a placeholder: set it to the path of your checkpoint file
chkpt = torch.load(CHECKPOINT_PATH)

net.load_state_dict(chkpt['state_dict'])

I do, however, have what I hope is good news for folks following this thread. I've added an infer command to INTREPPPID, and have just pushed it onto the repo. You can find details about it in the documentation, which I've amended to fix some issues pointed out by @Rohit-Satyam .

I hope this answers your question! I'll leave this issue open until you confirm things are working for you, don't hesitate to let me know if you have any questions.

Answer 5 · 2024-07-28T02:54:52.000Z

Whoops, that was the wrong link to the documentation. This is the correct one:
https://emad-combine-lab.github.io/intrepppid/guide.html#using-the-cli

Answer 6 · 2024-08-08T16:08:53.000Z

I'll be closing the issue for now, but don't hesitate to reopen or file a new issue if anything comes up.