Generating Embedding of Protein Sequence

Question

anonimoustt opened this issue a year ago · 11 comments

Hi, is it possible to use the https://huggingface.co/zjunlp/OntoProtein model to get the embedding of a protein sequence?

Answer 1 · 2024-01-21T23:23:48.000Z

Hi, is it possible to use the https://huggingface.co/zjunlp/OntoProtein model to get the embedding of a protein sequence?

Answer 2 · 2024-01-22T03:02:13.000Z

Yes, you can use the model to get the embedding of a protein sequence by applying mean pooling on the hidden states after the encoder.
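One way to read "mean pooling on the hidden states" is sketched here in Python with PyTorch: average the token vectors of each sequence, using the attention mask to skip padded positions. The helper name mean_pool and the toy tensors are made up for illustration; with a Hugging Face model the inputs would be outputs.last_hidden_state and the tokenizer's attention_mask.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp guards against division by zero for an all-padding row
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# Toy stand-in for model output: 2 sequences, 4 token positions, hidden size 3
hidden = torch.tensor([
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0], [9.0, 9.0, 9.0]],
    [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0], [6.0, 6.0, 6.0]],
])
# The first sequence has two real tokens followed by two padded positions
mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])

pooled = mean_pool(hidden, mask)  # one vector per sequence, shape (2, 3)
```

The result is one fixed-size vector per sequence, whatever the sequence length. The attention mask also covers any special tokens the tokenizer adds, so those are part of the average unless they are masked out separately.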

Answer 3 · 2024-01-22T03:44:53.000Z

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("zjunlp/OntoProtein")
model = AutoModel.from_pretrained("zjunlp/OntoProtein")

def preem(seqf):
    # Tokenize protein sequences (padded and truncated to at most 320 tokens)
    inputs1 = tokenizer(seqf, padding=True, truncation=True, return_tensors='pt', max_length=320)
    # Keep every tokenizer output except token_type_ids
    inputs = {kk: vv for kk, vv in inputs1.items() if kk != 'token_type_ids'}
    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
    em = last_hidden_states
    return em

Here last_hidden_states gives embeddings of size 1024 with OntoProtein. Is it possible to reduce the vector size to 320, or to some other smaller size?

Answer 4 · 2024-01-22T04:26:51.000Z

Sorry, there might be methods to compress the vector, but we are not certain how much information loss this could cause.

Answer 5 · 2024-01-22T05:14:41.000Z

I think UMAP works fine here.
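As a rough illustration of compressing pooled embeddings, here is a sketch that uses PCA rather than UMAP. PCA is swapped in only because it is linear and needs nothing beyond PyTorch; it is not the method suggested in this thread, and how much information either method loses is not established here. The function name pca_reduce and the random stand-in data are hypothetical.

```python
import torch


def pca_reduce(embeddings: torch.Tensor, k: int) -> torch.Tensor:
    """Project (n_samples, dim) embeddings onto their top-k principal components."""
    n, d = embeddings.shape
    if k > min(n, d):
        raise ValueError("k cannot exceed min(n_samples, dim)")
    # PCA works on mean-centered data
    centered = embeddings - embeddings.mean(dim=0, keepdim=True)
    # Rows of vh are principal directions, ordered by decreasing singular value
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return centered @ vh[:k].T


torch.manual_seed(0)
# Random stand-in for 50 pooled 1024-dim protein embeddings (not real model output)
fake_embeddings = torch.randn(50, 1024)
reduced = pca_reduce(fake_embeddings, k=16)  # shape (50, 16)
```

The projection is fitted on the whole set of embeddings, so the target size cannot exceed the number of sequences: reaching 320 dimensions this way would need at least 320 embedded sequences.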

Answer 6 · 2024-02-01T03:48:01.000Z

Hi,
Is it possible to get a score for a protein, where the score is the weight of the sequence? The higher the weight, the more important the protein. For instance, take proteins P1 and P2: if P2 has weight 0.9 and P1 has weight 0.85, then P2 is the more important sequence because it has the higher weight. Can OntoProtein assign such weights to sequences?

Answer 7 · 2024-02-05T12:55:56.000Z

Sorry, our method cannot provide the importance of proteins.

Answer 8 · 2024-02-07T15:47:06.000Z

Hi, do you have any further questions?

Answer 9 · 2024-02-07T15:54:46.000Z

Hi,
In the https://www.zjukg.org/project/ProteinKG25/ knowledge graph data I see the relation2id file, which lists different types of relations. Is it possible to learn which relations are closer to the phosphorylation function? Secondly, I see a NOT|located_in relation type, so there are negative relations, right?
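A simple name-based pass over the relation names can at least separate out the NOT| relations and those that mention phosphorylation. This is only a sketch: the helper split_relation is hypothetical, the list is a small sample of names of the kind found in relation2id, and reading NOT| as negation assumes ProteinKG25 follows the Gene Ontology annotation convention, where a NOT qualifier marks an annotation that explicitly does not hold. String matching does not measure how semantically close two relations are.

```python
from typing import Tuple


def split_relation(name: str) -> Tuple[bool, str]:
    """Return (is_negated, base_relation) for a relation name."""
    prefix = "NOT|"
    negated = name.startswith(prefix)
    base = name[len(prefix):] if negated else name
    return negated, base


# A small sample of relation names (assumed format: optional "NOT|" prefix + relation)
relations = [
    "involved_in_phosphorylation",
    "enables_transferase_activity",
    "located_in",
    "NOT|located_in",
    "NOT|involved_in",
]

# Relations carrying the NOT| qualifier
negated = [r for r in relations if split_relation(r)[0]]
# Relations whose base name literally mentions phosphorylation
phospho = [r for r in relations if "phosphorylation" in split_relation(r)[1]]
```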

Answer 10 · 2024-02-08T15:45:28.000Z

Hi,
In the https://www.zjukg.org/project/ProteinKG25/ knowledge graph data I see the relation2id file, which lists different types of relations. Is it possible to learn which relations are closer to the phosphorylation function? Secondly, I see a NOT|located_in relation type, so there are negative relations, right?

Answer 11 · 2024-02-16T21:31:48.000Z

Hi,

I see the following relations in the knowledge graph:
['enables_nucleotide_binding', 'enables_metal_ion_binding', 'enables_transferase_activity', 'enables', 'involved_in_signal_transduction', 'involved_in_regulation_of_transcription,_DNA-templated', 'involved_in_phosphorylation', 'involved_in', 'part_of_nucleus', 'part_of_cytoplasm', 'part_of', 'part_of_cytosol', 'part_of_membrane', 'colocalizes_with', 'involved_in_proteolysis', 'NOT|involved_in', 'part_of_integral_component_of_membrane', 'involved_in_cation_transport', 'involved_in_cellular_response_to_DNA_damage_stimulus', 'part_of_mitochondrion', 'involved_in_metabolic_process', 'involved_in_cell_cycle', 'involved_in_cell_division', 'involved_in_lipid_metabolic_process', 'enables_RNA_binding', 'acts_upstream_of_or_within', 'enables_catalytic_activity', 'enables_hydrolase_activity', 'enables_DNA_binding', 'contributes_to', 'involved_in_carbohydrate_metabolic_process', 'involved_in_translation', 'part_of_extracellular_region', 'acts_upstream_of_or_within_positive_effect', 'involved_in_protein_transport', 'NOT|enables', 'acts_upstream_of', 'part_of_ribosome', 'involved_in_transmembrane_transport', 'NOT|part_of', 'NOT|involved_in_tRNA_processing', 'is_active_in', 'located_in', 'NOT|located_in', 'acts_upstream_of_positive_effect']

Which relation is the most important for a protein sequence?