zjunlp/OntoProtein

Generating Embedding of Protein Sequence

anonimoustt opened this issue · 11 comments

Hi, is it possible to use the https://huggingface.co/zjunlp/OntoProtein model to get the embedding of a protein sequence?

Yes, you can use the model to get the embedding of a protein sequence by applying mean pooling on the hidden states after the encoder.
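The mean pooling mentioned above could be sketched roughly as follows. This is only an illustration, not code from the OntoProtein repository: the helper name `mean_pool` is hypothetical, and it assumes the encoder output has shape (batch, seq_len, hidden) and that the tokenizer's `attention_mask` marks padding positions with 0.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings per sequence, ignoring padded positions.

    last_hidden_state: float tensor of shape (batch, seq_len, hidden)
    attention_mask:    tensor of shape (batch, seq_len), 1 = real token, 0 = padding
    Note: special tokens such as [CLS] and [SEP] have mask 1, so they are
    included in the average in this sketch.
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Tiny synthetic check: the third position is padding and must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)
```

With a real model, `last_hidden_state` would come from `model(**inputs).last_hidden_state` and `attention_mask` from the tokenizer output, giving one fixed-size vector per sequence.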

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("zjunlp/OntoProtein")
model = AutoModel.from_pretrained("zjunlp/OntoProtein")

def preem(seqf):
    # Tokenize protein sequences (padded and truncated to at most 320 tokens)
    inputs1 = tokenizer(seqf, padding=True, truncation=True, return_tensors='pt', max_length=320)
    # Keep every tokenizer output except token_type_ids
    inputs = {kk: vv for kk, vv in inputs1.items() if kk != 'token_type_ids'}
    # Compute token embeddings without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    last_hidden_states = outputs.last_hidden_state
    em = last_hidden_states
    return em

Here last_hidden_states gives embeddings of size 1024 with OntoProtein. Is it possible to resize the vectors to 320, or to some other reduced size?

Sorry, there might be methods to compress the vector, but we are not certain how much information loss this could cause.

I think UMAP works fine here.
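As a rough way to gauge the information loss discussed above, here is a minimal sketch that uses PCA (a different technique from UMAP, swapped in because it reports how much variance is retained). The function name `pca_reduce` is hypothetical, and the input is assumed to be a matrix of pooled embeddings with one row per protein. Note that PCA cannot produce more meaningful components than the number of sequences, so reaching 320 dimensions needs more than 320 sequences.

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Project embeddings onto their top principal components.

    embeddings:   array-like of shape (n_samples, n_features), e.g. (n, 1024)
    n_components: target dimensionality, e.g. 320
    Returns (reduced, retained), where retained is the fraction of total
    variance kept by the projection.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    Xc = X - X.mean(axis=0)                       # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, Vt.shape[0])            # cannot exceed min(n_samples, n_features)
    reduced = Xc @ Vt[:k].T                       # (n_samples, k)
    var = S ** 2
    total = var.sum()
    retained = float(var[:k].sum() / total) if total > 0 else 1.0
    return reduced, retained

# Synthetic stand-in for pooled embeddings (random data, for illustration only).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 1024))
reduced, retained = pca_reduce(fake_embeddings, 10)
print(reduced.shape, round(retained, 3))
```

The `retained` value gives a first indication of how much is lost at a given target size, though variance retained is not the same as downstream task performance.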

Hi,
Is it possible to get a score for a protein, where the score is the weight of the sequence? The higher the weight, the more important the protein. For instance, take proteins P1 and P2: if P2 has weight 0.9 and P1 has weight 0.85, then P2 is the more important sequence because it has the higher weight. Can OntoProtein assign such weights to sequences?

Sorry, our method cannot provide the importance of proteins.

zxlzr commented

Hi, do you have any further questions?

Hi,
In the https://www.zjukg.org/project/ProteinKG25/ knowledge graph data, I see the relation2id file, which lists different types of relations. Is it possible to learn which relations are closest to the phosphorylation function? Secondly, I see a NOT|located_in relation type, so there are negative relations, right?

Hi,

I see the following relations in the knowledge graph:
['enables_nucleotide_binding', 'enables_metal_ion_binding', 'enables_transferase_activity', 'enables', 'involved_in_signal_transduction', 'involved_in_regulation_of_transcription,_DNA-templated', 'involved_in_phosphorylation', 'involved_in', 'part_of_nucleus', 'part_of_cytoplasm', 'part_of', 'part_of_cytosol', 'part_of_membrane', 'colocalizes_with', 'involved_in_proteolysis', 'NOT|involved_in', 'part_of_integral_component_of_membrane', 'involved_in_cation_transport', 'involved_in_cellular_response_to_DNA_damage_stimulus', 'part_of_mitochondrion', 'involved_in_metabolic_process', 'involved_in_cell_cycle', 'involved_in_cell_division', 'involved_in_lipid_metabolic_process', 'enables_RNA_binding', 'acts_upstream_of_or_within', 'enables_catalytic_activity', 'enables_hydrolase_activity', 'enables_DNA_binding', 'contributes_to', 'involved_in_carbohydrate_metabolic_process', 'involved_in_translation', 'part_of_extracellular_region', 'acts_upstream_of_or_within_positive_effect', 'involved_in_protein_transport', 'NOT|enables', 'acts_upstream_of', 'part_of_ribosome', 'involved_in_transmembrane_transport', 'NOT|part_of', 'NOT|involved_in_tRNA_processing', 'is_active_in', 'located_in', 'NOT|located_in', 'acts_upstream_of_positive_effect']

Which relation is the most important for a protein sequence?
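For anyone exploring the relation names listed above, here is a small sketch that separates the `NOT|`-prefixed relations from the rest and picks out those that mention phosphorylation by name. It uses only a subset of the names quoted above, and the helper `split_relations` is hypothetical. It assumes the `NOT|` prefix marks negation in the same way as the NOT qualifier in Gene Ontology annotations; that reading is not confirmed in this thread. Matching on names says nothing about which relations are semantically closest to phosphorylation.

```python
# Subset of relation names copied from the list above.
relations = [
    "enables", "involved_in", "involved_in_phosphorylation", "part_of",
    "located_in", "NOT|involved_in", "NOT|enables", "NOT|part_of",
    "NOT|involved_in_tRNA_processing", "NOT|located_in",
]

def split_relations(names, prefix="NOT|"):
    """Split relation names into (positive, negated) by a negation prefix."""
    negated = [n for n in names if n.startswith(prefix)]
    positive = [n for n in names if not n.startswith(prefix)]
    return positive, negated

positive, negated = split_relations(relations)
phospho = [n for n in relations if "phosphorylation" in n]

print("negated:", negated)
print("mentions phosphorylation:", phospho)
```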