NREL/EvoProtGrad

is it possible to get the importance score of the protein sequence?

anonimoustt opened this issue · 9 comments

I was wondering whether it is possible to get an importance score for a protein sequence using the EvoProtGrad model. For instance, the https://huggingface.co/datasets/waylandy/phosformer_curated dataset contains kinase enzymes, and I want to rank them by importance score.

Furthermore, I found in the notebook (https://colab.research.google.com/drive/1e8WjYEbWiikRQg3g4YHQJJcpvTIWVAjp?usp=sharing) that scores are generated for different variants of a protein sequence. But what is the score of the original protein sequence? If the original sequence's score can be measured, can it be compared with the scores of the variants?

Hi, sorry for the delay in getting back to you!

The score of the original protein sequence (i.e., the wild type sequence specified via the wt_fasta or wt_protein arguments of the DirectedEvolution sampler class) is stored in the wt_score attribute within each expert. Each expert uses this wt_score to compute the relative score of a variant with respect to the wild type.

As to getting importance scores of each variant, the DirectedEvolution sampler will return both the list of variants and their corresponding scores as a tuple. You can see in the demo notebook--when the output argument is set to "all", the scores tensor will have shape [parallel_chains, steps], and it's up to you to decide whether to grab the last score for each variant (scores[:,-1]) or the best, etc.
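As a minimal sketch of that last step, the snippet below reduces a [parallel_chains, steps] score matrix to one value per chain. It uses plain nested lists with made-up values as a stand-in for the returned scores, and it assumes that a higher score is better:

```python
# Toy stand-in for the scores returned with output="all":
# one row per chain, one column per sampling step (made-up values).
scores = [
    [0.1, 0.7, 0.4],  # chain 0
    [0.2, 0.3, 0.9],  # chain 1
]

# Last score per chain; with a NumPy array or tensor this is scores[:, -1].
last_scores = [chain[-1] for chain in scores]

# Best (highest) score per chain, and the step at which it occurred.
best_scores = [max(chain) for chain in scores]
best_steps = [chain.index(max(chain)) for chain in scores]

print(last_scores)  # [0.4, 0.9]
print(best_scores)  # [0.7, 0.9]
print(best_steps)   # [1, 2]
```

The step index is useful if you also want to look up the variant that produced the best score in each chain.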

It is still not clear to me. Specifically, from this code:
variants, scores = evo_prot_grad.DirectedEvolution(
    wt_protein = wildtype_sequence,
    output = 'best',          # return best, last, all variants
    experts = [expert],       # list of experts to compose
    parallel_chains = 2,      # number of parallel chains to run
    n_steps = 100,            # number of MCMC steps per chain
    max_mutations = -1,       # maximum number of mutations per variant
    preserved_regions = None, # list of regions (start, end) to preserve
    verbose = False           # print debug info to command line
)()

wtseq = ' '.join(wildtype_sequence.strip())

for v, s in zip(variants, scores):
    evo_prot_grad.common.utils.print_variant_in_color(v, wtseq)
    print(s)

If I set output = 'all', will I then get the original sequence and its score along with the variants?

No, scores will only contain a score for each variant, even if output is set to all. Here, all refers to returning the intermediate scores of the variants at each sampling step. In this example, scores would have shape [2,100] since parallel_chains = 2 and n_steps = 100.
If having the wildtype sequence's score returned alongside the scores of each variant is useful, I can add that.

Hi,
Yes, it would be helpful if the score of the original sequence could be returned. I did not understand why scores would have shape [2,100]; I see the score as a floating-point number. Does parallel_chains = 2 mean that the top two variants by score are returned? Would you please clarify?

Also, how is the score computed? Are you taking embeddings, for example using the ESM-2 model to compute the embedding of the original sequence and of its variants, and then computing the cosine similarity?

I think it could help to spend a little time reading the documentation about what scores are in EvoProtGrad and how they are estimated: https://nrel.github.io/EvoProtGrad/getting_started/experts/#what-is-a-product-of-experts. The score in EvoProtGrad is an unnormalized log probability. However, in practice we subtract the wild type sequence's log probability from the variant's log probability, so the score is actually a difference between log probabilities.
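To illustrate the arithmetic only, here is a small sketch of a relative score computed as a difference of log probabilities. The "expert" is a hypothetical toy model with independent per-position probabilities (SITE_PROBS, log_prob, and relative_score are made-up names, not part of EvoProtGrad); a real expert such as a protein language model is far richer:

```python
import math

# Hypothetical toy "expert": independent per-position amino-acid
# probabilities. The values are made up for illustration.
SITE_PROBS = [
    {"A": 0.7, "G": 0.3},
    {"K": 0.5, "R": 0.5},
    {"L": 0.2, "V": 0.8},
]

def log_prob(seq):
    """Log probability of a sequence under the toy model."""
    return sum(math.log(SITE_PROBS[i][aa]) for i, aa in enumerate(seq))

def relative_score(variant, wild_type):
    """Variant log probability minus wild type log probability."""
    return log_prob(variant) - log_prob(wild_type)

wild_type = "AKL"
for variant in ["AKL", "AKV", "GKL"]:
    print(variant, round(relative_score(variant, wild_type), 3))
```

Under this definition the wild type scores exactly 0 relative to itself, a positive score means the model assigns the variant a higher probability than the wild type, and a negative score means a lower one. The score is therefore not a similarity (such as a cosine similarity between embeddings) to the original sequence.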

The shape of the scores tensor will vary depending on what you set the argument output to. If output = best or output = last, that means for each of the parallel_chains Markov chains, either the best/last (respectively) variants will be returned. Hence, scores has shape [parallel_chains]. When output = all, this means every variant produced by each Markov chain at each step 1..n_steps will be returned, hence scores has shape [parallel_chains, n_steps]. This is useful when entire distributions of "good" variants are desired instead of just point estimates of "good" variants.
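As a rough sketch of working with the full output = all results rather than point estimates, the snippet below pools every chain and step and keeps the highest-scoring distinct variants. The data are made up, top_unique is a hypothetical helper, and the layout (variants and scores aligned and indexed as [chain][step]) is an assumption about how you have arranged the returned values:

```python
# Toy stand-ins for output="all" results (made-up values).
# Assumption: variants and scores are aligned, indexed as [chain][step].
variants = [
    ["AKL", "AKV", "AKV"],  # chain 0
    ["AKL", "GKL", "ARV"],  # chain 1
]
scores = [
    [0.0, 1.4, 1.4],
    [0.0, -0.8, 1.1],
]

def top_unique(variants, scores, k):
    """Return the k highest-scoring distinct variants over all chains and steps."""
    best = {}
    for chain_variants, chain_scores in zip(variants, scores):
        for v, s in zip(chain_variants, chain_scores):
            # Keep the best score seen for each distinct sequence.
            if v not in best or s > best[v]:
                best[v] = s
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]

print(top_unique(variants, scores, k=2))  # [('AKV', 1.4), ('ARV', 1.1)]
```

Deduplicating matters because a chain can stay on the same variant for several consecutive steps, so the same sequence may appear many times in the pooled results.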

Thanks. EvoProtGrad is really interesting. I am working on kinase domain sequences (https://huggingface.co/datasets/waylandy/phosformer_curated/raw/main/curated/phosphosites_11mer_kinase_specific.tsv). EvoProtGrad might be an interesting tool for generating variants of a kinase sequence for analysis.

Hi, one more query: can EvoProtGrad be used to detect a significant connection between two protein sequences? Say I have two sequences, protein 1 and protein 2, and I use EvoProtGrad to get the top 3 variants of each. If I then compute similarity scores between the variants, is it possible to get the relational significance of protein 1 and protein 2?

Hi,

I see that if parallel_chains = 5, I get 5 variants and their corresponding scores. Does a higher score mean that the variant is closer to the original sequence?

Accessing a particular expert's score for a variant sequence is now easier in v0.2 (https://github.com/NREL/EvoProtGrad/releases/tag/v0.2). You can now call get_model_output with an expert to get that expert's score (https://nrel.github.io/EvoProtGrad/api/experts/).