salesforce/progen

Sampling conditional token distribution

aiXander opened this issue · 1 comment

It would be super valuable to have an example script to sample conditional token probabilities for a target index given sequence context.

There seem to be some technical details that are important, but not easy to figure out:

Finally, I'm currently evaluating mutations by computing the sequence likelihood of each possible mutated sequence in turn, which takes 20 forward passes per mutated position (one per amino acid). This seems vastly inefficient, since the model already produces logits for every position. Can the logits at the target index simply be used as a proxy for the token probabilities?
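For reference, the brute-force scan described above can be sketched as follows. This is a minimal illustration, not code from the repository: `scan_position` and the `log_likelihood` callable are hypothetical names, and `log_likelihood` stands in for whatever wraps one forward pass of the model and returns a sequence log-likelihood.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def scan_position(sequence, position, log_likelihood):
    """Score every substitution at `position` (0-based) with a full-sequence scorer.

    `log_likelihood` is a hypothetical callable (str -> float). In practice it
    would run the model once on the mutated sequence, so this loop costs one
    forward pass per candidate residue, 20 in total.
    """
    scores = {}
    for aa in AMINO_ACIDS:
        mutant = sequence[:position] + aa + sequence[position + 1:]
        scores[aa] = log_likelihood(mutant)
    return scores
```

Each returned score is a likelihood of the whole mutated sequence, so residues after the mutated position also contribute to it.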

a-mad commented

a couple notes:

  • we excluded the non-amino-acid tokens when scoring sequences, as they aren't relevant for variant prediction. it doesn't have a large effect, however
  • the LMs released currently are traditional autoregressive decoders, either left-to-right or right-to-left. there are ways to perform inpainting, but they would require restructuring/retraining
  • you can use the logits (or the averaged logits from the N->C and C->N directions) for a target index, but i've never validated this myself. it would be approximate and would not fully use the remaining context of the protein, which may be critical
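A rough sketch of the single-pass approach from the notes above, in plain Python so it runs without the model. Everything here is an assumption rather than the repository's API: the function names are made up, `logits` is taken to be a list of per-position rows from one forward pass, and row `t` is assumed to hold the next-token logits after the model has read tokens `0..t`. The amino-acid token ids depend on the tokenizer and have to be supplied by the caller. As noted, this is unvalidated and approximate: a left-to-right row only conditions on the residues before the target.

```python
import math

def restricted_log_softmax(row, allowed_ids):
    """Log-softmax over `allowed_ids` only; every other token is ignored."""
    peak = max(row[i] for i in allowed_ids)  # subtract the max for stability
    log_z = peak + math.log(sum(math.exp(row[i] - peak) for i in allowed_ids))
    return {i: row[i] - log_z for i in allowed_ids}

def conditional_log_probs(logits, target_index, aa_token_ids):
    """Log-probabilities of each amino-acid token at `target_index`.

    ASSUMPTION: logits[t] predicts token t + 1, so the row that predicts
    `target_index` is `target_index - 1`. Any special start token has to be
    counted when computing the index.
    """
    if not 1 <= target_index <= len(logits):
        raise ValueError("target_index needs at least one token of left context")
    return restricted_log_softmax(logits[target_index - 1], aa_token_ids)

def averaged_log_probs(forward_row, reverse_row, aa_token_ids):
    """Average an N->C row and a C->N row elementwise, then renormalise.

    The caller is responsible for picking the reverse-direction row that
    predicts the same residue.
    """
    mean_row = [(f + r) / 2.0 for f, r in zip(forward_row, reverse_row)]
    return restricted_log_softmax(mean_row, aa_token_ids)
```

Restricting the softmax to `aa_token_ids` mirrors the exclusion of non-amino-acid tokens mentioned in the first note. One forward pass then yields scores for all 20 substitutions at a position, at the cost of ignoring the context on the far side of the target.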