Sampling conditional token distribution
aiXander opened this issue · 1 comments
It would be super valuable to have an example script to sample conditional token probabilities for a target index given sequence context.
There seem to be some technical details that are important, but not easy to figure out:
-
or whether this LM always has to work in a causal left-to-right manner, or whether it can also be used to do "inpainting" of residues in the middle of a sequence...
Finally, the way I'm currently evaluating mutations is by sequentially computing the sequence likelihood of each possible mutated sequence, which takes 20 forward passes per single point mutation. This seems vastly inefficient, since the model already produces logits for every position. Can the logits at the target index simply be used as a proxy for the token probability?
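For reference, the exhaustive approach described above can be sketched roughly as follows. This is only a minimal illustration: `score_point_mutations` and `toy_log_likelihood` are hypothetical names, and `toy_log_likelihood` is a placeholder standing in for one forward pass of the real model, not anything from this repository.

```python
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def score_point_mutations(
    sequence: str,
    position: int,
    log_likelihood: Callable[[str], float],
) -> Dict[str, float]:
    """Score every substitution at `position` (0-indexed) against the wild type.

    `log_likelihood` is assumed to return the total log-likelihood of a
    sequence, i.e. one forward pass of the model per call.
    """
    wild_type_score = log_likelihood(sequence)
    scores = {}
    for aa in AMINO_ACIDS:
        # One full forward pass per candidate residue: 20 per position.
        mutant = sequence[:position] + aa + sequence[position + 1:]
        scores[aa] = log_likelihood(mutant) - wild_type_score
    return scores


def toy_log_likelihood(seq: str) -> float:
    """Placeholder scorer (NOT a real model), used only to make this runnable."""
    return -float(sum(AMINO_ACIDS.index(c) for c in seq))


scores = score_point_mutations("ACDE", 1, toy_log_likelihood)
```

Each entry in `scores` is a log-likelihood ratio relative to the wild type, so the wild-type residue itself scores 0.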
a couple notes:
- we excluded the non-amino acid tokens when scoring sequences, as they aren't relevant for variant prediction. it doesn't have a large effect, however
- the currently released LMs are traditional autoregressive decoders, either left-to-right or right-to-left. there are ways to perform inpainting, but they would require restructuring/retraining
- you can use the logits (or averaged logits from the N->C and C->N directions) for a target index, but i've never validated this myself. it would be approximate and would not fully use the remaining context of the protein, which may be critical
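A rough sketch of what the last two notes might look like in practice, using plain NumPy arrays in place of real model output. Everything here is an assumption for illustration: the function names are hypothetical, `aa_token_ids` stands for whatever vocabulary indices the tokenizer assigns to the 20 amino acids, `fwd_logits[t]` is assumed to be the row that scores residue `t` given the residues before it (after accounting for any start token), and `rev_logits` is assumed to come from running the model on the reversed sequence with the same convention.

```python
import numpy as np


def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over a 1-D array."""
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))


def residue_log_probs(logits: np.ndarray, target: int, aa_token_ids: np.ndarray) -> np.ndarray:
    """Log-probabilities over amino acids at `target` from one direction.

    logits: array of shape (seq_len, vocab_size); row t is assumed to score
    residue t. Non-amino-acid tokens are dropped before normalizing.
    """
    return log_softmax(logits[target, aa_token_ids])


def bidirectional_log_probs(
    fwd_logits: np.ndarray,
    rev_logits: np.ndarray,
    target: int,
    aa_token_ids: np.ndarray,
) -> np.ndarray:
    """Average N->C and C->N logits at `target`, then normalize."""
    seq_len = fwd_logits.shape[0]
    fwd = fwd_logits[target, aa_token_ids]
    # Residue `target` sits at index seq_len - 1 - target in the reversed sequence.
    rev = rev_logits[seq_len - 1 - target, aa_token_ids]
    return log_softmax((fwd + rev) / 2.0)


# Random stand-in logits; a real run would take these from the model.
rng = np.random.default_rng(0)
fwd_logits = rng.normal(size=(6, 32))
rev_logits = rng.normal(size=(6, 32))
aa_token_ids = np.arange(5, 25)  # hypothetical vocabulary indices

one_way = residue_log_probs(fwd_logits, 2, aa_token_ids)
two_way = bidirectional_log_probs(fwd_logits, rev_logits, 2, aa_token_ids)
```

This needs one forward pass per direction instead of 20 per position, but the forward logits only condition on residues before the target and the reverse logits only on residues after it, so the result is an approximation rather than a true conditional given the full context.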