Sampling conditional token distribution

Question

aiXander opened this issue 3 years ago · 1 comments

It would be super valuable to have an example script to sample conditional token probabilities for a target index given sequence context.

There seem to be some technical details that are important, but not easy to figure out:

e.g. not all tokens actually being used,
or whether this LM always has to work in a causal, left-to-right manner, or whether it can also be used for "inpainting" residues in the middle of a sequence...

Finally, I'm currently evaluating mutations by computing the sequence likelihood of each possible mutated sequence in turn, so this takes 20 forward passes per single point mutation. That seems vastly inefficient: since the model produces logits for every position, can the logits at the target index simply be used as a proxy for token probability?
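For reference, the brute-force approach described above can be sketched roughly as follows. This is only an illustration: `toy_log_likelihood` is a hypothetical stand-in for a real model call (one forward pass per mutant), and `scan_position` is a name made up for this example, not part of any released API. Normalising the 20 sequence likelihoods gives the conditional distribution over the residue at `pos` given all other residues.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def scan_position(seq: str, pos: int,
                  log_likelihood: Callable[[str], float]) -> Dict[str, float]:
    """Return p(residue at pos | all other residues), one scoring call per residue."""
    scores = {}
    for aa in AMINO_ACIDS:
        mutant = seq[:pos] + aa + seq[pos + 1:]
        scores[aa] = log_likelihood(mutant)  # in practice: one forward pass each
    # Normalise in log space: p(a | rest) = p(mutant_a) / sum_b p(mutant_b)
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {aa: math.exp(s - top) / total for aa, s in scores.items()}


def toy_log_likelihood(seq: str) -> float:
    """Hypothetical stand-in for a real model: penalises repeated neighbours."""
    return -sum(1.0 for a, b in zip(seq, seq[1:]) if a == b)


probs = scan_position("MKKLV", pos=2, log_likelihood=toy_log_likelihood)
```

The cost is 20 scoring calls per position, which is the inefficiency in question.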

Answer 1 · 2022-07-20T21:01:17.000Z

A couple of notes:

We excluded the non-amino-acid tokens when scoring sequences, as they aren't relevant for variant prediction. It doesn't have that large an effect, however.
The LMs currently released are traditional autoregressive decoders, running either left-to-right or right-to-left. There are ways to perform inpainting, but they would require restructuring/retraining.
You can use the logits (or the averaged logits from the N->C and C->N directions) for a target index, but I've never validated this myself. It would be approximate and would not fully use the remaining context of the protein, which may be critical.
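A minimal sketch of the last point, under several assumptions that are not from the released code: the logits arrive as `(length, vocab)` NumPy arrays with no special tokens prepended, row `t` holds the prediction for token `t + 1` (the usual causal-LM convention), the C->N logits come from running on the reversed sequence, and `aa_ids` is a hypothetical list of the vocabulary indices of the 20 amino acids. As noted above, this is unvalidated and approximate.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def target_distribution(fwd_logits: np.ndarray, rev_logits: np.ndarray,
                        i: int, aa_ids: np.ndarray) -> np.ndarray:
    """Approximate distribution over amino acids at 0-based position i."""
    length = fwd_logits.shape[0]
    if not 1 <= i <= length - 2:
        raise ValueError("position needs context on both sides")
    # N->C: row i - 1 predicts token i from the prefix only.
    fwd = fwd_logits[i - 1, aa_ids]
    # C->N: token i sits at index length - 1 - i of the reversed sequence,
    # so it is predicted by the row before it, from the suffix only.
    rev = rev_logits[length - 2 - i, aa_ids]
    # Average the two directions, keeping only amino-acid tokens.
    return softmax((fwd + rev) / 2.0)


# Random logits stand in for real model output (assumption for illustration).
rng = np.random.default_rng(0)
fwd_logits = rng.normal(size=(6, 25))
rev_logits = rng.normal(size=(6, 25))
aa_ids = np.arange(20)  # hypothetical: first 20 vocab entries are amino acids
probs = target_distribution(fwd_logits, rev_logits, i=3, aa_ids=aa_ids)
```

Each direction only sees one side of the target position, so the averaged result is not the same quantity as the one obtained by scoring every full mutated sequence.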