KeyError: '!' during conditional generation
cmwilson252 opened this issue · 5 comments
Hello, I am getting a KeyError when running the conditional generation code, copied one-for-one from the example notebook:
https://github.com/microsoft/evodiff/blob/main/examples/evodiff.ipynb
from evodiff.pretrained import MSA_OA_DM_MAXSUB
from evodiff.generate_msa import generate_query_oadm_msa_simple
import re
checkpoint = MSA_OA_DM_MAXSUB()
model, collater, tokenizer, scheme = checkpoint
path_to_msa = 'bfd_uniclust_hits.a3m'
n_sequences=64 # number of sequences in MSA to subsample
seq_length=256 # maximum sequence length to subsample
selection_type='random' # or 'MaxHamming'; MSA subsampling scheme
tokeinzed_sample, generated_sequence = generate_query_oadm_msa_simple(path_to_msa, model, tokenizer, n_sequences, seq_length, device='cpu', selection_type=selection_type)
print("New sequence (no gaps, pad tokens)", re.sub('[!-]', '', generated_sequence[0][0],))
The error can be traced back to:
evodiff/utils.py, line 247, in
return np.array([self.a_to_i[a] for a in seq[0]]) # for nested lists
The alphabet does not seem to handle !, which should be the padding token. This alphabet appears to be imported from sequence_models.constants as MSA_ALPHABET.
Also, much less important: I noticed there are three instances of "tokeinzed_sample" as a variable name in the example notebook that are almost certainly meant to be "tokenized_sample".
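For anyone debugging the same traceback, here is a minimal sketch of how a dictionary-based alphabet lookup like the one in the traceback fails on a character it does not contain. The `toy_alphabet` string and the `find_unknown_chars` helper are assumptions made up for illustration; they are not the real MSA_ALPHABET or part of EvoDiff.

```python
def find_unknown_chars(seq, alphabet):
    """Return the characters in seq that have no index in the alphabet."""
    # Same kind of mapping as the a_to_i lookup shown in the traceback.
    a_to_i = {a: i for i, a in enumerate(alphabet)}
    return sorted({a for a in seq if a not in a_to_i})


# Assumption: a toy alphabet of 20 amino acids plus the gap character.
# The real MSA_ALPHABET may differ.
toy_alphabet = "ACDEFGHIKLMNPQRSTVWY-"

unknown = find_unknown_chars("MKT-AY!!", toy_alphabet)
print(unknown)  # ['!']
```

Running the same check with the alphabet your tokenizer actually uses should show whether ! is really missing from it.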
If you're struggling to install EvoDiff locally, feel free to try https://www.tamarind.bio/evodiff, a website which offers a no-code interface for bioinformatics tools including protein design with EvoDiff for free.
@cmwilson252 did you end up finding a solution? I am experiencing the same problem now.
Note: this is fixed by reducing n_sequences to a number <= the number of sequences in your MSA.
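A minimal sketch of that workaround, assuming each sequence in the .a3m file starts with a ">" header line. The helper names `count_a3m_sequences` and `clamp_n_sequences` are hypothetical, not EvoDiff functions, and the thread does not say whether the query row counts toward the limit.

```python
import os
import tempfile


def count_a3m_sequences(path):
    """Count records in an A3M/FASTA-style file (one '>' header per record)."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def clamp_n_sequences(requested, path):
    """Return requested, capped at the number of sequences found in the file."""
    available = count_a3m_sequences(path)
    if available == 0:
        raise ValueError(f"No sequences found in {path}")
    return min(requested, available)


# Demo on a tiny throwaway file; replace demo_path with your own .a3m path.
with tempfile.NamedTemporaryFile("w", suffix=".a3m", delete=False) as tmp:
    tmp.write(">query\nMKTAYIAK\n>hit1\nMKT-YIAK\n>hit2\nMKTAYLAK\n")
    demo_path = tmp.name

n_sequences = clamp_n_sequences(64, demo_path)
print(n_sequences)  # 3
os.remove(demo_path)
```

The clamped value can then be passed as n_sequences to generate_query_oadm_msa_simple in place of the hard-coded 64.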
So does the input have to be an .a3m file?
Yes, it must be an .a3m file!