snap-stanford/SATURN

question about the input sequences for esm


Hi! Thank you for your nice tool!

I noticed that before running the ESM model, you remove the sequences that contain '*' (stop codons). May I ask why all such sequences need to be removed? Is this step necessary?

Thanks!

I believe these sequences were removed because the '*' character is not in the ESM vocabulary. For the Ensembl proteomes, I think the stop codon is implied at the end of the sequence.

I am not sure, but I don't think it is necessary.
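
If you would rather keep those proteins than drop them, one option is to strip a trailing '*' and discard only the sequences that still contain an internal '*'. The snippet below is a minimal sketch of that idea and not the preprocessing SATURN itself uses; the function names `strip_terminal_stop` and `clean_sequences` and the example records are hypothetical, and it assumes the sequences are already loaded as a dict of name to amino-acid string.

```python
def strip_terminal_stop(seq: str) -> str:
    """Remove trailing '*' characters (translated stop codons) from a protein sequence."""
    return seq.rstrip("*")


def clean_sequences(records: dict[str, str]) -> dict[str, str]:
    """Strip terminal '*' and drop sequences that are empty or still contain '*'.

    `records` maps a (hypothetical) protein name to its amino-acid string.
    """
    cleaned = {}
    for name, seq in records.items():
        seq = strip_terminal_stop(seq.strip().upper())
        if not seq or "*" in seq:
            # An internal '*' suggests a premature stop or a pseudogene-like
            # entry; dropping it here is an assumption, not a requirement.
            continue
        cleaned[name] = seq
    return cleaned


if __name__ == "__main__":
    example = {
        "protA": "MKTAYIAKQR",   # no stop character, kept unchanged
        "protB": "MKTAYIAKQR*",  # terminal stop, kept after stripping
        "protC": "MKT*AYIAKQR",  # internal stop, dropped
    }
    print(clean_sequences(example))
```

Whether stripping or dropping is the better choice probably depends on how your proteome was generated, so it is worth checking how many sequences each rule affects before embedding them.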