oxpig/CaLM

AlphaBet


Hi,

Thank you for your work.

I noticed the model is trained on cDNA data, while the tokeniser seems to use an RNA vocabulary, i.e. U rather than T (https://github.com/oxpig/CaLM/blob/main/calm/alphabet.py).

Can you please clarify the data preprocessing pipeline?

Hi,

I think you can look at https://github.com/oxpig/CaLM/blob/main/calm/sequence.py. That script defines a class, CodonSequence, which replaces 'T' with 'U', so cDNA input is converted to the RNA alphabet before tokenisation.
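To illustrate the idea, here is a minimal sketch of that kind of preprocessing. It is not the actual code in calm/sequence.py; the function name `cdna_to_rna_codons` is hypothetical, and the upper-casing and length check are my own assumptions rather than something confirmed from the repository.

```python
def cdna_to_rna_codons(seq: str) -> list[str]:
    """Convert a cDNA string to a list of RNA codons (hypothetical helper)."""
    # Normalise case and swap thymine for uracil, as CodonSequence is said to do.
    rna = seq.strip().upper().replace("T", "U")
    # Assumption: the sequence is a complete coding sequence, so its
    # length must be a multiple of three.
    if len(rna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    # Split into non-overlapping triplets.
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


print(cdna_to_rna_codons("ATGGCTTAA"))  # ['AUG', 'GCU', 'UAA']
```

With a step like this, a cDNA sequence ends up as codons over A/U/G/C, which would match the RNA vocabulary in alphabet.py.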