AlphaBet
ZhiyuanChen commented
Hi,
Thank you for your work.
I noticed the model is trained on cDNA data, while the tokeniser seems to use an RNA vocabulary (https://github.com/oxpig/CaLM/blob/main/calm/alphabet.py).
Can you please clarify the data preprocessing pipeline?
Cassie818 commented
Hi,
I think you can look at the script https://github.com/oxpig/CaLM/blob/main/calm/sequence.py. It defines a class, CodonSequence, that replaces 'T' with 'U'.
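For illustration, here is a minimal sketch of that kind of preprocessing. This is not the actual CaLM code: the function name `cdna_to_rna_codons` is made up for this example, and the upper-casing and the length check are my own assumptions about what a reasonable pipeline might do.

```python
from typing import List


def cdna_to_rna_codons(seq: str) -> List[str]:
    """Hypothetical helper (not part of CaLM): map a cDNA string onto
    the RNA alphabet and split it into codons."""
    # Upper-case first (an assumption), then swap thymine for uracil.
    rna = seq.upper().replace("T", "U")
    # Assumption: only whole codons are accepted.
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    # Split into non-overlapping triplets.
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


print(cdna_to_rna_codons("ATGGCTTAA"))  # ['AUG', 'GCU', 'UAA']
```

If the conversion really happens when the sequence object is built, as the script suggests, then cDNA input would reach the RNA vocabulary already converted, but please check sequence.py itself for the exact behaviour.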