THUDM/ProteinLM

Format of the sequence JSON file: which one is correct?

usccolumbia opened this issue · 1 comment

Which format should the sequence JSON file use? Do we need to add spaces between amino acids?

The examples in the repository differ:
https://github.com/THUDM/ProteinLM/tree/main/pretrain
{"text": "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"}
{"text": "RTIKVRILHAIGFEGGLMLLTIPMVAYAMDMTLFQAILLDLSMTTCILVYTFIFQWCYDILENR"}

https://github.com/THUDM/ProteinLM/tree/main/pretrain/protein_tools
{"text": "G C T V E D R C L I G M G A I L L N G C V I G S G S L V A A G A L I T Q "}
{"text": "A D G I N L E I P R G E W I S V I G G N G S G K S T F L K S L I R L E A V K K G R I Y L E G R E L K K W S D R T L Y E K A G F V F Q N P E L Q F I R D T V F D E I A F G A R Q R S W P E E Q V E R K T A E L L Q E F G L D G H Q K A H P F T L S L G Q K R R L S V A T M L L F D Q D L L L L D E P T F "}

Hi @usccolumbia,

The second one (with spaces between amino acids) is correct. I have fixed it in #11.

Thanks for your issue!


Best,
Yijia
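
For anyone who already has sequences in the unspaced form, here is a minimal sketch of converting them to the space-separated form shown in the second example. The helper name `to_spaced_record` is hypothetical and not part of ProteinLM. It reproduces the trailing space seen in the example above; whether that trailing space is actually required is an assumption.

```python
import json


def to_spaced_record(sequence: str) -> str:
    """Return one JSON line of the form {"text": "G C T ... "}.

    Hypothetical helper, not part of ProteinLM. A space follows every
    residue, including the last, to mirror the example in this thread.
    """
    # Drop any whitespace already present so the input can be in either form.
    residues = "".join(sequence.split())
    spaced = "".join(f"{aa} " for aa in residues)
    return json.dumps({"text": spaced})


if __name__ == "__main__":
    # Sequence taken from the first example in this issue.
    print(to_spaced_record("GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"))
```

Writing one such line per sequence gives a file in the same line-per-record layout as the examples above.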