Tokenising multi-chain proteins
Closed this issue · 2 comments
exs-hkenlay commented
Thank you for this work and open sourcing these models.
I have a question about how you pre-processed proteins with multiple chains when preparing training data. Given a protein with multiple chains, did you treat each chain as a separate input, or did you use a separator token (e.g. [SEP] in BERT and </s> in T5) to mark the different chains on the same line?
mheinzinger commented
Sorry to have bad news for you: our model saw only "single chain" proteins, as we simply took protein sequences from UniProt/UniRef and BFD/metagenomic_data.
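Since training used single-chain sequences only, one option that stays close to the training setup is to pass each chain as its own input instead of joining chains with a separator token. Below is a minimal sketch of that idea, not a confirmed recipe. The function name `prepare_chains`, the chain IDs, and the example sequences are hypothetical. Mapping the rare residues U, Z, O, B to X and separating residues with spaces are assumptions about the expected input format and are not stated in this thread.

```python
import re


def prepare_chains(chains):
    """Turn a {chain_id: sequence} mapping into one model input per chain.

    Assumption: each chain is handled as an independent single-chain
    sequence, mirroring the single-chain data the models were trained on.
    """
    prepared = {}
    for chain_id, seq in chains.items():
        # Assumption: rare/ambiguous residues are mapped to X.
        seq = re.sub(r"[UZOB]", "X", seq.upper())
        # Assumption: residues are separated by single spaces.
        prepared[chain_id] = " ".join(seq)
    return prepared


# Hypothetical two-chain protein; the sequences are made up for illustration.
example = {"A": "MKTUY", "B": "GSHZ"}
inputs = prepare_chains(example)
for chain_id, text in inputs.items():
    print(chain_id, text)
```

Each resulting string can then be tokenised and embedded on its own. How to combine the per-chain embeddings afterwards (for example by concatenating or averaging them) is a downstream design choice that this thread does not cover.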
exs-hkenlay commented
Thanks for clarifying @mheinzinger