Tokenising multi-chain proteins

Question

Closed this issue a year ago · 2 comments

Thank you for this work and open sourcing these models.

I have a question about how you pre-processed proteins with multiple chains when preparing the training data. Given a protein with multiple chains, did you treat each chain as a separate input, or did you use a separator token (e.g. [SEP] in BERT or </s> in T5) to mark the different chains on the same line?

Answer 1 · 2023-10-18T13:06:37.000Z

Sorry to have bad news for you: our model saw only "single chain" proteins, as we simply took protein sequences from UniProt/UniRef and BFD/metagenomic_data.
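
Since the model saw only single-chain sequences, one way to stay close to its training distribution is to pass each chain as its own input rather than joining chains with a separator token. Below is a minimal sketch of that idea. It assumes a preprocessing convention of space-separated residues with rare amino acids (U, Z, O, B) mapped to X, which this thread does not confirm; `prepare_chains` and the example sequences are hypothetical.

```python
import re


def prepare_chains(chains):
    """Turn {chain_id: sequence} into one preprocessed input string per chain.

    Assumed convention (not confirmed in this thread): residues are
    space-separated and rare amino acids U, Z, O, B are replaced by X.
    """
    prepared = {}
    for chain_id, seq in chains.items():
        seq = seq.upper()
        # Map rare/ambiguous residues to the unknown residue X.
        seq = re.sub(r"[UZOB]", "X", seq)
        # Space-separate residues so each one becomes its own token.
        prepared[chain_id] = " ".join(seq)
    return prepared


# Hypothetical two-chain complex; each chain becomes an independent input.
complex_chains = {"A": "MKTAYIAK", "B": "GSHMUZ"}
inputs = prepare_chains(complex_chains)
for chain_id, text in inputs.items():
    print(chain_id, text)
```

Each prepared string would then be tokenised and embedded independently; how to combine the per-chain outputs depends on the downstream task.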

Answer 2 · 2023-10-18T13:09:48.000Z

Thanks for clarifying @mheinzinger