agemagician/ProtTrans

Tokenising multi-chain proteins


Thank you for this work and open sourcing these models.

I have a question about how you pre-processed proteins with multiple chains when preparing the training data. Given a protein with multiple chains, did you treat each chain as a separate input, or did you use the separator token (e.g. [SEP] in BERT and </s> in T5) to mark different chains on the same line?

Sorry to have bad news for you: our models saw only "single chain" proteins, as we simply took protein sequences from UniProt/UniRef and BFD/metagenomic data.
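
Since pre-training used only single-chain sequences, one option is to pass each chain of a complex as its own input instead of joining chains with a separator token. Below is a minimal sketch of that idea, assuming the input formatting used in the ProtTrans README examples (residues separated by spaces, the rare residues U, Z, O and B mapped to X). The helper `prepare_chains` and the example sequences are hypothetical and only illustrate the formatting step, not a tested recipe for multi-chain proteins.

```python
import re

def prepare_chains(chains):
    """Format every chain as a separate model input (one sequence per chain).

    `chains` maps a chain ID to its amino-acid sequence. This is a
    hypothetical helper, not part of the ProtTrans code base.
    """
    prepared = {}
    for chain_id, seq in chains.items():
        # Map rare/ambiguous residues to X, as in the README examples.
        seq = re.sub(r"[UZOB]", "X", seq.upper())
        # Separate residues with spaces so each one becomes its own token.
        prepared[chain_id] = " ".join(seq)
    return prepared

# Made-up two-chain example; each value would be tokenised and embedded on its own.
chains = {"A": "MKTAYIAK", "B": "GSHMUZ"}
inputs = prepare_chains(chains)
for chain_id, text in inputs.items():
    print(chain_id, text)
```

Each formatted string can then be given to the tokenizer as an independent sequence, which matches the single-chain setting described above.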

Thanks for clarifying @mheinzinger