agemagician/ProtTrans

Sequence processing


Hi there!

I have a quick question regarding sequence tokenization.

If I am tokenizing a sequence, is it necessary to convert U, Z, and O to X, as done in
https://github.com/agemagician/ProtTrans/blob/master/Embedding/prott5_embedder.py#L90

Thank you,
Sharmi

Hi :)
No, it does not have to be done. However, those tokens are then either mapped to the unknown token or, for the models/tokenizers that still keep them in the vocabulary, produce embeddings that are not very meaningful, given how rarely the model encountered them during training.
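
For reference, a minimal sketch of that preprocessing step before tokenization, assuming the Rostlab/prot_t5_xl_uniref50 checkpoint from the Hugging Face hub (the regex and residue spacing mirror what the linked embedder script does; the sequence and model name are just illustrative):

```python
import re
from transformers import T5Tokenizer

# Assumption: ProtT5-XL UniRef50 checkpoint; swap in whichever ProtT5 variant you use
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy example sequence
seq = re.sub(r"[UZOB]", "X", seq)            # optional: map rare amino acids to X
seq_spaced = " ".join(list(seq))             # ProtT5 expects space-separated residues
encoding = tokenizer(seq_spaced, add_special_tokens=True, return_tensors="pt")
```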

Gotcha thank you!

One more question (because I am here!): do you have a version of ProtT5 fine-tuned for secondary structure prediction?

I don't think I can use AutoModelForTokenClassification for T5, so I may have to create a head myself with the encoder as backbone?

Thank you!

Nope, we have no version of ProtT5 finetuned on secondary structure.
But I think you are perfectly on the right track: I would also just put a custom head on top of the encoder model and finetune from there. Minor side remark: we tried this at one point and did not improve over keeping the encoder frozen and training a small CNN on top, so plain finetuning of all parameters did not seem to be the way to go. If I had to redo this now, I would probably rather go for something like this: https://github.com/r-three/t-few/tree/master
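
To make that concrete, here is a minimal sketch of the frozen-encoder + small-CNN setup in PyTorch. The checkpoint name, filter sizes, and three-class output are assumptions for illustration, not the exact configuration we used:

```python
import torch.nn as nn
from transformers import T5EncoderModel

class SecStructHead(nn.Module):
    """Small CNN head on top of a frozen ProtT5 encoder (sketch)."""
    def __init__(self, model_name="Rostlab/prot_t5_xl_uniref50", n_classes=3):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        for p in self.encoder.parameters():   # keep the encoder frozen
            p.requires_grad = False
        hidden = self.encoder.config.d_model  # 1024 for ProtT5-XL
        self.cnn = nn.Sequential(
            nn.Conv1d(hidden, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Conv1d(32, n_classes, kernel_size=7, padding=3),
        )

    def forward(self, input_ids, attention_mask):
        # Per-residue embeddings: (batch, seq_len, hidden)
        emb = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Conv1d expects (batch, channels, seq_len); transpose back afterwards
        logits = self.cnn(emb.transpose(1, 2)).transpose(1, 2)
        return logits  # per-residue class logits
```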

One last question: in the ProtTrans paper, you showed that T5 outperformed all other models on secondary structure prediction. Was that the full encoder-decoder model, used as T5ForConditionalGeneration for token classification? I noticed a similar approach for the general T5 model here: https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing
so I wanted to double-check the results in the paper.

Thank you!

We only ever used the encoder-only model for predictive downstream tasks.
You only need the decoder if you want to derive e.g. log-odds or actually generate sequences.
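
In code terms, the distinction looks roughly like this (the checkpoint name is an assumption; any ProtT5 checkpoint loads the same way):

```python
from transformers import T5EncoderModel, T5ForConditionalGeneration

# Encoder-only: sufficient for embeddings and any predictive downstream task
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Full encoder-decoder: only needed for generation or decoder-based log-odds
full_model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")
```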