agemagician/ProtTrans

Sequence processing


Hi there!

I have a quick question regarding sequence tokenization.

If I am tokenizing a sequence, is it necessary to convert U, Z, and O to X, as done in
https://github.com/agemagician/ProtTrans/blob/master/Embedding/prott5_embedder.py#L90

Thank you,
Sharmi

Hi :)
No, it does not have to be done. However, those tokens are then either mapped to the unknown token or, for the models/tokenizers that still keep them in the vocabulary, produce embeddings that are not very meaningful, given how rarely the model encountered them during training.
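
For reference, a minimal sketch of that preprocessing step before tokenization, assuming the Rostlab/prot_t5_xl_uniref50 checkpoint from the Hugging Face hub (the regex and residue spacing mirror what the linked embedder script does; the sequence and model name are just illustrative):

```python
import re
from transformers import T5Tokenizer

# Assumption: ProtT5-XL UniRef50 checkpoint; swap in whichever ProtT5 variant you use
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy example sequence
seq = re.sub(r"[UZOB]", "X", seq)            # optional: map rare amino acids to X
seq_spaced = " ".join(list(seq))             # ProtT5 expects space-separated residues
encoding = tokenizer(seq_spaced, add_special_tokens=True, return_tensors="pt")
```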

Gotcha thank you!

One more question (because I am here!): do you have a version of ProtT5 fine-tuned for secondary structure prediction?

I don't think I can use AutoModelForTokenClassification for T5, so I may have to create a head myself with the encoder as backbone?

Thank you!

Nope, we have no version of ProtT5 finetuned on secondary structure.
But I think you are perfectly on the right track: I would also just put a custom head on top of the encoder model and finetune from there. Minor side remark: we tried this at one point and did not improve over keeping the encoder frozen and training a small CNN on top, so plain finetuning of all parameters did not seem to be the way to go. If I had to redo this now, I would probably rather go for something like this: https://github.com/r-three/t-few/tree/master
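
To make that concrete, here is a minimal sketch of the frozen-encoder + small-CNN setup in PyTorch. The checkpoint name, filter sizes, and three-class output are assumptions for illustration, not the exact configuration we used:

```python
import torch.nn as nn
from transformers import T5EncoderModel

class SecStructHead(nn.Module):
    """Small CNN head on top of a frozen ProtT5 encoder (sketch)."""
    def __init__(self, model_name="Rostlab/prot_t5_xl_uniref50", n_classes=3):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        for p in self.encoder.parameters():   # keep the encoder frozen
            p.requires_grad = False
        hidden = self.encoder.config.d_model  # 1024 for ProtT5-XL
        self.cnn = nn.Sequential(
            nn.Conv1d(hidden, 32, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Conv1d(32, n_classes, kernel_size=7, padding=3),
        )

    def forward(self, input_ids, attention_mask):
        # Per-residue embeddings: (batch, seq_len, hidden)
        emb = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Conv1d expects (batch, channels, seq_len); transpose back afterwards
        logits = self.cnn(emb.transpose(1, 2)).transpose(1, 2)
        return logits  # per-residue class logits
```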

One last question: in the ProtTrans paper, you showed that T5 outperformed all other models on secondary structure prediction. Was that the full encoder-decoder model, used as T5ForConditionalGeneration for token classification? I noticed a similar approach for the general T5 model here: https://colab.research.google.com/drive/1obr78FY_cBmWY5ODViCmzdY6O1KB65Vc?usp=sharing
so I wanted to double-check the results in the paper.

Thank you!

We only ever used the encoder-only model for predictive downstream tasks.
You only need the decoder if you want to derive e.g. log-odds or actually generate sequences.
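
In code terms, the distinction looks roughly like this (the checkpoint name is an assumption; any ProtT5 checkpoint loads the same way):

```python
from transformers import T5EncoderModel, T5ForConditionalGeneration

# Encoder-only: sufficient for embeddings and any predictive downstream task
encoder = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

# Full encoder-decoder: only needed for generation or decoder-based log-odds
full_model = T5ForConditionalGeneration.from_pretrained("Rostlab/prot_t5_xl_uniref50")
```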