Curiosity about the model's structure
deweihu96 opened this issue · 1 comment
deweihu96 commented
Dear ProLLaMA authors,
Thank you for presenting this interesting work!
I'm just curious about some parts of your model. Why did you add the "protein sequence language" to LLaMA2 instead of treating it as a translation task? For example, an encoder could take natural language as input while the decoder outputs only protein sequences, or vice versa. Or perhaps embedding fusion/alignment methods could be used, as in image generation? If you have tried these approaches, could you explain a bit about them?
Best,
Dewei
Lyu6PosHao commented
Thanks for your interest! That's a good question.
- About "translation": T5 is a representative in this context. We think autoregressive LM (e.g. LLM like Lllama2) takes some advantages over T5:
  - The pretraining task might be more effective in generative scenarios (an LLM's causal language modeling vs. T5's span masking); see the first sketch after this list.
  - Some tasks are not strictly translation from one language to another, so an LLM may be more flexible.
- About "embedding fusion":
  - As protein sequences and natural language texts are both 1D token streams, it is a concise and effective way to let the LLM learn them directly; see the second sketch after this list. This lets us focus more on the construction of the dataset, since models are data-hungry.
  - Although we think embedding fusion might be unnecessary for protein sequences, it is significant for non-1D data, such as images, protein structures, and graphs.
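
To illustrate the first point, here is a rough toy sketch of how the two pretraining objectives differ. The token ids and the sentinel value are made up for illustration; this is not code from the ProLLaMA repository:

```python
import torch

tokens = torch.tensor([5, 12, 7, 9, 3])   # a toy 1D token-id sequence
T = tokens.size(0)

# Causal LM (Llama2-style): each position attends only to itself and
# earlier positions, and is trained to predict the next token.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
next_token_targets = tokens[1:]            # shift-by-one targets

# T5-style span corruption: a contiguous span is replaced by a sentinel
# in the encoder input, and the decoder reconstructs only that span.
SENTINEL = 32000                                        # made-up sentinel token id
corrupted_input = torch.tensor([5, 12, SENTINEL, 3])    # span [7, 9] masked out
span_targets = torch.tensor([SENTINEL, 7, 9])

print(causal_mask)
print(next_token_targets)
print(corrupted_input, span_targets)
```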
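
And for the second point: because a protein sequence is just another 1D string, it can share one token stream with natural language and be trained with ordinary next-token prediction, with no extra fusion module. A toy sketch follows; the prompt wording and the character-level vocabulary are only placeholders, not ProLLaMA's actual instruction format or tokenizer:

```python
instruction = "Generate a protein of the given superfamily: "  # placeholder prompt
protein = "MAFSAEDVLKEYDRRRRMEALLLSLYYP"                       # toy amino-acid sequence

# One flat stream: natural-language text followed by the amino-acid string.
text = instruction + protein

# Placeholder character-level vocabulary standing in for a real BPE tokenizer.
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]

# Causal-LM training pairs: every position predicts the next token,
# whether that token comes from English or from the protein sequence.
pairs = list(zip(ids[:-1], ids[1:]))
print(pairs[:5])
```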
Keep in touch if you have any further questions.