ProteinLearn

ProteinLearn makes complex protein data understandable for diverse user groups, including students, drug designers, and researchers.

Motivation

AlphaFold, ESM, and other protein LLMs have made significant progress in predicting 3D structures given amino acid sequences. However, they are unable to provide human-understandable language to further explain the functionality of the protein.
ProteinChat has been proposed to help us understand protein structures through natural language. However, it still suffers from hallucination when encountering proteins not included in the training dataset.
This problem can be circumvented by introducing hallucination-free LLMs from the Hallucinations Leaderboard or by using a RAG system.

Amino acid squences together with the proteins' 3D structure are encoded together to provide comprehensive information.
The adapter is trained to form soft prompt as input for the downstream LLMs to generate explanation