I trained a RAG agent on biological research papers to take in protein sequences and output an additional 35 feature columns derived from those sequences. This project is mostly a proof of concept: to keep costs down, I ran the model on only 100 rows of the 30,000-row dataset.
To improve this project, instead of using a generalized LLM to calculate features, I would create a central management agent that takes in the list of columns to build, as recommended by the specialized biology consulting agent. The management agent would then pass each column on to the appropriate expert agent. One agent would have a calculator tool and be trained specifically on calculating molecular weight, LogP, and other computed biological features. Another agent would be trained specifically on hydrogen bonds and on the hydrophobic and hydrophilic regions of proteins, another on deriving protein shapes from sequences, and another on calculating polarity. In this project I demonstrate that this is feasible, along with a proof of concept of how to transform the agents' answers into a usable dataframe.
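The routing idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the expert "agents" are ordinary functions standing in for LLM agents, and the names `EXPERTS`, `manage`, `build_table`, and `to_csv` are hypothetical, not taken from the project code. The residue masses are standard average values and the hydropathy values follow the Kyte-Doolittle scale.

```python
import csv
import io

# Average residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water per chain for the free N- and C-termini

# Kyte-Doolittle hydropathy scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def molecular_weight(seq):
    """Calculator-style expert: average molecular weight of the chain."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def mean_hydropathy(seq):
    """Hydrophobicity expert: mean Kyte-Doolittle score over the chain."""
    return sum(HYDROPATHY[aa] for aa in seq) / len(seq)


# Hypothetical routing table: column name -> expert that owns it.
# In the proposed design each value would be a specialized LLM agent.
EXPERTS = {
    "molecular_weight": molecular_weight,
    "mean_hydropathy": mean_hydropathy,
}


def manage(sequence, requested_columns):
    """Management step: send each requested column to its expert."""
    row = {"sequence": sequence}
    for column in requested_columns:
        expert = EXPERTS.get(column)
        # Columns without a registered expert are left empty, not guessed.
        row[column] = round(expert(sequence), 4) if expert else None
    return row


def build_table(sequences, requested_columns):
    """Collect one row per sequence; the rows share the same keys."""
    return [manage(seq, requested_columns) for seq in sequences]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because every row is a dictionary with the same keys, the list returned by `build_table` can be passed directly to a dataframe constructor such as `pandas.DataFrame(rows)`. Columns that no expert claims stay empty, which makes gaps visible when comparing against the 35 target columns.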
In this project, I also convert protein sequences into 2D images of the proteins and save those images to corresponding folders.
Finally, the proposed model architecture for combining the two data sources (the LLM-engineered features and the 2D images) takes in 4 images (one for each protein in the chain), runs them through a pretrained EfficientNet model, and concatenates the outputs of these 4 convolutional heads. On the other branch, the LLM-generated dataframe is run through a multilayer neural network, and its output is concatenated with the convolutional outputs. The combined features then pass through fully connected layers to make a prediction. Because filling in the dataframe with the LLM is costly, I was only able to test the model on 100 rows of data, which was not enough for the model to learn the features I would have liked it to learn.
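A minimal PyTorch sketch of this two-branch layout is shown below. It is not the project's actual model: the small `ImageEncoder` is a stand-in for the pretrained EfficientNet backbone so the example runs without downloading weights, and the class names, layer sizes, and single-value output are all assumptions made for illustration. Only the 4 image inputs and the 35 tabular features come from the description above.

```python
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Stand-in for a pretrained EfficientNet backbone (assumption:
    a tiny CNN is used here so the sketch runs without pretrained weights)."""

    def __init__(self, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> (batch, 8, 1, 1)
            nn.Flatten(),             # -> (batch, 8)
            nn.Linear(8, embed_dim),
        )

    def forward(self, x):
        return self.net(x)


class FusionModel(nn.Module):
    """Four image branches plus one tabular branch, fused by concatenation."""

    def __init__(self, n_images=4, n_tabular=35, embed_dim=32, hidden=64):
        super().__init__()
        # One convolutional head per protein image.
        self.encoders = nn.ModuleList(
            [ImageEncoder(embed_dim) for _ in range(n_images)]
        )
        # Multilayer network for the LLM-generated feature columns.
        self.tabular = nn.Sequential(
            nn.Linear(n_tabular, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_dim),
            nn.ReLU(),
        )
        # Fully connected layers applied to the concatenated features.
        self.head = nn.Sequential(
            nn.Linear(embed_dim * (n_images + 1), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, images, features):
        # images: (batch, n_images, 3, H, W); features: (batch, n_tabular)
        embeddings = [enc(images[:, i]) for i, enc in enumerate(self.encoders)]
        embeddings.append(self.tabular(features))
        fused = torch.cat(embeddings, dim=1)  # concatenate along feature axis
        return self.head(fused)
```

A forward pass with a batch of two samples, for example `FusionModel()(torch.randn(2, 4, 3, 64, 64), torch.randn(2, 35))`, yields one prediction per sample. Note that what gets concatenated are the feature vectors each branch produces for a sample, not the layers' weights.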
With my proposed improvements to the data collection, I expect the results would have been much more consistent with PCM or HyperPCM methods of predicting protein bonds.