/proteinet-cafa5


We studied the competition context and built a first model, named ProteiNet, for this 5th edition of the CAFA competition: https://www.kaggle.com/code/henriupton/proteinet-pytorch-ems2-t5-protbert-embeddings

We then implemented various additions to build its big brother: ProteiNet v2! This new version trains three models instead of one. Each model specializes in predicting the GO terms of a single aspect among the three presented for CAFA5: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).
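As a rough illustration of the one-model-per-aspect idea, the sketch below groups annotations by aspect so that each expert model only sees its own GO terms. The row layout `(protein_id, go_term, aspect)` and the aspect codes `MFO`, `BPO` and `CCO` are assumptions made for the example, and `split_by_aspect` is a hypothetical helper, not a function from the notebooks.

```python
from collections import defaultdict

# Hypothetical annotation rows: (protein_id, go_term, aspect).
# The aspect codes "MFO", "BPO", "CCO" are an assumption for this sketch.
rows = [
    ("P1", "GO:0003674", "MFO"),
    ("P1", "GO:0008150", "BPO"),
    ("P2", "GO:0005575", "CCO"),
    ("P2", "GO:0003674", "MFO"),
]


def split_by_aspect(rows):
    """Group (protein, term) pairs by aspect: one bucket per expert model."""
    buckets = defaultdict(list)
    for protein_id, go_term, aspect in rows:
        buckets[aspect].append((protein_id, go_term))
    return dict(buckets)


per_aspect = split_by_aspect(rows)
for aspect, annotations in sorted(per_aspect.items()):
    # Each bucket would feed the training of one aspect-specific model.
    print(aspect, len(annotations))
```

Each bucket can then go through its own label generation and training loop, independently of the other two.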

The first section of ProteiNet v2 is dedicated to training the models. If you want to have a look at the inference section, follow this link: https://www.kaggle.com/code/henriupton/proteinet-aspects-experts-infer

The second section of ProteiNet v2 is dedicated to inference with the models trained in the first section. If you want to have a look at the inference section, follow this link: https://www.kaggle.com/code/henriupton/proteinet-v2-inference-notebook

Feel free to give feedback for improvement!

1. Problem Framing

1.1. What is CAFA?

CAFA stands for Critical Assessment of Functional Annotation. This Kaggle competition aims to predict the function of proteins using their amino-acid sequences and additional data. Understanding protein function is crucial for comprehending cellular processes and developing new treatments for diseases. With the abundance of genomic sequence data available, assigning accurate biological functions to proteins becomes challenging due to their multifunctionality and interactions with various partners. This competition, hosted by the Function Community of Special Interest (Function-COSI), brings together computational biologists, experimental biologists, and biocurators to improve protein function prediction through data science and machine learning approaches. The goal is to contribute to advancements in medicine, agriculture, and overall human and animal health.

1.2. What to submit?

This competition evaluates participants' predictions of Gene Ontology (GO) terms for protein sequences. The evaluation is performed on a test set of proteins that initially have no assigned functions but may accumulate experimental annotations after the submission deadline. The test set is divided into three subontologies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC).


2. General Baseline

  • Collect embedding vectors from pre-trained protein language models (T5, ProtBERT or ESM2). Sources for embedding vectors: T5, ProtBERT, ESM2.

  • Generate labels from the train_terms file: take the K most common GO terms across all proteins, then build for each protein a binary vector of length K indicating whether each of those K terms annotates the protein (1) or not (0). Here we retain K = 600.

  • Create a PyTorch Dataset class that handles protein IDs and embeddings.

  • Create a PyTorch model class for prediction: any multilabel classification architecture that turns an embedding of shape (E,) into probabilities of shape (K,) will do. Here we explore a multilayer perceptron (MLP).

  • Run cross-validation with respect to the F1 measure and tune the hyperparameters.
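The label-generation and F1 steps above can be sketched in plain Python. This is a minimal illustration rather than the notebook's code: `build_labels` and `f1_at_threshold` are hypothetical helpers, the GO identifiers are placeholders, K is set to 2 instead of 600 to keep the example readable, and the micro-averaged F1 at a fixed threshold is an assumption that may differ from the competition's official scoring.

```python
from collections import Counter


def build_labels(annotations, k):
    """Keep the k most common GO terms and build one 0/1 vector per protein.

    annotations: dict mapping protein_id -> set of GO terms (assumed layout).
    """
    counts = Counter(term for terms in annotations.values() for term in terms)
    # Sort by descending frequency, then by term so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    top_terms = [term for term, _ in ranked[:k]]
    labels = {
        protein_id: [1 if term in terms else 0 for term in top_terms]
        for protein_id, terms in annotations.items()
    }
    return top_terms, labels


def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """Micro-averaged F1 over all (protein, term) pairs at a fixed threshold."""
    tp = fp = fn = 0
    for truth_row, prob_row in zip(y_true, y_prob):
        for truth, prob in zip(truth_row, prob_row):
            predicted = prob >= threshold
            if predicted and truth:
                tp += 1
            elif predicted:
                fp += 1
            elif truth:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Placeholder annotations; real GO identifiers look like "GO:0003674".
annotations = {
    "P1": {"GO:A", "GO:B"},
    "P2": {"GO:A"},
    "P3": {"GO:A", "GO:C"},
}
top_terms, labels = build_labels(annotations, k=2)

# Made-up model outputs, one row per protein, one column per retained term.
protein_order = ["P1", "P2", "P3"]
y_true = [labels[protein_id] for protein_id in protein_order]
y_prob = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.7]]
score = f1_at_threshold(y_true, y_prob, threshold=0.5)
print(top_terms, y_true, round(score, 3))
```

In a cross-validation loop, a score like this would be computed on each held-out fold, and the threshold itself can be treated as one more hyperparameter to tune.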

2.2. ProteiNet v2: New features

Thanks to the great interest shown in the notebook dedicated to ProteiNet v1, a large number of bugs and defects have been fixed in this new version. In addition, my team (M. Sato, F. Lin and myself) tried to innovate as much as possible and to incorporate various discussion topics from the competition into ProteiNet v2. Here is a list of the most important innovations: