- We want to let the lab know which proteins are likely to have higher stability in order to achieve faster iterations.
- In order to achieve this, we can build an ML-based regressor that predicts protein stability given the amino acid sequence.
- Nevertheless, labelled protein sequence data is scarce, even more so when we are interested in stability annotations.
- With the two available datasets, Rocklin et al. 2017 and Høie et al. 2021, we want to determine the minimum amount of input data that is needed to train ML-based protein stability regressors.
- If we use a protein dataset with a higher degree of diversity, we can achieve better results with less data. Consequently, our control case will be based on randomly sampled subsets.
-
We will experiment with different subsets determined by two main sampling techniques:
- The first technique will maximize diversity. Subset size is constrained by two factors: a minimum diversity score (e.g. our subset cannot have a diversity score lower than $x$) and a maximum size given as a fraction of the original dataset (e.g. our subset cannot be larger than $|X| \cdot x$).
- The second technique will create randomly sampled subsets whose size cannot exceed a fraction of the original dataset (both samplers are sketched below).
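A minimal sketch of both samplers, assuming sequences live in a plain Python list and `diversity_score` is whichever per-subset score we settle on below; the greedy strategy here is just one possible way to enforce the minimum-diversity constraint:

```python
import random


def random_subset(dataset, max_fraction, seed=0):
    """Randomly sample a subset no larger than max_fraction of the dataset."""
    rng = random.Random(seed)
    k = int(len(dataset) * max_fraction)
    return rng.sample(dataset, k)


def diverse_subset(dataset, max_fraction, min_diversity, diversity_score):
    """Greedily grow a subset, only keeping samples that leave the subset's
    diversity score at or above min_diversity, up to max_fraction of the data."""
    k = int(len(dataset) * max_fraction)
    subset = []
    for sample in dataset:
        candidate = subset + [sample]
        # A single sequence has no pairwise diversity, so always accept the first one.
        if len(candidate) < 2 or diversity_score(candidate) >= min_diversity:
            subset.append(sample)
        if len(subset) >= k:
            break
    return subset
```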
-
Given the previous points, we need to achieve the following intermediate objectives:
- Perform EDA to gain a sense of what our data is about.
- Compute a diversity score for every sample.
- Extract pretrained embeddings as feature descriptors.
- Create new dataset files where we can store our embeddings as 1-dimensional arrays (see the storage sketch after this list).
- Write custom sampling techniques.
- Create different experiment setups.
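For the dataset-file objective above, a minimal sketch assuming we store one 1D NumPy vector per sequence in a compressed `.npz` file; the file layout and key names are placeholders:

```python
import numpy as np


def save_embeddings(path, ids, embeddings, labels):
    """Store per-sequence embeddings as a 2D array (one 1D vector per row),
    together with sequence ids and stability labels."""
    np.savez_compressed(
        path,
        ids=np.array(ids),
        embeddings=np.stack(embeddings),  # shape: (n_sequences, embedding_dim)
        labels=np.array(labels, dtype=np.float32),
    )


def load_embeddings(path):
    """Load ids, embeddings, and labels back from the .npz file."""
    data = np.load(path, allow_pickle=False)
    return data["ids"], data["embeddings"], data["labels"]
```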
-
As we have two datasets, we will also explore our model's transfer learning capabilities.
- Alignment Scores: Mathematically similar to Hamming Distance but more versatile.
- Score per different metric (length, uniqueness of amino acids).
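As an illustration of the Hamming Distance baseline (alignment scores would relax the equal-length requirement), a minimal sketch of one possible subset-level diversity score:

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    assert len(seq_a) == len(seq_b), "Hamming distance requires equal-length sequences"
    return sum(a != b for a, b in zip(seq_a, seq_b))


def mean_pairwise_hamming(sequences):
    """Average pairwise Hamming distance over a subset; one possible diversity score."""
    pairs = [(i, j) for i in range(len(sequences)) for j in range(i + 1, len(sequences))]
    if not pairs:
        return 0.0
    total = sum(hamming_distance(sequences[i], sequences[j]) for i, j in pairs)
    return total / len(pairs)
```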
- Create a batched dataloader for a given dataset. The batches must have the following form: `[(label, sequence), ...]`.
- Retrieve token embeddings for a batch of sequences.
- Reduce the token embeddings to sequence embeddings (average over sequence dimension).
- The result is one embedding per sequence.
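A sketch of the extraction and reduction steps using the `fair-esm` package; the model name and representation layer follow the public ESM-1b release, while the example batch is made up:

```python
import torch
import esm

# Load the pretrained ESM-1b model and its batch converter (tokenizer).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# One batch in the (label, sequence) form described above.
batch = [
    ("seq_0", "MKTVRQERLKSIVRILERSKEPVSGAQLA"),
    ("seq_1", "KALTARQQEVFDLIRDHISQTGMPPTRAE"),
]
labels, strs, tokens = batch_converter(batch)

# Retrieve token embeddings from the final (33rd) layer.
with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=False)
token_embeddings = out["representations"][33]  # (batch, tokens, 1280)

# Reduce token embeddings to one embedding per sequence by averaging over the
# sequence dimension, skipping the BOS token and any padding.
sequence_embeddings = [
    token_embeddings[i, 1 : len(seq) + 1].mean(0) for i, (_, seq) in enumerate(batch)
]
```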
| Shorthand | Paper | Dataset | Description | Use |
|---|---|---|---|---|
| parallel | Rocklin et al. 2017 | - | 1D protein sequences with the custom stability scores(?). | Input data to ESM-1b |
| mutagenesis | Høie et al. 2021 | - | 1D protein sequences with their ddG values annotated by Rosetta. | Input data to ESM-1b + transfer learning |
| ESM-1b | Rives et al. 2019 | UR50 | Pretrained SOTA general-purpose protein language model. Can be used to predict structure, function and other protein properties directly from individual sequences. | We pass protein sequences as input and extract embeddings that are used as feature descriptors. We attempt to train one or more models that can predict protein stability. |
| MSA Transformer | [Rao et al. 2021](https://doi.org/10.1101/2021.02.12.430858) | - | TODO | They report improved results when maximizing diversity as Hamming Distance. |