manastech/cafa5

Function to get one-hot-encoded protein sequence

Closed this issue · 0 comments

To feed protein sequences into our NN, we need to encode them in a suitable format.
The most basic way is to one-hot-encode each position as a NumPy array of 20 elements: the alphabet has 20 possible amino acids, so we put a 1 in the position corresponding to the actual amino acid and 0s in the other 19 positions.

For example, the final encoding of a protein 200 amino acids long will be an array of shape (200, 20).
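
A minimal sketch of what such a function could look like. The names `AMINO_ACIDS`, `AA_TO_INDEX` and `one_hot_encode` are hypothetical, the alphabetical column order is an arbitrary choice, and leaving non-standard residues (e.g. `X` or `U`) as all-zero rows is an assumption we would still need to agree on.

```python
import numpy as np

# The 20 standard amino acids in one-letter code.
# The column order is an arbitrary choice (assumption), it just has to stay fixed.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return an array of shape (len(sequence), 20) with one 1 per known residue."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)), dtype=np.uint8)
    for position, aa in enumerate(sequence.upper()):
        index = AA_TO_INDEX.get(aa)
        # Assumption: residues outside the 20-letter alphabet stay as all-zero rows.
        if index is not None:
            encoded[position, index] = 1
    return encoded


if __name__ == "__main__":
    example = one_hot_encode("ACDX")
    print(example.shape)
    print(example)
```

With this sketch, a sequence of 200 residues yields an array of shape (200, 20), matching the description above.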

The protein sequence is just another element within Uniprot's protein XML.
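
A rough sketch of pulling the sequence out of that XML with the standard library. It assumes the sequence lives in a `<sequence>` element that is a direct child of each `<entry>`, and it ignores XML namespaces so the exact namespace URI does not matter. The function names, the sample document and the accession `X00001` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def sequences_from_uniprot_xml(xml_text: str) -> dict:
    """Map the first accession of each entry to its amino acid sequence."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter():
        if local_name(entry.tag) != "entry":
            continue
        accession = None
        sequence = None
        # Only look at direct children of the entry (assumption: the main
        # sequence is stored there, not in nested elements).
        for child in entry:
            name = local_name(child.tag)
            if name == "accession" and accession is None:
                accession = (child.text or "").strip()
            elif name == "sequence" and child.text:
                # Remove any whitespace or line breaks inside the sequence text.
                sequence = "".join(child.text.split())
        if accession and sequence:
            result[accession] = sequence
    return result


# Made-up sample document, for illustration only.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <sequence length="5">MKV
LA</sequence>
  </entry>
</uniprot>"""

if __name__ == "__main__":
    print(sequences_from_uniprot_xml(SAMPLE_XML))
```

For large XML dumps, `ET.iterparse` would likely be a better fit than loading the whole document at once; this sketch favours readability.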

extra
Depending on how far we go in this project, we could use other encodings like the ones described here, but for now one-hot should be OK. I've tested it in my dummy model and it worked fine, but it is memory-consuming, and GPU memory is scarce and expensive :(
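
To put rough numbers on the memory cost, here is a small sketch comparing three ways of holding the same 200-residue protein in NumPy. Storing integer indices and expanding them to one-hot only when needed is just one possible mitigation (an assumption, not something decided in this issue), and the variable names are hypothetical.

```python
import numpy as np

length, alphabet = 200, 20

# One-hot stored as 32-bit floats: 4 bytes per cell.
as_float32 = np.zeros((length, alphabet), dtype=np.float32)
# One-hot stored as unsigned bytes: 1 byte per cell.
as_uint8 = np.zeros((length, alphabet), dtype=np.uint8)
# Only the amino acid index per position: 1 byte per residue.
as_indices = np.zeros(length, dtype=np.uint8)

print(as_float32.nbytes)  # 200 * 20 * 4 = 16000 bytes
print(as_uint8.nbytes)    # 200 * 20 * 1 = 4000 bytes
print(as_indices.nbytes)  # 200 * 1 = 200 bytes

# Indices can be expanded to one-hot on the fly by selecting rows of an
# identity matrix, e.g. per batch right before feeding the model.
expanded = np.eye(alphabet, dtype=np.uint8)[as_indices]
print(expanded.shape)
```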