Search Engine for Remote Homologous Proteins


Identifying proteins that share a similar structure despite having only
distantly related sequences (remote homologs) is a difficult task. To address
it, researchers have developed a range of techniques for remote homology
search. The aim of this project is to create a deep learning-based algorithm
that, given a query protein sequence, identifies up to 12 homologous proteins
with similar structures in the Protein Data Bank (PDB).

The quality of the pairing between the query and the candidate proteins is
assessed by computing the TM-score and SEQID between the query structure and
the paired PDB structure using the TMalign program (normalized by the query
sequence length). The final score is determined as follows:

        (TM-score - 0.6) + min(0.4 - SEQID, 0).

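The second term is zero when SEQID is at most 0.4 and negative otherwise, so
pairs with high sequence identity are penalized. The effectiveness of the
algorithm is evaluated by summing the scores of all query-candidate pairs.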
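
The scoring rule above can be sketched in Python as follows. This is only an
illustration: the function names pair_score and total_score are assumptions,
not part of the official evaluation code, and the TM-score and SEQID values
are assumed to have been computed already by TMalign.

```python
from typing import Iterable, Tuple


def pair_score(tm_score: float, seqid: float) -> float:
    """Score one query-candidate pair: (TM-score - 0.6) + min(0.4 - SEQID, 0)."""
    return (tm_score - 0.6) + min(0.4 - seqid, 0.0)


def total_score(pairs: Iterable[Tuple[float, float]]) -> float:
    """Sum the pair scores over (tm_score, seqid) tuples."""
    return sum(pair_score(tm, sid) for tm, sid in pairs)
```

Under this rule a candidate with a TM-score below 0.6 lowers the total, so
returning fewer than 12 candidates for a query may be preferable to padding
the list with weak matches.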


#### query.fasta ####

This input file contains 1024 protein sequences that are to be used as queries.
Your program should take the file (in the same format but with different data)
as input and return up to 12 proteins that are similar to each query sequence.
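
A minimal sketch of reading the queries is shown below. It assumes the file
follows the usual FASTA convention (a header line starting with ">" followed
by one or more sequence lines) and that the first whitespace-separated token
of the header is the query identifier; check both against the actual file.
The function name parse_fasta is hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its sequence (assumed FASTA layout)."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""  # assumed: first token is the ID
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

For example, the queries could be loaded with
`with open("query.fasta") as fh: queries = parse_fasta(fh)`.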


#### /data/pdb ####

This folder holds the Protein Data Bank of protein structures that can be
searched. The PDB files included can be accessed by a variety of Python
packages, including Graphein and BioPython.


#### tmalign.out ####

This training file contains some potentially successful protein pairs with a
TM-score greater than 0.6 and a SEQID lower than 0.4.


#### result.out ####

This example output file lists up to 12 candidate proteins for each query
protein. Your program should generate output in the same format.


#### submission requirements ####

The final program should be submitted as a Docker image (see the docker
directory).