Guys, please update this file with all the necessary information (what you've done) after you do something new in the project. This will be useful for writing the final report.
We ran a BLAST search on UniProt (https://www.uniprot.org/blast/) with our sequence (EAAYDFPGSGSSSELPLKKGDIVFISRDEPSGWSLAKLLDGSKEGWVPT) as input. We used the BLOSUM-62 substitution matrix and the UniRef90 database. The E-threshold was set to 0.01.
OUTPUT: BLAST_uniprot.fasta
I've generated an MSA from BLAST_uniprot.fasta using Clustal Omega (https://www.uniprot.org/align/).
OUTPUT: MSA_uniprot.fasta
The MSA was then manually edited.
OUTPUT: MSA_uniprot_edited.fasta
I've used the following command:
# Create a PSSM from a Fasta MSA (the content of the file in the -subject option is irrelevant)
psiblast -subject BLAST_uniprot.fasta -in_msa MSA_uniprot_edited.fasta -out_pssm models/MSA_uniprot_model.pssm
OUTPUT: MSA_uniprot_model.pssm
I've used the following command:
# Build a HMM from the MSA with hmmbuild
hmmbuild models/MSA_uniprot_model.hmm MSA_uniprot_edited.fasta
OUTPUT: MSA_uniprot_model.hmm
5) "Evaluate your model against human proteins available in SwissProt (accuracy, precision, sensitivity, specificity, MCC)."
a) Define your ground truth/reference by finding all human proteins in SwissProt annotated (and not annotated) with the assigned Pfam ID (provided). Pfam annotations are available from UniProt.
To obtain all human proteins containing our domain, we searched UniProt with the query:
pf00018 AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]"
and found 101 proteins.
OUTPUT: Swiss_Human/PF00018_human.fasta
We then downloaded all human proteins from SwissProt:
reviewed:yes AND organism:"Homo sapiens (Human) [9606]"
OUTPUT: Swiss_Human/Swiss_human.fasta
We then created the BLAST database with the following command:
makeblastdb -dbtype prot -in Swiss_Human/Swiss_human.fasta -parse_seqids
Search with psiblast using the previously generated PSSM
psiblast -in_pssm models/MSA_uniprot_model.pssm -db Swiss_Human/Swiss_human.fasta -num_iterations 1 -evalue 0.001 > results/psiblast_out.txt
Search with hmmsearch using the previously generated HMM
hmmsearch --domtblout results/hmmsearch.hmmer_domtblout models/MSA_uniprot_model.hmm Swiss_Human/Swiss_human.fasta > results/hmmsearch_out.hmmer_align
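To compare the hmmsearch hits with the ground truth, the domtblout file first has to be reduced to a set of protein accessions. A minimal sketch of that step, assuming the standard whitespace-separated domtblout layout (target name in the first column, independent E-value in the 13th) and UniProt-style target names like sp|ACCESSION|NAME. The function name and the 0.001 cutoff are our own choices here (the cutoff just mirrors the psiblast -evalue above), not something hmmsearch imposes.

```python
def parse_domtblout(lines, max_ievalue=0.001):
    """Return the set of accessions with at least one domain hit.

    `lines` is any iterable of text lines from a --domtblout file.
    `max_ievalue` is an assumed cutoff on the per-domain independent E-value.
    """
    hits = set()
    for line in lines:
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target = fields[0]            # column 1: target name
        ievalue = float(fields[12])   # column 13: independent E-value
        if ievalue <= max_ievalue:
            # 'sp|ACCESSION|NAME' -> 'ACCESSION'; otherwise keep the name as is.
            parts = target.split("|")
            hits.add(parts[1] if len(parts) >= 3 else target)
    return hits

# Possible usage with the file produced above:
# with open("results/hmmsearch.hmmer_domtblout") as handle:
#     hmm_hits = parse_domtblout(handle)
```

Collecting hits in a set means a protein with several matching domains is counted only once, which is what we want for a per-protein evaluation.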
The results are in the folder Dataset
We used the HMM to retrieve the proteins from the SwissProt human database
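The metrics requested in point 5 (accuracy, precision, sensitivity, specificity, MCC) all follow from the confusion matrix between the model hits and the Pfam ground truth. A small sketch of that computation, assuming the hits, the PF00018-annotated proteins and the full human SwissProt set are available as Python sets of accessions. The function names are hypothetical, and returning 0.0 for an undefined ratio is our own convention.

```python
import math

def confusion_counts(predicted, positives, universe):
    """Count TP, FP, FN, TN from sets of protein accessions."""
    tp = len(predicted & positives)             # hit and annotated
    fp = len(predicted - positives)             # hit but not annotated
    fn = len(positives - predicted)             # annotated but missed
    tn = len(universe - predicted - positives)  # neither hit nor annotated
    return tp, fp, fn, tn

def evaluate(tp, fp, fn, tn):
    """Return the five metrics; undefined ratios are reported as 0.0."""
    total = tp + fp + fn + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Since the positives are a small fraction of all human SwissProt proteins, accuracy and specificity will be high almost regardless of the model, so MCC and precision/sensitivity are probably the more informative numbers for the report.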
We retrieved the Pfam domains of all human proteins in SwissProt from UniProt. We then kept only the proteins that are in our original dataset and, for each domain combination found, created a new dataset containing the proteins made up of that combination.
The code is in Architecture_datasets.ipynb
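The grouping step can be sketched roughly as below. This is not the notebook code, just an illustration under the assumption that an "architecture" is the unordered set of distinct Pfam IDs of a protein (domain order and repeats are ignored). The function and argument names are hypothetical.

```python
from collections import defaultdict

def group_by_architecture(pfam_by_protein, dataset_ids):
    """Map each domain combination to the dataset proteins that have it.

    pfam_by_protein: dict accession -> list of Pfam IDs (from UniProt)
    dataset_ids:     accessions of the proteins in the original dataset
    """
    groups = defaultdict(set)
    for acc in dataset_ids:
        domains = pfam_by_protein.get(acc)
        if not domains:
            continue  # no Pfam annotation available for this protein
        # Assumed definition: sorted tuple of distinct Pfam IDs.
        architecture = tuple(sorted(set(domains)))
        groups[architecture].add(acc)
    return dict(groups)
```

Each key of the returned dict then corresponds to one of the per-architecture datasets.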
Starting from the original dataset, we retrieved the PDB entries for each protein, and we did the same for all the proteins in the SwissProt database. We did not extend this to all the human proteins in UniProt, because it's rare to find a protein that has a PDB entry and is not in SwissProt.
We then added all the proteins not present in the original dataset which are found as other chains in the same PDB entries.
The code is in PDB_dataset.ipynb
On string-db.org we chose the multiple sequences mode, then copied the UniProt IDs of all the proteins in the original dataset into the form. We then downloaded the result in FASTA format (string_protein_sequences.fasta), retrieved all the STRING IDs with string_dataset.ipynb, and translated them into UniProt IDs on uniprot.org. From UniProt we downloaded the proteins in FASTA format (string_converted.fasta). Finally, with Python we added all the new proteins, not present in the original dataset, that interact with one of the proteins in the original dataset.
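The last step (keeping only the proteins that are new with respect to the original dataset) boils down to a set difference on accessions. A minimal sketch, assuming both FASTA files use UniProt-style headers of the form >sp|ACCESSION|NAME. The function names are hypothetical, and which file plays the role of the "original dataset" is left to the caller.

```python
def uniprot_accessions(fasta_lines):
    """Collect accessions from UniProt-style FASTA headers (>db|ACC|NAME ...)."""
    accessions = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split("|")
            if len(parts) >= 3:
                accessions.add(parts[1])
    return accessions

def new_interactors(original_lines, string_lines):
    """Accessions found in the STRING-derived FASTA but not in the original one."""
    return uniprot_accessions(string_lines) - uniprot_accessions(original_lines)

# Possible usage (the first path is a placeholder for the original dataset FASTA):
# with open("original_dataset.fasta") as orig, open("string_converted.fasta") as conv:
#     added = new_interactors(orig, conv)
```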
"Provide a statistics about the CATH architectures mapping to your domain."
We went to the CATH database and searched for our domain, PF00018. We found that our domain maps to a single architecture: Roll (CATH ID: 2.30). Apart from our domain, this architecture is also present in 9827 other domains.
"Retrieve all PDBs covering your domain (if any) and evaluate their structural similarity."
We retrieved from PDB all structures that cover our domain (PF00018) and belong to Homo sapiens. 128 such structures were found and downloaded as .pdb files. We then created a script which performs all-vs-all structural alignment using the TM-align command-line software. This resulted in 8128 alignment files (one per pair, 128 × 127 / 2). From each of these files we retrieved the TM-score (normalized by the average length of the chains) and constructed a 2D distance matrix for all pairs of structures. From this matrix we created 2 dendrograms using 2 methods: the nearest point algorithm (single linkage) and UPGMA. All results can be found in the folder pdb_structural_similarity (file TMAlign.ipynb).
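For reference, going from pairwise TM-scores to a symmetric distance matrix can be sketched as follows. This is an illustration, not the code in TMAlign.ipynb: it assumes the scores are already parsed into a dict keyed by pairs of structure IDs, and it assumes the distance is defined as 1 - TM-score (the notes above do not say which conversion was used). All names are hypothetical.

```python
from itertools import combinations

def n_pairs(n):
    """Number of unordered pairs in an all-vs-all comparison of n structures."""
    return n * (n - 1) // 2

def distance_matrix(ids, tm_scores):
    """Build a symmetric matrix of distances d = 1 - TM-score.

    ids:       list of structure identifiers (defines row/column order)
    tm_scores: dict mapping (id_a, id_b) in either order to a TM-score
    """
    index = {name: i for i, name in enumerate(ids)}
    size = len(ids)
    dist = [[0.0] * size for _ in range(size)]
    for a, b in combinations(ids, 2):
        # Accept the pair in whichever order it was stored.
        score = tm_scores[(a, b)] if (a, b) in tm_scores else tm_scores[(b, a)]
        d = 1.0 - score  # assumed conversion from similarity to distance
        dist[index[a]][index[b]] = d
        dist[index[b]][index[a]] = d
    return dist
```

n_pairs(128) gives 8128, matching the number of alignment files. If SciPy is used for the clustering, scipy.cluster.hierarchy.linkage accepts the condensed form of such a matrix, and its "single" and "average" methods correspond to the nearest point algorithm and UPGMA respectively.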