This script computes conservation score per aminoacid for a protein.
Many thanks to authors of BLAST+ suite, CD-HIT, MUSCLE and Jensen-Shannon divergence method. This script is based on ConSurf DB and ConCavity paper
- Download BLAST+ suite from here.
- Install/unpack the BLAST+ archive and make sure your directory is PATH.
- Download CD-HIT from here.
- Unpack CD-HIT archive, run make command and make sure cd-hit is in PATH.
- Download MUSCLE from here.
- Unpack MUSCLE archive and add the directory to PATH.
- Download python script for Jensen-Shannon divergence method from here.
- Finally, unpack the pipeline archive from this repository.
- Copy the archive contents to the directory where Python script for Jensen-Shannon divergence (
score_conservation.py
) is located.
- Create a directory where you want to store the databases.
- Add environmental variable
BLASTDB={path to the directory}
- Change directory to BLASTDB directory and run these commands. Each command will download and parse one database. We recommend to use SwissProt and UniRef90, but using TrEBML instead of UniRef90 is also possible. Note that it will take a lot of time.
curl {UniRef90 URI} | gunzip | makeblastdb -out uniref90 -dbtype prot -title UniRef90 -parse_seqids
curl {SwissProt URI} | gunzip | makeblastdb -out swissprot -dbtype prot -title SwissProt -parse_seqids
curl {TrEMBL URI} | gunzip | makeblastdb -out trembl -dbtype prot -title TrEMBL -parse_seqids
- cd to the directory where you copied
calc_conservation.sh
. - Run this command:
./calc_conservation.sh {path to your fasta file}
PSIBLAST is ran with your fasta input on SwissProt database (1 iteration, eval=1e-5).
The result is filtered (min 80% coverage and seq identity between 30% and 95%), then clustered with CD-HIT (default parameters).
If we are left with less then 50 hits, repeat the same for UniRef90 database.
Run MUSCLE and Jensen-Shannon divergence on the query hits.