ConSurf
Link to Elixir
Taken from http://consurf.tau.ac.il/overview.html
The ConSurf server [1] is a bioinformatics tool for estimating the evolutionary conservation of amino/nucleic acid positions in a protein/DNA/RNA molecule based on the phylogenetic relations between homologous sequences. The degree to which an amino (or nucleic) acid position is evolutionarily conserved is strongly dependent on its structural and functional importance; rapidly evolving positions are variable while slowly evolving positions are conserved. Thus, conservation analysis of positions among members from the same family can often reveal the importance of each position for the protein (or nucleic acid)'s structure or function. In ConSurf, the evolutionary rate is estimated based on the evolutionary relatedness between the protein (DNA/RNA) and its homologues and considering the similarity between amino (nucleic) acids as reflected in the substitutions matrix [2,3]. One of the advantages of ConSurf in comparison to other methods is the accurate computation of the evolutionary rate by using either an empirical Bayesian method or a maximum likelihood (ML) method [3].
-
Glaser, F., Pupko, T., Paz, I., Bell, R.E., Bechor-Shental, D., Martz, E. and Ben-Tal, N. (2003) Bioinformatics, 19, 163-164.
-
Mayrose,I., Graur,D., Ben-Tal,N. and Pupko,T. (2004) Mol. Biol. Evol., 21, 1781-1791.
-
Pupko,T., Bell,R.E., Mayrose,I., Glaser,F. and Ben-Tal,N. (2002) Bioinformatics, 18 S71-S77.
HOWTO Install
Below You will find the guidlines on how to succesfully install ConSurf on your machine. Current manual is designed for Debian OS.
Installing libraries
In order for ConSurf to work properly please, install the following libraries:
autoconf
g++
libexpat-dev
libtool
perl
or simply run
sudo apt-get install autoconf g++ libexpat-dev libtool perl -y
Perl and BioPerl
It is mandatory to have perl installed on your machine. After Perl is installed please run the following commands in order to download and install perl modules
cpan
in CPAN run the following commands
sudo cpan force install Bio::Perl Config::IniFiles List::Util
3rd Party Software
ConSurf uses several programs to calculate the outputs. The Copyrights to these programs belongs to their owner. PLEASE MAKE SURE YOU FOLLOW THE LICENSE BY EACH OF THESE PROGRAMS.
The software can be found in the folder 3rd party software
. Here are the instructions to install the software itself.
Rate4Site
Rate4Site - is the main algorithm behind ConSurf.
Installation:
The sources are already in the folder 3rd party software
. ConSurf requires the fast and slow versions. In order to obtain both of them simply run rate4site_build.sh
script:
./rate4site_build.sh
ClustalW
ClustalW - please download from EBI; ftp://ftp.ebi.ac.uk/pub/software/clustalw2/
In 3rd party software
folder You can already find the executable.
blastpgp
blastpgp can be found in 3rd party software/blast-2.2.26/bin
CD-HIT
cd-hit can be found in 3rd party software/cdhit
To compile simply run:
./cd-hit_build.sh
("Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9)
MUSCLE MSA program
can be found in 3rd aprty software
("MUSCLE: a multiple sequence alignment method with reduced time and space complexity". Edgar R.C. (2004), BMC Bioinformatics 5: 113.)
Databases
Please download SwissProt and TrEMBL databases and format the databases for Blast search (using blast formadb which is part of blast system. manual )
- update the location of SwissProt to point the location of
SWISSPROT_DB
- update the location of TrEMBL to point the location of
UNIPROT_DB
.
Databases can be downloaded in FASTA format from [here] (http://www.uniprot.org/downloads) or just run the following commands:
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
gunzip uniprot_trembl.fasta.gz
Configuration
In order to run ConSurf is necessary to update the configuration file - consurfrc.default
- In
[programs]
section please enter the full path to the executables of 3rd party software covered earlier (Example:MUSCLE=\home\path_to_ConSurf\3rd party software\muscle3.8.31
) - in
[databases]
section please enter the full path to the DBs. (Example:SWISSPROT_DB=\home\path_to_DB_Folder\uniprot_sprot.fasta
)
Installing ConSurf
To install ConSurf run the following commands in ConSurf folder:
aclocal
automake --force-missing --add-missing
autoconf
./configure
sudo make
sudo make install
HOWTO Run
The ConSurf works in several modes
- Given Protein PDB File.
- Given Multiple Sequence Alignment (MSA) and Protein PDB File.
- Given Multiple Sequence Alignment (MSA), Phylogenetic Tree, and PDB File.
The script is using the user provided MSA (and Phylogenetic tree if available) to calculate the conservation score for each position in the MSA based on the Rate4Site algorithm [2].
When running in the first mode, the script builds the MSA for the given protein based on the ConSurf protocol.
Usage:
consurf -PDB <PDB FILE FULL PATH> -CHAIN <PDB CHAIN ID> -Out_Dir <Output Directory>
Requirements for successful run:
- User must have
<PDB_ID>_<CHAIN_ID>_protein_seq.fas
in the directory specified as-workdir
. The file should contain the protein sequence in FASTA format. More information and the solution is available in Issue #14 - User is strongly encouraged to specify the
<PDB_ID>_<CHAIN_ID>.protein_query.blast
file. It is created in the directory specified as-workdir
by the ConSurf, but sometimes ConSurf might fail to find it. More information and the solution is available in Issue #17
MANDATORY INPUTS
-PDB <PDB FILE FULL PATH> - PDB File
-CHAIN <PDB CHAIN ID> - Chain ID
-Out_Dir <Output Directory> - Output Path
MSA Mode (Not Using -m)
-MSA <MSA File Name>
(MANDATORY IF -m NOT USED)
-SEQ_NAME <"Query sequence name in MSA file">
(MANDATORY IF -m NOT USED)
-Tree <Phylogenetic Tree (in Newick format)>
(optional, default building tree by Rate4Site).
Building MSA (Using -m)
-m Builed MSA mode
-MSAprogram ["CLUSTALW"] or ["MUSCLE"] # default: MUSCLE
-DB ["SWISS-PROT"] or ["UNIPROT"] # default: UniProt
-MaxHomol <Max Number of Homologs to use for ConSurf Calculation> #deafult: 50
-Iterat <Number of PsiBlast iterataion> # default: 1
-ESCORE <Minimal E-value cutoff for Blast search> # default: 0.001
-BlastFile <pre-calculated blast file provided by the user>
Rate4Site Parameter
(see http://consurf.tau.ac.il/overview.html#methodology and http://consurf.tau.ac.il/overview.html#MODEL for details)
-Algorithm [LikelihoodML] or [Bayesian] # default: Bayesian
-Matrix [JTT] or [mtREV] or [cpREV] or [WAG] or [Dayhoff] # default JTT
Examples
- Basic Build MSA mode (using defaults parameters)
consurf -PDB example/1lk2.pdb -CHAIN A -Out_Dir output -m
Example:
cd ConSurf
consurf -PDB example/1lk2.pdb -CHAIN A -Out_Dir output -m --workdir workdir -BlastFile example/1lk2_A.protein_query.blast
Method Description
Taken from http://consurf.tau.ac.il/overview.html
Given the amino or nucleic acid sequence (can be extracted from the 3D structure), ConSurf carries out a search for close homologous sequences using BLAST (or PSI-BLAST) [4,5]. The user may select one of several databases and specify criteria for defining homologues. The user may also select the desired sequences from the BLAST results. The sequences are clustered and highly similar sequences are removed using CD-HIT [6]. A multiple sequence alignment (MSA) of the homologous sequences is constructed using MUSCLE(default) or CLUSTALW. The MSA is then used to build a phylogenetic tree using the neighbor-joining algorithm as implemented in the Rate4Site program [7]. Position-specific conservation scores are computed using the empirical Bayesian or ML algorithms [2,3]. The continuous conservation scores are divided into a discrete scale of nine grades for visualization, from the most variable positions (grade 1) colored turquoise, through intermediately conserved positions (grade 5) colored white, to the most conserved positions (grade 9) colored maroon. The conservation scores are projected onto the protein/nucleotide sequence and on the MSA.
-
Glaser, F., Pupko, T., Paz, I., Bell, R.E., Bechor-Shental, D., Martz, E. and Ben-Tal, N. (2003) Bioinformatics, 19, 163-164.
-
Mayrose,I., Graur,D., Ben-Tal,N. and Pupko,T. (2004) Mol. Biol. Evol., 21, 1781-1791.
-
Pupko,T., Bell,R.E., Mayrose,I., Glaser,F. and Ben-Tal,N. (2002) Bioinformatics, 18 S71-S77.
-
Altschul,S.F., Wootton,J.C., Gertz,E.M., Agarwala,R., Morgulis,A., Schaffer,A.A. and Yu,Y.K. (2005) FEBS J., 272, 5101-5109.
-
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25, 3389-3402.
-
Li,W. and Godzik,A. (2006) Bioinformatics, 22, 1658-1659.
-
Pupko, T., Bell, R.E., Mayrose, I., Glaser, F. and Ben-Tal, N. (2002) Bioinformatics, 18, S71-77.
HOWTO Using Vagrant
In the folder vagrant_ConSurf
You can find the Vagrantfile, file necessary to run a Vagrant Box with ConSurf on it.
Intallation of Vagrant will not be explained in here.
In order to start the Box, execute the following:
cd /<path_to_dir>/vagrant_ConSurf
vagrant up --no-destroy-on-error
vagrant ssh
After the box has been started and You are connected to the VM through SSH, run export LC_ALL=C
.
All dependency libraires from section Installing Libraries should have been installed during execution of vagrant up
, rest can be set up according to Installation guidelines presented earlier.