All documentation for Synima can be found at: https://github.com/rhysf/Synima Synima (Synteny Imager) is a program for visualising syntenic regions from orthologous genes, infered by reciprocal best hits (RBH) from BLAST, or OrthoMCL, followed by DAGchainer. OrthoMCL and DAGchainer are bundled with Synima, and wrapper scripts are provided in /util/. A number of the scripts launch several jobs that can be run on a grid (LSF, GridEngine, UGER are supported). Prerequisites: -------------- Perl Bio-Perl Python R Legacy-blast Getting started / example pipeline: ----------------------------------- git clone git@github.com:rhysf/Synima.git cd Synima/examples perl ../SynIma.pl -a Repo_spec.txt.dagchainer.aligncoords \ -b Repo_spec.txt.dagchainer.aligncoords.spans Full example pipeline --------------------- git clone https://github.com/rhysf/Synima.git cd Synima/examples perl ../util/Create_full_repo_sequence_databases.pl -r ./Repo_spec.txt perl ../util/Blast_grid_all_vs_all.pl -r ./Repo_spec.txt perl ../util/Blast_all_vs_all_repo_to_OrthoMCL.pl -r ./Repo_spec.txt ALTERNATIVELY 1: ../util/Blast_all_vs_all_repo_to_RBH.pl -r ./Repo_spec.txt ALTERNATIVELY 2: ../util/Blast_all_vs_all_repo_to_Orthofinder.pl -r ./Repo_spec.txt perl ../util/Orthologs_to_summary.pl -o all_orthomcl.out perl ../util/DAGchainer_from_gene_clusters.pl -r ./Repo_spec.txt \ -c GENE_CLUSTERS_SUMMARIES.OMCL/GENE_CLUSTERS_SUMMARIES.clusters perl ../SynIma.pl -a Repo_spec.txt.dagchainer.aligncoords \ -b Repo_spec.txt.dagchainer.aligncoords.spans Description of example pipeline: -------------------------------- Synima visualises aligncoords and aligncoords.spans files, which are tab delimited text based coordinate files describing sub-genomic regions of synteny between two or more genomes. Having cloned a local copy of all the code using git clone, and navigated to the examples sub-folder, the first step is to create a 'repo sequence database'. Create_full_repo_sequence_databases.pl reads in a Repository specification file (example Repo_spec file provided in examples) and outputs two fasta files (Repo_spec.txt.all.cds and Repo_spec.txt.all.pep) which are merged from each of the genome folders and used later. The Input Repo_spec files take the format of: // Genome CNB2 Annotation CNB2_FINAL_CALLGENES_1 // Genome Cryp_gatt_IND107_V2 Annotation Cryp_gatt_IND107_V2_FINAL_CALLGENES_1 // For each genome ID (E.g. CNB2) listed in the Repo_spec, a corresponding sub-folder with the same name must be present in the same directory as the Repo_spec. In each genome folder, a genome.fasta must be present, and named [genome-id].genome.fa E.g. CNB2/CNB2.genome.fa. Additionally, Annotation files are specified in the Data_repo, which should refer to the prefix of 3 additional files in each genome directory: [annotation-id].annotation.gff3 [annotation-id].annotation.cds [annotation-id].annotation.pep The GFF should have gene or mRNA features, with identifiers that are also in the .cds and .pep FASTA files, which contain the Coding sequence (cds) and peptide (pep) sequences respectively. GFF ID's can be listed as such: cgbd CNB_WM276_v2 mRNA 450360 453023 . + . ID=012346;Parent=012345;Name=actin or any other way that is specified explicitly for Create_full_repo_sequence_databases.pl. Gene ID's in the two FASTA files can be a single word i.e. >CBBG_0001 or they can have multiple fields e.g. >01 gene_id=02 locus=03 name="ATPS" genome=Esch_coli analysisRun=Esch_coli_Augustus Note: Sequences from separate genomes must have distinct names. i.e. if chr1 and gene1 are names in genome1, genome2 should not have any sequences called chr1 and gene1. Non-unique names causes issues with BLAST resulting in errors later in the pipeline. Gene and contig names should also ideally be alphanumerical, and avoid symbols such as '='. The second step is to run all vs all BLAST hits using Blast_grid_all_vs_all.pl. Peptide or nucleotide alignments are possible, although peptide is generally recommended. This step can take a long time, and therefore, the option of distributing jobs to a cluster via LSF, SGE and UGE is provided (if available). This step will create folders in each of the genome folders called RBH_blast_[PEP/CDS]. This step requires BLAST legacy (formatdb and blastall) to be in $PATH. The third step is to run OrthoMCL or reciprocal best hits (RBH) on the BLAST output using Blast_all_vs_all_repo_to_OrthoMCL.pl or Blast_all_vs_all_repo_to_RBH.pl respectively. This will create an OMCL_outdir or RBH_outdir, containing all_orthomcl.out or PEP.RBH.OrthoClusters. RBH will likely be less accurate than OrthoMCL, but OrthoMCL has a limited number of genomes/genes that can be compared due to memory constraints. The forth step is to summarise the OrthoMCL output (OMCL_outdir/all_orthomcl.out), or RBH output (RBH_outdir/PEP.RBH.OrthoClusters) or Orthofinder output (Orthofinder_outdir/ Orthogroups.csv) using Orthologs_to_summary.pl. This step will create ortholog predictions in the output folders GENE_CLUSTERS_SUMMARIES.OMCL or GENE_CLUSTERS_SUMMARIES.RBH or GENE_CLUSTERS_SUMMARIES.Orthofinder respectively. The fifth step is to run DAGChainer on the ortholog summary using DAGchainer_from_gene_clusters.pl. The sixth and final step is to run SynIma.pl on the aligncoords and aligncoords.spans output from DAGChainer. Once you have identified orthologs with the previous steps 1-5, you can re-run only this step with updated parameters to generate new figures. If Synima finds the config.txt file (generated from the first time run, and in the same folder as the figure, by default SynIma-output/config.txt), it will run using the parameters specified in this file (rather than use any updated parameters on the command line). Config.txt includes a number of parameters that can change the appearance or layout of the figure. I recommend plotting both chromosome/contig synteny (c) and gene synteny (g) separately, as either can give greater clarity depending on the input. By default, Synteny is shown as a partially transparent (alpha factor 0.5) azure4, although this can be changed to any other R color (E.g. http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf). Due to the color transparency, overlapping synteny will appear shaded. Individual script details: -------------------------- SynIma and wrapper script parameters are shown below, followed by their default settings given in []. Create_full_repo_sequence_databases.pl Parameters: -r ./Repo_spec.txt [] Optional: -f Feature wanted from GFF [mRNA] -s Seperator in GFF description for gene names (\" ; etc) [;] -d GFF description part number with the parent/gene info [0] -m Remove additional comments in column [Parent=] Notes: Will copy all transcripts and specified features from GFF into primary fasta files Blast_grid_all_vs_all.pl Parameters: -r ./Repo_spec.txt [] Optional: -t Type of alignment. Can be either PEP (peptide) or CDS (Coding sequence) [PEP] -c Number of best matches to capture between species [5] -s Number of top hits to capture in self-searches for paralogs [1000] -e E-value cutoff [1e-20] -o Blast cmds outfile [blast.$type.cmds] -g Run commands on the grid (y/n) [n] -p Platform (UGER, LSF, GridEngine) [UGER] -q Queue name [short] Notes: BLAST legacy tools formatdb and blastall are needed in $PATH Blast_all_vs_all_repo_to_OrthoMCL.pl Parameters: -r ./Repo_spec.txt [] Optional: -t Type of alignment. Can be either PEP (peptide) or CDS (Coding sequence) [PEP] -o Out directory [OMCL_outdir] Blast_all_vs_all_repo_to_Orthofinder.pl Parameters: -r ./Repo_spec.txt [] Optional: -t Type of alignment. Can be either PEP (peptide) or CDS (Coding sequence) [PEP] -o Out directory [Orthofinder_outdir] Blast_all_vs_all_repo_to_RBH.pl Parameters: -r ./Repo_spec.txt Optional: -t Type of alignment. Can be either PEP (peptide) or CDS (Coding sequence) (PEP/CDS) [PEP] -o Out directory [RBH_outdir] -g Run commands on the grid (y/n) [n] -p Platform (UGER, LSF, GridEngine) [UGER] -q Queue name [short] Orthologs_to_summary.pl Parameters: -o Ortholog file (E.g. PEP.RBH.OrthoClusters, all_orthomcl.out, Orthogroups.csv) [] Optional: -t Type of clustering (OMCL, RBH, Orthofinder) [OMCL] -d Outdir from Blast_all_vs_all_repo_to_OrthoMCL (if used) [OMCL_outdir] -r Repo Spec [./Repo_spec.txt] -p Repo Spec Peptide file [./Repo_spec.txt.all.PEP] DAGchainer_from_gene_clusters.pl Parameters: -r ./Repo_spec.txt [] -c Ortholog cluster data (E.g. ORTHOMCLBLASTFILE.clusters) [] Optional: -z File containing a list of genomes to restrict the analysis to [] -i Minimum number of paired genes in a single dagchain [4] -o Cmds outdir [dagchainer_rundir] -l Cmds outfile [cluster_cmds] -g Run commands on the grid (y/n) [n] -p Platform (UGER, LSF, GridEngine) [UGER] -q Queue (hour, short, long) [short] -v Verbose (y/n) [n] Notes: GFF specifications (-f, -s, -d, -m) need to be the same as specified during Blast_all_vs_all_repo_to_OrthoMCL.pl or Blast_all_vs_all_repo_to_RBH.pl"; SynIma.pl Parameters: -c ./Config.txt [$cwd/SynIma-output/config.txt] -a ./Aligncoords [] -b ./Aligncoords.spans [] Optional: -e Genome FASTA filename extension (e.g. ./SynIma/genome1/genome1.genome.fa etc.) [genome.fa] -t Aligncoords.spans 2 [] -u Aligncoords.spans 3 [] -k Gene IDs 1 (1 per line) [] -l Gene IDs 2 (1 per line) [] -o Gene IDs 3 (1 per line) [] -r Run full program (y) or just create config (n) [y] -v Verbose output (y/n) [n] Plot Opts: -i Width of figure in pixels [1100] -j Height of figure in pixels (num of genomes * 100) -g Fill in chromosome/contig synteny (c) or gene synteny (g) [c] -z Plot individual genes (y/n) [n] -x Order of genomes from bottom to top seperated by comma -n Genome labels from bottom to top seperated by comma -w number of lines for left hand margin [12] Notes: Config.txt will be made automatically if not present, and read automatically if it is. Config.txt specifies order of genomes, chromosomes, colours, and other plot options. Config.txt can be manually edited after creation. Default genome labels will be as they appear in aligncoords Order of genomes must have names as they appear in aligncoords Aligncoords.spans and Gene ID files will be highlighted according to the config