Scripts to parse and analyse MCScanX collinearity output.
A typical MCScanX analysis following gene finding using Augustus or Braker might consist of:
- Run Diamond:
>> diamond makedb --in augustus.aa -d augustus.aa >> diamond blastp -e 1e-5 -p 8 -q augustus.aa -d augustus.aa -a augustus.aa.vs.self >> diamond view -a augustus.aa.vs.self.daa -o Xyz.blast
- Generate CDS (required for Ka/Ks calculation):
>> /path/to/augustus-3.2.1/scripts/getAnnoFasta.pl --seqfile=mygenome.fasta augustus.gff
- Generate GFF:
>> perl -lane 'print join("\t",$F[0],$F[8],$F[3],$F[4]) if ($F[2]eq"transcript")' augustus.gff > Xyz.gff
- Run MCScanX:
>> mkdir results >> mv Xyz* results/ >> /path/to/MCScanX/MCScanX results/Xyz >> /path/to/MCScanX/duplicate_gene_classifier results/Xyz
This will produce the necessary files to then run the downstream analysis scripts outlined below.
This script annotates the Xyz.collinearity output file from MCScanX analysis with Ka and Ks values for all pairs of genes.
Requires ClustalO to be in $PATH
and some BioPerl libraries.
Type add_kaks_to_MCScanX.pl -h
to see the options:
OPTIONS
-i|--in [FILE] : MCScanX collinearity file
-p|--prot [FILE] : fasta file of protein sequences
-c|--cds [FILE] : fasta file of corresponding CDS (nucleotide)
-t|--threads [INT] : number of aligner threads (default = 1)
-h|--help : this message
MCScanX collinearity file annotated with Ka and Ks per pair of genes.
Calculates collinearity score based on the number of collinear genes divided by the total number of genes within that defined block (from Flot et al. 2013). Also calculates average Ka and Ks per block, if run on annotated MCScanX file.
-i|--in [FILE] : MCScanX collinearity file
-g|--gff [FILE] : MCScanX GFF file
-k|--kaks : parse collinearity file to get average ka & ks per block
-h|--help : print this message
A score file with per-block collinearity scores, average Ka and Ks etc. Also writes an MCScanX collinearity file reformatted to remove some of the strange leading whitespaces etc.
Searches for breaks in collinearity defined as occurrences where homologous blocks cannot be aligned along scaffolds without some rearrangement. Also counts cases where homologous regions are found physically linked on the same scaffold.
Type calculate_collinarity_breakpoints.pl -m
to see full help and definitions.
-i|--in [FILE] : MCScanX collinearity file
-g|--gff [FILE] : MCScanX GFF file
-s|--score [FILE] : score file with average Ks per block
-k|--ks [FLOAT] : Ks threshold to define homologous block (default <= 0.5)
-b|--blocks : also print blocks_per_gene file (default=no)
-h|--help : print help
-m|--morehelp : print more help
A collinearity break is introduced if there is a mistmatch in the identity of the upstream or downstream homologous blocks from one collinear region to another. Thus, if a[i]
is the identity of the focal block on chrom a
, homologous with block b[j]
on chrom b
, collinearity is broken if a[i+/-1] != b[j+/-1]
. A caveat is when the focal block is located at the terminus of a scaffold; here, collinearity is not broken if the homologous scaffold can be orientated 'away' from the focal scaffold.
- Collinearity cannot be broken by scaffolds containing only 1 collinear block.
- Collinearity cannot be broken when focal and subject blocks are BOTH terminal on their respective scaffolds.
SUBJECT is SINGLETON
a----1--2-----3-4----5---
b --2---
a and b are inferred collinear, since b contains only 1 block
FOCAL and SUBJECT are TERMINAL
a----1--2-----3-4----5---
c ---5---6--7--8--
a and c are collinear, since orientation of scaffolds WRT each other is unknown and may align as shown
FOCAL or SUBJECT are TERMINAL
a----1--2-----3-4----5---
d -4--10-11----12---
a and d show a break in collinearity, since i=4, a[4+1]=5, but b[j+1]=10, and similarly a[i-1]=2 but still cannot align with b if orientated in the other direction
neither FOCAL or SUBJECT are TERMINAL
a----1--2-----3-4----5---
e 18---2--19-20----21-----
a and e are clearly not collinear
Searched for cases of homologous regions physically linked on the same scaffolds. Decomposes any such cases into tandem arrays or palindromes.
-i|--in [FILE] : MCScanX collinearity file (annotated with Ks)
-g|--gff [FILE] : MCScanX GFF file
-k|--ks : only examine blocks with Ks <= this threshold
-h|--help : print this message
- arrays file: all results
- arrays.tandem: tandem arrays
- arrays.palindrome: palindromic arrays
There are a bunch of other helper scripts for Circos plotting... type -h
for options.