/mcscan

Command-line program to wrap dagchainer and combine pairwise results into multi-alignments in column format

Primary LanguageC++

Welcome to MCscan's documentation!

The MCscan download page can be accessed here.

Usage

This software provides a clustering module for viewing the relationship of colinear segments in multiple genomes (or heavily redundant genomes). It takes the predicted pairwise segments from dynamic programming (DAGchainer in particular) and then try to build consensus segments from a set of related, overlapping segments.

Certain part of this package (dagchainer.cc) is based on the TIGR software DAGchainer. The program used this as an initial step to generate pairwise segments.

Along with the DAGchainer guidelines, all code is copiable, distributable, modifiable, and usable without any restrictions.

Installation

Note

MCscan currently will only run on linux or cygwin platform, as it is dependent on GNU function.

Simply put mcscan.tar.gz in any directory:

$ tar zxf mcscan_version.tar.gz
$ cd mcscan_version.tar.gz/ && make

the compiled codes are within the same directory as the source.

Then put copy of MCL executable within the same folder as MCscan (MCL program downloadable here).

Inputs and outputs

MCscan reads in at least two sources of data: .blast file and .bed file. This may seem daunting at first, but these are very easy to retrieve. Have a look at the at_at.blast, at_at.bed in the folder. In the actual execution, MCL is used to generate mcl file (at_at.mcl), which is used in multiple synteny construction.

Here is what can be used to genenerate the files.

The blast file is the following tab-delimited format:

gene1    gene2    e-value

easily genenerated from a m8 blast output format:

$ cut -f 1,2,11 xyz.m8 > xyz.blast.unfiltered

The first thing please ensure that for each gene pair, only one e-value is reported, the blast output normally would contain multiple HSPs, a convenience script is attached to filter all the redundant pairs:

$ python filter_blast.py xyz.blast.unfiltered xyz.blast

The .bed file contains the following tab-delimited format (see bed format):

chromosome_id    start    stop    gene_name

notice when you compare multiple genomes, formulate your molecule name carefully to avoid duplicated names. The .bed file can usually be generated by parsing the gene annotation file provided by the sequencing group (usually the sequencing project ftp will provide a .gff3 file).

Once you have everything ready, put them in the same folder. We need to generate .mcl file if this is the first run (also take a look at the example in run.sh):

$ more xyz.blast | mcl - --abc --abc-neg-log -abc-tf 'mul(0.4343), ceil(200)' -o xyz.mcl

Some might encounter a problem exec the mcl command, in which case the mcl binary needs to be rebuilt from here. After the first time you run it (the mcl file has been generated). You can simply use:

$ ./mcscan xyz

Parameters (for advanced user)

The help screen:

Usage: mcscan [OPTION...] prefix_fn
MCSCAN -- multiple collinearity scan (compiled Jul  6 2010 15:52:03)

Reference:
 Tang,H., Wang,X., Bowers,J.E., Ming,R., Alam,M., Paterson,A.H.
 Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps
 Genome Research (2008) 18, 1944-1954.

  -a                         only builds the pairwise blocks (.aligns file)
  -A                         use base pair dist instead of gene ranks
  -b                         limit within genome synteny (e.g. Vv-Vv) mapping
  -e, --e_value=E_VALUE      alignment significance
  -g, --gap_score=GAP_SCORE  gap penalty
  -k, --match_score=MATCH_SCORE   final score=MATCH_SCORE+NUM_GAPS*GAP_SCORE
  -p, --pivot=PIVOT          PIVOT is the reference genome, make it two letter
                             prefix inyour .bed file, everything else will be
                             aligned to the reference
  -s, --match_size=MATCH_SIZE   number of genes required to call synteny
  -u, --unit_dist=UNIT_DIST  average intergenic distance
  -?, --help                 Give this help list
      --usage                Give a short usage message
  -V, --version              Print program version

Mandatory or optional arguments to long options are also mandatory or optional
for any corresponding short options.

Report bugs to <bao@uga.edu>.

The default values are quite generic and should work for many instances. The following are more detailed information for users who wish to tune their results.

The pairwise synteny formula is roughly (Haas et al. 2004), note that DIST_X, DIST_Y is in base pair:

FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)/UNIT_DIST*GAP_SCORE

The multiple synteny formula is roughly, now DIST_X, DIST_Y is the distance in the partial order graph (not in base unit, but in gene index unit):

FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)*GAP_SCORE

Sometimes you may want to run just the pairwise synteny on .blast and .bed files, then you can try:

$ ./mcscan at_at -a

Note that to run this, .mcl file is not required, the result is now slightly different, since MCscan uses the mcl file to filter the BLAST hits.

Walkthrough example

There are, by default at_vv sets of files and os_sb sets of files, which is basically two different projects.

First example, let us compare Os to Sb (rice to sorghum), just default settings, run:

$ ./mcscan os_sb

It takes about one minute to run, the result is best viewed in EXCEL. The first part of the file lists all the parameters of the program. The result is separated with a line like this:

## View 11: pivot Sb02

This is called a view, each view uses a different chromosome as the reference. Then the blocks following this line is the multiply aligned blocks. The first column is numerical identifier, the second column is the actual pivot. Then following columns are the regions that are aligned to the pivot. The alignments between rice and sorghum are in fact complicated by one or more shared WGDs, creating several columns but mostly are four regions matching each other.

For the second example, we wish to align Arabidopsis to grape, and use grape as the reference genome, but we need to do it a little differently. Unlike the first example, we are not interested in WGD in grape in this case, and we only wish to see the grape used as pivot. Therefore, we modify the pivot:

$ ./mcscan at_vv -p Vv -b

This trick -b will limit any Vv-Vv matches (in fact this is an older duplication called gamma) in the output.

There are two outputs. .aligns file and .blocks file, corresponding to pairwise and multiple synteny respectively. You will find the .aligns file very useful too, sometimes. But this is essentially similar to the output of DAGchainer (adding a few statistics and change the default paramters).

Changelog

  • May 12, 2007 (version <0.5) initial release.
  • Aug 05, 2007 (version 0.5) add the option of of a reference genome
  • Oct 13, 2007 (version 0.6) add convenience python script to streamline the process
  • Mar 07, 2008 (version 0.7) implement statistical test for pairwise syntenic blocks
  • Nov 13, 2008 (version 0.8) partial-order graph for alignment

Contact

Any questions, problems, bugs are welcome and should be dumped to

Haibao Tang : bao at uga dot edu

Plant Genome Mapping Laboratory, University of Georgia