/enredo

Graph-based method to detect co-linear segments among multiple genomes

Primary LanguageShellOtherNOASSERTION

Enredo v0.5

=======================================
 SYNOPSIS
=======================================

Usage: enredo [options] anchors_file.txt

Options:
 --max-gap-length: maximum allowed gap between two anchors (def: 200000)
 --min-score: minimum score required to accept a hit (def: 0)

 --max-path-dissimilarity: merge alternative paths in the graph if their
      dissimilarity is up to this threshold (def: 4)
 --simplify-graph: try to split small edges in order to lengthen
      other blocks. Ranges from 0 (none) to 7 (more aggressive). (def: 7)

 --min-length: minimum length of valid block (def: 100000)
 --min-regions: minimum number of regions in a valid block (def: 2)
 --min-anchors: minimum number of anchors in a valid block (def: 3)
 --[no]bridges: validate invalid blocks that bridge two valid blocks (def: yes)
 --all: print all the blocks (overwrite previous values)

 --output: write output to that file (def: STDOUT)

 --help: prints this help

=======================================
 DESCRIPTION
=======================================

This program reads an input file containing the results of mapping a set of
anchors onto several genomes. Ideally, each anchor should be found several
times. This program links the anchors according to their positions on the
genomes and try to concatenate the links (edges) when possible. Let be G1 and
G2 two (pieces of) genomes and A, B, C... some anchors.

G1: A---B---C---D---E---F---G
G2: A----B--C-----E-----F---G

The initial graph will have a link from A to B which is found on G1 from 1
to 4 and on G2 from 1 to 5. There will be another link from B to C which
is found on G1 from 4 to 8 and on G2 from 5 to 8. Here is the full list of
links for the previous example:

A-B: G1(1-4); G2(1-5)
B-C: G1(4-8); G2(5-8)
C-D: G1(8-12)
C-E: G2(8-14)
D-E: G1(12-16)
E-F: G1(16-20); G2(14-20)
F-G: G1(20-24); G2(20-24)

A-B and B-C can be concatenated because A-B is always followed by B-C:

A-B-C: G1(1-8); G2(1-8)
C-D: G1(8-12)
C-E: G2(8-14)
D-E: G1(12-16)
E-F: G1(16-20); G2(14-20)
F-G: G1(20-24); G2(20-24)

The same happens with C-D and D-E:

A-B-C: G1(1-8); G2(1-8)
C-D-E: G1(8-16)
C-E: G2(8-14)
E-F: G1(16-20); G2(14-20)
F-G: G1(20-24); G2(20-24)

And E-F and F-G:

A-B-C: G1(1-8); G2(1-8)
C-D-E: G1(8-16)
C-E: G2(8-14)
E-F-G: G1(16-24); G2(14-24)

Using this simple operation, we can use the resulting links to define colinear
regions.

=======================================
 OPTIONS
=======================================

* FOR BUILDING THE GRAPH *

--max-gap-length:
If the distance between two anchors is larger than this value, they wont be
linked. A value of 0 disable this option.
Default: 20000

--min-score:
Ignore anchors with a score lower than this value.
Default: 0

* FOR EDITING THE GRAPH *

--max-path-dissimilarity:
Allow to merge down paths if their dissimilarity is lower or equal to this
threshold. For instance, the dissimilarity between paths A--B--C--D and
A--B--C--E--D is one as there is one extra anchor in the second path. In
this other example: A--B--C--D and A--C--B--D, the dissimilarity is 2 because
in the second path an extra C has been inserted between A and B, and there is
a C missing between B and D.
Default: 4

* FOR DEFINNING THE VALID COLINEAR REGIONS *

--min-length:
Only blocks which are longer than this limit will be printed out. The length
of the block correspond to the shortest of the sequences in that block.
Default: 100000

--min-regions:
Only blocks containing at least this number of sequences will be dumped.
Default: 2

--min-anchors:
Only blocks having at least this number of anchors will be dumped. At the
beginning, the blocks have only two anchors and a new anchor is added in
each concatenation. The more anchors, the better supported is the block.
Default: 3

--bridges: validate invalid blocks that bridge two valid blocks (def: yes)

--all: print all the blocks (overwrite previous values)
Prints everything, even short blocks with one single region.


=======================================
 INPUT FILE
=======================================

The input file contains the result of mapping a set of anchors onto several
genomes. Anchors are expected to be sorted by organism, chromosome and
position. Each line should correspond to an anchor and each line contains 6
values separated by tabs. The six values are: the anchor name (a string),
the species name (a string), the chromosome name (a string), the start
position (an integer value), the end position (an integer value), the strand
(either + or -) and the score (a real value). Here is an example:

A1      Spcs1   X       53      85      +       123
B1      Spcs1   X       458     498     +       11
C1      Spcs1   X       3601    3639    +       434
B1      Spcs1   X       5480    5520    +       1
D1      Spcs1   X       6479    6510    +       41
A       Spcs1   Y       1379    4410    +       1567
E       Spcs1   Y       5879    5910    +       311
E       Spcs1   Y       6479    6510    +       217
D       Spcs1   Y       6567    6593    +       135
D       Spcs1   Y       7567    7593    +       14
D       Spcs1   Y       8567    8593    +       617
A       Spcs1   Y       9863    9893    +       133
C       Spcs1   Y       10187   10218   +       714
A1      Spcs2   X       53      85      +       17
B1      Spcs2   X       458     498     +       13
C1      Spcs2   X       3601    3639    +       13
B1      Spcs2   X       5480    5520    +       881
D1      Spcs2   X       6479    6510    +       16
A       Spcs2   Y       4379    4410    +       51
E       Spcs2   Y       5879    5910    +       17
E       Spcs2   Y       6479    6510    +       13
D       Spcs2   Y       6567    6593    +       41
D       Spcs2   Y       7567    7593    +       14
D       Spcs2   Y       8567    8593    +       18
A       Spcs2   Y       9863    9893    +       19
C       Spcs2   Y       10187   10218   +       31

At the moment, the strand is ignored but it might be used in a future
version.