Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Syngraph

A toolkit for evolutionary analyses of linkage groups

Dependencies

Dependencies are best installed via conda:

$ conda install -c conda-forge networkx pandas docopt tqdm ete3 pygraphviz

Usage

Usage: syngraph <module> [<args>...] [-D -V -h]

  [Modules]
    build               Build graph from orthology data (e.g. BUSCO *.full_table.tsv)
    infer               Model rearrangements over a tree
    tabulate            Get table of extant and ancestral genomes
    viz                 Visualise graph/data [Under development]
    
  [Options]
    -h, --help          Show this screen.
    -D, --debug         Print debug information [TBI]
    -v, --version       Show version

  [Dependencies] 
    ---------------------------------------------------------------------------------------------
    | $ conda install -c conda-forge networkx=2.4 pandas docopt tqdm ete3 pygraphviz matplotlib |
    ---------------------------------------------------------------------------------------------

Build a syngraph from BUSCO data, allowing for missingness

syngraph build -d directory_of_tsv_files -m -o test

Model fissions and fusions over a tree, recording rearrangements using taxon_1 as a reference

syngraph infer -g test.pickle -t newick.txt -r 2 -s taxon_1 -o test

Model translocations, fissions and fusions over a tree

syngraph infer -g test.pickle -t newick.txt -r 3 -s taxon_1 -o test

Tabulate extant and inferred genomes

syngraph tabulate -g test.with_ancestors.pickle -o test

Input data

Input data should only contain markers from chromosome-scale sequences, because unscaffolded contigs will cause excess fission events to be inferred.

If using BUSCO data, tsv files should be named My_taxon.*.tsv, where My_taxon is also a leaf in the newick tree. Each row should contain the BUSCO ID, sequence, start position, and end position. These correspond to the Busco_id, Sequence, Gene_Start, and Gene_End columns of the *full_table.tsv file generated by BUSCO and can be extracted from it. E.g.:

0at7088 HG995313.1      5723272 5863707
1at7088 HG995286.1      19966914        20084934
2at7088 HG995296.1      11128843        11215510
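As a rough sketch of the extraction step, the following Python function pulls the four columns out of the lines of a BUSCO full table. It assumes the usual tab-separated layout in which the first five columns are Busco id, Status, Sequence, Gene Start, and Gene End, with comment lines starting with `#`. Keeping only rows whose status is `Complete` is an assumption of this sketch, not a syngraph requirement, and the function names and the status, strand, score, and length values in the example are made up for illustration.

```python
def extract_markers(lines, keep_status=("Complete",)):
    """Return (busco_id, sequence, start, end) tuples from BUSCO full_table lines.

    Assumes tab-separated rows: Busco id, Status, Sequence, Gene Start, Gene End, ...
    Rows with fewer than five columns (e.g. Missing BUSCOs) are skipped.
    """
    markers = []
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Filtering on status is an assumption of this sketch.
        if len(fields) < 5 or fields[1] not in keep_status:
            continue
        busco_id, _status, sequence, start, end = fields[:5]
        markers.append((busco_id, sequence, int(start), int(end)))
    return markers


def format_markers(markers):
    """Format markers as tab-separated rows: id, sequence, start, end."""
    return "".join(f"{b}\t{s}\t{a}\t{e}\n" for b, s, a, e in markers)


# Illustrative input; only the IDs, sequences, and coordinates of the
# Complete rows come from the example above.
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\tStrand\tScore\tLength\n",
    "0at7088\tComplete\tHG995313.1\t5723272\t5863707\t+\t1000.0\t500\n",
    "3at7088\tMissing\n",
    "2at7088\tComplete\tHG995296.1\t11128843\t11215510\t-\t900.0\t400\n",
]
print(format_markers(extract_markers(example)), end="")
```

In practice the lines would come from an open `*full_table.tsv` file and the formatted rows would be written to `My_taxon.<something>.tsv`.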

Inferring rearrangements

After building a syngraph, inter-chromosomal rearrangements can be inferred with syngraph infer. This requires a newick tree relating the taxa in the analysis. Syngraph uses the branch lengths, but they only influence how the tree is traversed, so approximate branch lengths are fine.
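Since the taxon names in the tsv file names must also be leaves of the tree, it can be worth checking that the two agree before running syngraph infer. The sketch below is one way to do that with the standard library only. It assumes unquoted leaf labels in the newick string and taxon names without dots (so that the taxon is everything before the first `.` in the file name); the function names and the example tree are hypothetical.

```python
import re
from pathlib import Path


def newick_leaf_names(newick):
    """Collect leaf labels: names that directly follow '(' or ','.

    Simplification: assumes unquoted labels; internal node labels
    (which follow ')') are ignored.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def taxa_from_filenames(filenames):
    """Take the part of each tsv file name before the first dot as the taxon."""
    return {Path(name).name.split(".")[0] for name in filenames if name.endswith(".tsv")}


def compare_taxa(newick, filenames):
    """Return (taxa with files but no leaf, leaves with no file), both sorted."""
    leaves = newick_leaf_names(newick)
    taxa = taxa_from_filenames(filenames)
    return sorted(taxa - leaves), sorted(leaves - taxa)


# Hypothetical tree and file names.
tree = "((Brenthis_ino:1,taxon_1:1)n1:1,taxon_2:2);"
files = ["data/Brenthis_ino.busco.tsv", "data/taxon_1.busco.tsv", "data/taxon_3.busco.tsv"]
missing_from_tree, missing_files = compare_taxa(tree, files)
print("not in tree:", missing_from_tree)
print("no tsv file:", missing_files)
```

ete3, which is already a dependency, offers a more robust newick parser if the tree uses quoted labels or other features this regular expression does not handle.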

The -r option sets the inference mode: 2 for fissions and fusions, or 3 for fissions, fusions, and reciprocal translocations (mode 3 is currently experimental).

In syngraph infer, the -m option sets the minimum number of markers that can be involved in a rearrangement. With -m 1, a rearrangement is reported even when a single marker 'moves' between chromosomes. By contrast, higher values, e.g. -m 100, mean that chromosome fissions or sets of complex rearrangements involving fewer markers will be missed. A reasonable starting point is -m 5, although this may need to be adjusted given the density of markers, the size of chromosomes, and the accuracy of marker orthology.
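To build intuition for the effect of a minimum-marker threshold, here is a toy illustration. It is not syngraph's inference algorithm: it only counts how many markers each pair of chromosomes shares between two genomes and drops pairs supported by fewer than a chosen number of markers. The function names and the marker and chromosome labels are hypothetical.

```python
from collections import Counter


def shared_marker_counts(genome_a, genome_b):
    """Count markers shared by each (chromosome in A, chromosome in B) pair.

    Each genome is a dict mapping marker ID -> chromosome name.
    """
    counts = Counter()
    for marker, chrom_a in genome_a.items():
        chrom_b = genome_b.get(marker)
        if chrom_b is not None:
            counts[(chrom_a, chrom_b)] += 1
    return counts


def supported_links(genome_a, genome_b, min_markers):
    """Keep only chromosome pairs supported by at least min_markers markers."""
    counts = shared_marker_counts(genome_a, genome_b)
    return {pair: n for pair, n in counts.items() if n >= min_markers}


# Seven markers on one chromosome in genome A; in genome B one of them
# sits on a different chromosome (perhaps an orthology error).
genome_a = {f"m{i}": "A1" for i in range(1, 8)}
genome_b = {**{f"m{i}": "B1" for i in range(1, 7)}, "m7": "B2"}

print(supported_links(genome_a, genome_b, min_markers=1))
print(supported_links(genome_a, genome_b, min_markers=5))
```

With a threshold of 1, the single stray marker links A1 to a second chromosome and would look like a rearrangement; with a threshold of 5 that link is ignored. The same filtering would also hide a genuine event that happens to involve fewer than five markers, which is the trade-off described above.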

The most useful output file is *.rearrangements.tsv, which lists the rearrangements inferred over the tree. The branch of the tree where a rearrangement happened is denoted by its parent and child nodes. The event is reported as fission, fusion, or translocation. Multiplicity is the number of events. This is normally 1, but can be more if a chromosome has fissioned into multiple fragments. The last column, ref_seqs, shows which chromosomes are involved in the rearrangement, named with respect to an extant genome, an inferred ancestral genome, or a predefined list of marker-to-chromosome relationships.

#parent child   event   multiplicity    ref_seqs
n7      Brenthis_ino    fusion  1       [['n5_2', 'n5_17'], ['n5_20']]
n5      n7      fusion  1       [['n5_6'], ['n5_19']]
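For downstream summaries, a table like this can be read with a few lines of Python. The sketch below assumes the file is tab-separated with the five columns shown above and that ref_seqs is written as a Python-style nested list, as in the example rows; the function names are hypothetical.

```python
import ast
from collections import Counter


def read_rearrangements(lines):
    """Parse rows of a *.rearrangements.tsv file into dictionaries."""
    events = []
    for line in lines:
        # Skip the '#parent ...' header and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        parent, child, event, multiplicity, ref_seqs = line.rstrip("\n").split("\t")
        events.append({
            "parent": parent,
            "child": child,
            "event": event,
            "multiplicity": int(multiplicity),
            # ref_seqs looks like a Python literal, so parse it safely.
            "ref_seqs": ast.literal_eval(ref_seqs),
        })
    return events


def count_events(events):
    """Sum multiplicities per event type (fission/fusion/translocation)."""
    totals = Counter()
    for e in events:
        totals[e["event"]] += e["multiplicity"]
    return totals


example = [
    "#parent\tchild\tevent\tmultiplicity\tref_seqs\n",
    "n7\tBrenthis_ino\tfusion\t1\t[['n5_2', 'n5_17'], ['n5_20']]\n",
    "n5\tn7\tfusion\t1\t[['n5_6'], ['n5_19']]\n",
]
events = read_rearrangements(example)
print(count_events(events))
```

The same list of dictionaries can be grouped by (parent, child) to count events per branch.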

Help

Syngraph is still under active development. Please open an issue if you have any questions about running the software or interpreting your results.