
Cosmo is a fast, low-memory DNA assembler using a Succinct (variable order) de Bruijn Graph.

Primary language: C++. License: GNU General Public License v3.0 (GPL-3.0).

                                                              88                                 o8o  
                                                             .8'                                 `"'  
 .ooooo.   .ooooo.   .oooo.o ooo. .oo.  .oo.    .ooooo.     .8'  oooo    ooo  .oooo.   oooo d8b oooo  
d88' `"Y8 d88' `88b d88(  "8 `888P"Y88bP"Y88b  d88' `88b   .8'    `88.  .8'  `P  )88b  `888""8P `888  
888       888   888 `"Y88b.   888   888   888  888   888  .8'      `88..8'    .oP"888   888      888  
888   .o8 888   888 o.  )88b  888   888   888  888   888 .8'        `888'    d8(  888   888      888  
`Y8bod8P' `Y8bod8P' 8""888P' o888o o888o o888o `Y8bod8P' 88          `8'     `Y888""8o d888b    o888o 
                                                                                                  
                                                                                            ver 0.5.1

Cosmo

Version: 0.5.1

Cosmo is a fast, low-memory DNA assembler that uses a succinct de Bruijn graph.

VARI, a succinct colored de Bruijn graph, can be found in the VARI branch.

Usage

After compiling, you can run Cosmo like so:

$ pack-edges <input_file> # this adds reverse complements and dummy edges, and packs them
$ cosmo-build <input_file>.packed # compresses and builds indices
$ cosmo-assemble <input_file>.packed.dbg # output: <input_file>.packed.dbg.fasta # NOT IMPLEMENTED YET

Here, input_file is the binary output of a DSK run. Each program has a --help option that describes its usage in more detail.
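The first step above adds reverse complements to the k-mer set. As a rough illustration of what that means for a single k-mer, here is a minimal C++ sketch. The function names `complement` and `reverse_complement` are hypothetical and are not taken from the Cosmo sources, and the sketch works on plain strings rather than on the packed binary format that pack-edges actually reads.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Complement of a single DNA base (upper-case A, C, G, T only).
inline char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  throw std::invalid_argument("unexpected base");
    }
}

// Reverse complement: reverse the k-mer, then complement every base.
inline std::string reverse_complement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}
```

For instance, `reverse_complement("AAC")` yields `"GTT"`, so a read set containing AAC would also contribute GTT to the graph.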

Caveats

Here are some things that you don't want to let surprise you:

DSK Only

Currently, Cosmo only supports DSK files with k <= 64 (that is, k-mers stored in blocks of 128 bits or fewer). Support is planned for DSK files with larger k, and possibly for output from other k-mer counters.
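The link between k <= 64 and 128-bit blocks follows from storing each base in 2 bits: 64 bases need 2 * 64 = 128 bits, or two 64-bit words. The sketch below illustrates that arithmetic. The names `words_for_k`, `base_code`, and `pack_kmer` are hypothetical, and the A=0, C=1, G=2, T=3 mapping and bit order are assumptions made for illustration; the encoding DSK actually writes may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Number of 64-bit words needed to hold a k-mer at 2 bits per base.
inline std::size_t words_for_k(std::size_t k) {
    return (2 * k + 63) / 64;
}

// Hypothetical 2-bit code; the mapping actually used by DSK may differ.
inline std::uint64_t base_code(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  throw std::invalid_argument("unexpected base");
    }
}

// Pack a k-mer into 64-bit words: 32 bases per word, first base in the low bits.
inline std::vector<std::uint64_t> pack_kmer(const std::string& kmer) {
    std::vector<std::uint64_t> words(words_for_k(kmer.size()), 0);
    for (std::size_t i = 0; i < kmer.size(); ++i) {
        words[i / 32] |= base_code(kmer[i]) << (2 * (i % 32));
    }
    return words;
}
```

With this layout, `words_for_k(64)` is 2, so any k-mer with k <= 64 fits in a 128-bit block, while k = 65 would need a third word.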

Definition of "k-mer"

Note that since our graph is edge-based, k defines the length of our edges, so our nodes are only k-1 symbols long. If you want to construct a succinct de Bruijn graph where the nodes are k-mers, you will need to run DSK with k set to k+1. For example, using output from $ dsk <input_file> 27 will actually build a 26-dimensional de Bruijn graph (nodes of 26 symbols).
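To make the edge/node relationship concrete, the following sketch splits one k-mer (an edge) into its two (k-1)-symbol nodes. The helper name `edge_to_nodes` is hypothetical and the code operates on plain strings; it only illustrates the definition and is not how Cosmo represents the graph internally.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Split a k-mer (an edge) into its source and target nodes.
// Each node is k-1 symbols long: the source is the k-mer without its
// last symbol, the target is the k-mer without its first symbol.
inline std::pair<std::string, std::string> edge_to_nodes(const std::string& kmer) {
    if (kmer.size() < 2) {
        throw std::invalid_argument("a k-mer edge needs k >= 2");
    }
    return std::make_pair(kmer.substr(0, kmer.size() - 1), kmer.substr(1));
}
```

For the 4-mer `"ACGT"`, this gives the source node `"ACG"` and the target node `"CGT"`, so running DSK with k = 4 produces a graph whose nodes are 3-mers.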

Note: Both even and odd k values should work with this assembler due to our loop-immune traversal.

Most de Bruijn graph based assemblers add edges between all nodes that overlap. Instead, we take the k-mers as our edges (each joining two nodes of length k-1), so we only have edges that were directly represented in the read set. This makes more sense to us, as it reduces unnecessary branching. I may add support for the standard approach in the future if anyone wants it (it would be similar to the dummy-edge-adding code).
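The difference between the two conventions can be seen on a toy k-mer set. The sketch below builds both edge sets from the same input: one edge per observed k-mer, versus an edge wherever two nodes overlap by k-2 symbols. All names (`Edge`, `nodes_of`, `observed_edges`, `overlap_edges`) are hypothetical, the code assumes every k-mer has the same length k >= 2, and the quadratic loop is for illustration only.

```cpp
#include <set>
#include <string>
#include <utility>

using Edge = std::pair<std::string, std::string>;  // (source node, target node)

// Collect the (k-1)-mer nodes touched by a set of k-mers.
inline std::set<std::string> nodes_of(const std::set<std::string>& kmers) {
    std::set<std::string> nodes;
    for (const auto& kmer : kmers) {
        nodes.insert(kmer.substr(0, kmer.size() - 1));
        nodes.insert(kmer.substr(1));
    }
    return nodes;
}

// Edge-based convention: exactly one edge per observed k-mer.
inline std::set<Edge> observed_edges(const std::set<std::string>& kmers) {
    std::set<Edge> edges;
    for (const auto& kmer : kmers) {
        edges.insert(std::make_pair(kmer.substr(0, kmer.size() - 1), kmer.substr(1)));
    }
    return edges;
}

// Overlap convention: an edge u -> v whenever the last k-2 symbols of u
// equal the first k-2 symbols of v, whether or not that k-mer was observed.
inline std::set<Edge> overlap_edges(const std::set<std::string>& kmers) {
    std::set<Edge> edges;
    const std::set<std::string> nodes = nodes_of(kmers);
    for (const auto& u : nodes) {
        for (const auto& v : nodes) {
            if (u.substr(1) == v.substr(0, v.size() - 1)) {
                edges.insert(std::make_pair(u, v));
            }
        }
    }
    return edges;
}
```

For the 3-mers ACG, CGT, TCG, and CGA, the edge-based graph has 4 edges, while the overlap graph has 6: it also links GT to TC and GA to AC, even though GTC and GAC never appeared in the input.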

Graph Traversal

We currently only output the unitigs (paths between branching nodes).
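As an illustration of what a unitig is, here is a small sketch that enumerates non-branching paths on an ordinary adjacency map. It is not Cosmo's traversal, which works on the succinct representation; the `Graph` alias and the functions `in_degrees` and `unitigs` are hypothetical. Paths start at every node that is not "one edge in, one edge out" and are extended through nodes that are, so isolated cycles with no branching node are not reported by this simplified version.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;  // node -> successors

// In-degree of every node that appears in the graph (as a source or a successor).
inline std::map<std::string, std::size_t> in_degrees(const Graph& g) {
    std::map<std::string, std::size_t> indeg;
    for (const auto& entry : g) {
        indeg[entry.first];  // make sure pure sources are present with degree 0
        for (const auto& succ : entry.second) ++indeg[succ];
    }
    return indeg;
}

// Enumerate unitigs as lists of nodes.
inline std::vector<std::vector<std::string>> unitigs(const Graph& g) {
    const std::map<std::string, std::size_t> indeg = in_degrees(g);
    auto out_degree = [&](const std::string& v) -> std::size_t {
        auto it = g.find(v);
        return it == g.end() ? 0 : it->second.size();
    };
    auto is_simple = [&](const std::string& v) -> bool {
        return indeg.at(v) == 1 && out_degree(v) == 1;
    };

    std::vector<std::vector<std::string>> paths;
    for (const auto& entry : g) {
        if (is_simple(entry.first)) continue;  // start only at branching or terminal nodes
        for (const auto& first : entry.second) {
            std::vector<std::string> path{entry.first, first};
            // Extend through nodes with exactly one way in and one way out.
            while (is_simple(path.back())) {
                path.push_back(g.at(path.back()).front());
            }
            paths.push_back(path);
        }
    }
    return paths;
}
```

On the graph TA -> AC -> CG with CG branching to GT and GA, this reports three unitigs: TA, AC, CG; CG, GT; and CG, GA.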

Compilation

There is an included Makefile - just type make to build it (assuming you have the dependencies listed below). To build with "Variable order mode", use the varord=1 flag.

Dependencies

  • A compiler that supports C++11,
  • Boost - ranges and range algorithms, zip iterator, tuple comparison, lots of good stuff,
  • SDSL-lite - low level succinct data structures (For now you will have to use my branch if you want to use variable order graphs: clone this and checkout the develop branch before compiling),
  • TClap - command line parsing,
  • DSK - k-mer counting (we need this for input),
  • Optionally (for developers): Python and NumPy - rebuilding the lookup tables,
  • STXXL - external merging (not actually required yet though)

Many of these are installable with a package manager (e.g. (apt-get | yum | brew ) install boost libstxxl tclap). However, you will have to download and build DSK and SDSL-lite manually.

Authors

Implemented by Alex Bowe. Original concept and prototype by Kunihiko Sadakane.

These people also proved incredibly helpful: Rayan Chikhi, Simon Puglisi, Travis Gagie, Christina Boucher, Simon Gog, Dominik Kempa.

Contributing

Your help is more than welcome! Please fork and send a pull request, or contact me directly :)

Why "Cosmo"?

Cosmos /ˈkɑz.moʊs/ (n) : "An ordered, harmonious whole.".

If that doesn't suit an assembly program then I don't know what does. The last s was dropped because it was nicer to say. Furthermore, it is a reference to the Seinfeld character Cosmo Kramer (whose last name I'm often reminded of while working on this stuff).

License

This software is copyright (c) Alex Bowe 2024 (alex at alexbowe dot com). It is released under the GNU General Public License (GPL) version 3.