Primary language: C++. License: Apache License 2.0.

GLnexus

From DNAnexus R&D: scalable gVCF merging and joint variant calling for population sequencing projects. (GL, genotype likelihood)

In our manuscript with collaborators at Regeneron Genetics Center and Baylor College of Medicine, we detail the design of GLnexus and scientific validation using up to 240,000 human exomes and 22,600 genomes. Compared to the DNAnexus cloud-native deployment used for such large projects, this open-source version produces identical scientific results but lacks some of the scalability and production-oriented features.

The Getting Started wiki page has a tutorial for first-time users.

For each tagged revision, the Releases page has a static executable suitable for most Linux x86-64 hosts; just download it and chmod +x glnexus_cli.

Build & test

The GLnexus build process has a number of dependencies, but produces a standalone, statically linked executable, glnexus_cli. The easiest way to build it is to use our Dockerfile to control all the compile-time dependencies, then simply copy the static executable out of the resulting Docker container and put it anywhere you like.

# Build GLnexus using its Dockerfile.
# You can set a specific git revision by adding --build-arg=git_revision=xxxx
curl -s https://raw.githubusercontent.com/dnanexus-rnd/GLnexus/master/Dockerfile \
    | docker build --no-cache -t glnexus_tests -

# Run GLnexus unit tests.
docker run --rm glnexus_tests

# Copy the static GLnexus executable to the current working directory.
docker run --rm -v "$(pwd)":/io glnexus_tests cp glnexus_cli /io

# Run it to see its usage message.
./glnexus_cli

To build GLnexus without Docker, make sure you have gcc 5+, CMake 3.2+, and all the dependencies indicated in the Dockerfile.

Then,

git clone --recursive https://github.com/dnanexus-rnd/GLnexus.git
cd GLnexus
cmake -Dtest=ON . && make -j$(nproc) && ctest -V

After the build, you will also find ./glnexus_cli in this directory.

Coding conventions

  • C++14 - take advantage of the goodies
  • Use smart pointers to avoid passing resources needing manual deallocation across function/class boundaries
  • Prefer references over pointers when the target should never be null and never change
  • Avoid exceptions; prefer returning a Status, defined early in types.h
      • N.B. the frequently used convenience macro S() is defined just below Status
  • Avoid public constructors with nontrivial bodies; prefer static initializer function returning Status
  • Avoid elaborate templated class hierarchies
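
The conventions above can be illustrated with a minimal sketch. Everything in it is hypothetical and written only for illustration: the Status class here is a small stand-in rather than the one defined in types.h, RETURN_IF_BAD is an assumed early-return helper in the spirit of S() (whose actual definition is not reproduced here), and SampleSet and TotalSamples are invented names.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a Status type; GLnexus defines its own in types.h.
class Status {
public:
    static Status OK() { return Status(true, ""); }
    static Status Invalid(std::string msg) { return Status(false, std::move(msg)); }
    bool ok() const { return ok_; }
    bool bad() const { return !ok_; }
    const std::string& message() const { return msg_; }

private:
    Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
    bool ok_;
    std::string msg_;
};

// Assumed early-return helper (not the real S() macro). It expects a local
// variable `s` of type Status to be in scope.
#define RETURN_IF_BAD(expr) do { s = (expr); if (s.bad()) return s; } while (0)

class SampleSet {
public:
    // Static initializer: all validation happens here and failures are
    // reported through Status, so the constructor body stays trivial.
    static Status Open(const std::string& name, int sample_count,
                       std::unique_ptr<SampleSet>& ans) {
        if (name.empty()) return Status::Invalid("empty name");
        if (sample_count <= 0) return Status::Invalid("sample_count must be positive");
        // The smart pointer takes ownership; callers never delete manually.
        ans.reset(new SampleSet(name, sample_count));
        return Status::OK();
    }
    int sample_count() const { return sample_count_; }

private:
    SampleSet(std::string name, int n) : name_(std::move(name)), sample_count_(n) {}
    std::string name_;
    int sample_count_;
};

// The output parameter is a reference because it must always refer to a
// valid int; errors propagate as Status values instead of exceptions.
Status TotalSamples(int first_count, int second_count, int& total) {
    Status s = Status::OK();
    std::unique_ptr<SampleSet> first, second;
    RETURN_IF_BAD(SampleSet::Open("first", first_count, first));
    RETURN_IF_BAD(SampleSet::Open("second", second_count, second));
    total = first->sample_count() + second->sample_count();
    return Status::OK();
}
```

In this sketch, calling TotalSamples(2, 3, total) yields an OK status and sets total to 5, while a non-positive count returns a bad Status and leaves total untouched.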

Libraries used

Performance profiling

The Performance wiki page has practical advice for deploying GLnexus on a powerful server.

The code has some hooks for performance profiling using perf and FlameGraph.

To profile performance within the DNAnexus applet, run the applet as usual and add -i perf=true. This produces an output file genotype.stacks containing sampling observation counts for common call stacks. To generate an SVG visualization with FlameGraph:

git clone https://github.com/brendangregg/FlameGraph
FlameGraph/flamegraph.pl < genotype.stacks > genotype.svg