/seqwish

alignment to variation graph inducer

Primary LanguageC++MIT LicenseMIT

seqwish

build and test install with bioconda

These sequences wish they were squished into a graph.

a variation graph inducer

seqwish implements a lossless conversion from pairwise alignments between sequences to a variation graph encoding the sequences and their alignments. As input we typically take all-versus-all alignments, but the exact structure of the alignment set may be defined in an application specific way. This algorithm uses a series of disk-backed sorts and passes over the alignment and sequence inputs to allow the graph to be constructed from very large inputs that are commonly encountered when working with large numbers of noisy input sequences. Memory usage during construction and traversal is limited by the use of sorted disk-backed arrays and succinct rank/select dictionaries to record a queryable version of the graph.

Citation:

Erik Garrison, Andrea Guarracino, Unbiased pangenome graphs, Bioinformatics, Volume 39, Issue 1, January 2023, btac743, https://doi.org/10.1093/bioinformatics/btac743

squish graph induction algorithm

As input we have Q, which is a concatenation of the sequences from which we will build the graph. We build an compressed suffix array (CSA) mapping sequence names to offsets in Q, and also the inverse using a rank/select dictionary on a bitvector marking the starts of sequences in Q. This allows us to map between positions in the sequences of Q, which is the format in which alignment algorithms typically express alignments, and positions in Q itself, which is the coordinate space we will use as a basis for the generation of our graph. To relate the sequences in Q to each other we apply a function map to generate alignments A. Although these alignments tend to be represented using oriented interval pairs in Q, for simplicity and robustness to graph complexity, we describe A as a vector of pairs of bidirectional positions (sequence offsets and strands) b in Q , such that A = [(bq, br), ... ]. We sort A by the first member (bq) of each pair, ensuring that the entries in A are ordered according to their order in Q.

To query the induced graph we build a rank/select dictionary allowing efficient traversal of A, based on a bit vector Abv of the same length as A such that we record a 1 at those positions which correspond to the first instance of a given bq and record a 0 in Abv otherwise. We record which bq we have processed in the bitvector Qseen which is of the same length as Q. This allows us to avoid a quadratic penalty in the order of the size of the transitive closures in Q given by the map function.

Now we inductively derive the graph implied by the alignments. For each base bq in Q, we find its transitive closure cq := [bq, br1, ... ] via the map operation by traversing the aligned base pairs recorded in A. We write the character of the base bq to a vector S, then for each bc in cq we record a pair [si, bc] into N and its reverse, [bc, si] into P. We mark Qseen for each base in each emitted cluster, and we do not consider marked bases in subsequent transitive closures. By sorting N and P by their first entries, we can build rank/select dictionaries on them akin to that we built on A that allow random access by graph base (as given in S) or input base (as given in Q).

To fully induce the variation graph we need to establish the links between bases in S that would be required for us to find any sequence in the input as a walk through the graph. We do so by rewriting Q (in both the forward and reverse orientation) in terms of pairs of bases in S, then sorting the resulting pairs by their first element, which yields L = [(ba, bb), ... ]. These pairs record the links and their frequencies, which we can emit or filter (such as by frequency) as needed given particular applications. In typical use we take the graph to be given by the unique elements of L.

Our data model encodes the graph using single-base nodes, but often downstream use requires identifying nodes and thus we benefit from compressing the unitigs of the graph into single nodes, which reduces memory used by identifiers in analysis. We can compress the node space of the graph by traversing S, and for each base querying the inbound links. Maintaining a bitvector Sid of length equal to S we mark each base at which we see any link other than one from or to the previous base on the forward or reverse strand, or at bases where we have no incoming links. By building a rank/select dictionary on Sid we can assign a smaller set of node ids to the sequence space of the graph.

Given the id space encoded by Sid we can materialize the graph in a variety of interchange formats, or provide id-based interfaces to the indexed squish graph. To generate graphs in .vg or GFAv1 format, we want to decompose the graph into its nodes (S), edges (L) and paths (P). The nodes are given by S and Sid, and similarly we project L and P through Sid to obtain a compressed variation graph.

using our graph

At present we rest on existing tools, such as vg, GCSA2, GBWT, and other GFAv1-compatible tools to support downstream uses for the graph. We observe that it will be trivial to use this data model as the basis for various graph traversal and modification algorithms. Implementation of these features will follow. We plan to link into the handle graph model developed in the vg project, which should allow the application of any algorithms defined using that interface on the backing graph developed here. Users familiar with concepts in assembly graphs will notice many similarities between the graph induction algorithm and typical steps in assembly algorithms, and we intend to explore these as well using this platform.

building

dependencies

You'll need basic C++ build tools, cmake, and zlib. On Ubuntu (and probably Debian) systems, these can be installed with:

sudo apt install build-essential cmake zlib1g-dev libjemalloc-dev

On Arch Linux, the jemalloc dependency can be installed with:

sudo pacman -S jemalloc     # arch linux

build process

seqwish uses cmake to build itself and its dependencies.

git clone --recursive https://github.com/ekg/seqwish.git
cd seqwish
cmake -H. -Bbuild && cmake --build build -- -j 3

To clean up simply remove build/ and bin/:

rm -rf build bin

A static build can be obtained by setting a flag in the cmake build setup.

cmake -DBUILD_STATIC=1 -H. -Bbuild && cmake --build build -- -j 3

You'll need to set this flag to 0 or remove and rebuild your build directory if you want to unset this behavior. Static builds are unlikely to be supported on OSX, and require appropriate static libraries on linux.

clang

If you want to use clang, be sure to install the correct version of OpenMP. For example, if you have clang version 14, you have to install libomp-14-dev:

sudo apt -y install libomp-14-dev

To build seqwish with clang, execute:

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ && cmake --build build -- -j 3

Notes for distribution and ARM64 systems

If you need machine-specific optimizations, use -DEXTRA_FLAGS to specify your architecture.

For example, on Linux ARM64 systems, do:

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Generic -DEXTRA_FLAGS='-march=armv8-a' && cmake --build build -- -j 3

For Docker image distribution, -march=haswell works decently:

cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Generic -DEXTRA_FLAGS='-march=haswell' && cmake --build build -- -j 3

Docker

Alternatively, you may build a Docker image that contains seqwish.

docker build -t seqwish .

Bioconda

seqwish recipes for Bioconda are available at https://bioconda.github.io/recipes/seqwish/README.html. To install the latest version using Conda execute:

conda install -c bioconda seqwish

Guix

installing via the guix-genomics git repository

First, clone the guix-genomics repository:

git clone https://github.com/ekg/guix-genomics

And install the seqwish package to your default GUIX environment:

GUIX_PACKAGE_PATH=. guix package -i seqwish

Now seqwish is available as a global binary installation.

installing via the guix-genomics channel

Add the following to your ~/.config/guix/channels.scm:

  (cons*
(channel
  (name 'guix-genomics)
  (url "https://github.com/ekg/guix-genomics.git")
  (branch "master"))
%default-channels)

First, pull all the packages, then install seqwish to your default GUIX environment:

guix pull
guix package -i seqwish

If you want to build an environment only consisting of the seqwish binary, you can do:

guix environment --ad-hoc seqwish

For more details about how to handle Guix channels, go to https://git.genenetwork.org/guix-bioinformatics/guix-bioinformatics.git.

usage

seqwish supports PAF format output of several sequence aligners, like wfmash and minimap2. It requires the CIGAR string of the alignment to be provided in the cg:z: optional field. It uses large temporary files during the construction. By default, these are prefixed with the output GFA file name, but this can be changed with the -b[base], --base=[base] command line argument. The input sequences can be in FASTA or FASTQ format, either in plain text or gzipped. It writes GFA1 on its standard output.

wfmash

single PAF file

wfmash x.fa x.fa -X > x.paf
seqwish -s x.fa -p x.paf -g x.gfa

multiple PAF file

wfmash c.fa a.fa > a.paf
wfmash c.fa b.fa > b.paf
cat a.fa b.fa c.fa > abc.fa
seqwish -s abc.fa -p a.paf,b.paf -g abc.gfa

minimap2

minimap2 does not emit the CIGAR string in PAF output by default. To do this, specify the -c flag:

minimap2 x.fa x.fa -c -X > x.paf
seqwish -s x.fa -p x.paf -g x.gfa

TODO

  • describe algorithm
  • implement rank/select dictionary class based on disk-backed radix sort of a binary array
  • implement squish graph induction algorithm
  • explore extensions via graph rewriting, and the handle graph concept
  • explore assembly problems via graph filtering and cleaning operations