/GraphAligner

Primary LanguageC++MIT LicenseMIT

GraphAligner

Seed-and-extend program for aligning long error-prone reads to genome graphs. To cite, see https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02157-2. For a description of the bitvector alignment extension algorithm, see https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz162/5372677

Installation

Install via bioconda:

Compilation

Bioconda is the recommended installation method. If you however want to compile GraphAligner yourself, run these:

Note that miniconda is only required during compilation and not during runtime. After compilation you can run the binary without the miniconda environment or copy the binary elsewhere.

If you want to compile without miniconda, you will need to install boost, protobuf and protoc, sdsl, jemalloc and sparsehash.

Running

Quickstart: GraphAligner -g test/graph.gfa -f test/read.fa -a test/aln.gaf -x vg

See Parameters, the option GraphAligner --help and the subsections below for more information and options

Test example

The command above outputs the following alignment in test/aln.gaf:

read 71 0 71 + >1>2>4 87 3 73 67 72 60 NM:i:5 AS:f:56.3 dv:f:0.0694444 id:f:0.930556 cg:Z:4=1X2=1I38=1D5=1I5=1X13=

which aligned the read to the nodes 1,2,4 with an identity of 93%. See GAF format for more information about the output format. Alternatively try -a aln.gam for output compatible with vg.

The parameter -x vg uses a parameter preset for aligning reads to a variation graph. Other options are -x dbg for aligning to a de Bruijn graph.

File formats

GraphAligner's file formats are interoperable with vg's file formats. Graphs can be inputed either in .gfa format or .vg format. Reads are inputed as .fasta or .fastq, either gzipped or uncompressed. Alignments are outputed in GAF format or vg's alignment format, either as a binary .gam or JSON depending on the file name. Custom seeds can be inputed in .gam format.

Seed hits

GraphAligner has three built-in methods for finding seed hits: minimizers (default), maximal unique matches (MUMs) and maximal exact matches (MEMs). Only matches entirely within a node are found. Minimizers (default) are faster and MUM/MEMs can be more sensitive. MUM/MEM modes use MEMfinder to find matches between the read and nodes. Use the parameter --seeds-mum-count n to use the n longest MUMs as seeds (or -1 for all MUMs), and --seeds-mem-count n for the n longest MEMs (or -1 for all MEMs). Use --seeds-mxm-length n to only use matches at least n characters long. If you are aligning multiple files to the same graph, use --seeds-mxm-cache-prefix file_name_prefix to store the MUM/MEM index to disk for reuse instead of rebuilding it each time.

Alternatively you can use any method to find seed hits and then import the seeds in .gam format with the parameter -s seedfile.gam. The seeds must be passed as an alignment message, with path.mapping[0].position describing the position in the graph, name the name of the read and query_position the position in the forward strand of the read. Match length (path.mapping[0].edit[0].from_length) is only used to order the seeds, with longer matches tried before shorter matches.

Alternatively you can use the parameter --seeds-first-full-rows to use the dynamic programming alignment algorithm on the entire first row instead of using seeded alignment. This is very slow except on tiny graphs, and not recommended.

Extension

GraphAligner uses a bitvector banded DP alignment algorithm to extend the seed hits. The DP matrix is calculated inside a certain area (the band), which depends on the extension parameters. Note that "bandwidth" in graph alignment does NOT directly correspond to bandwidth in linear alignment. The bandwidth parameter describes the maximum allowed score difference between the minimum score in a row and a cell, with cells whose score is higher than that falling outside the band. Bandwidth higher than 35 is not recommended for complex graphs due to huge increases in runtime but might work for variation graphs.

The algorithm starts using the initial bandwidth. Should it detect that the alignment is incorrect, it will rewind and rerun with the ramp bandwidth parameter, aligning high-error parts of the read without slowing down alignment in low-error parts. The tangle effort parameter determines how much time GraphAligner spends inside complex cyclic subgraphs. If the size of the band grows beyond the tangle effort parameter, GraphAligner will use the current best alignment for the aligned prefix and move forward along the read. This might miss the optimal alignment.

Parameters

  • -g input graph. Format .gfa / .vg
  • -f input reads. Format .fasta / .fastq / .fasta.gz / .fastq.gz. You can input multiple files with -f file1 -f file2 ... or -f file1 file2 ...
  • -t number of aligner threads. The program also uses two IO threads in addition to these.
  • -a output file name. Format .gam or .json
  • -x parameter preset. Use -x vg for aligning to variation graphs and other simple graphs, and -x dbg for aligning to de Bruijn graphs.

All parameters below are optional.

  • --precise-clipping use arg as the identity threshold for a valid alignment. Recommended to be less than the accuracy of the reads, for example 0.75 for ONT, 0.9 for HiFi, 0.95 for assembly-to-assembly.
  • --min-alignment-score discard alignments whose score is less than this.
  • --multimap-score-fraction alignment score fraction for including secondary alignments. Alignments whose alignment score is less than arg as a fraction of the best scoring overlapping alignment per read are discarded. Lower values include more poor secondary alignments and higher values less.

Seeding:

  • --seeds-minimizer-density For a read of length n, use the arg * n most unique seeds
  • --seeds-minimizer-length k-mer size for minimizer seeds
  • --seeds-minimizer-windowsize Window size for minimizer seeds
  • --seeds-mum-count MUM seeds. Use the n longest maximal unique matches. -1 for all MUMs
  • --seeds-mem-count MEM seeds. Use the n longest maximal exact matches. -1 for all MEMs
  • --seeds-mxm-length MUM/MEM minimum length. Don't use MUMs/MEMs shorter than n
  • --seeds-mxm-cache-prefix MUM/MEM file cache prefix. Store the MUM/MEM index into disk for reuse. Recommended unless you are sure you won't align to the same graph multiple times

Extension:

  • -b alignment bandwidth. Unlike in linear alignment, this is the score difference between the minimum score in a row and the score where a cell falls out of the band. Values recommended to be between 1-35.
  • -C tangle effort. Determines how much effort GraphAligner spends on tangled areas. Higher values use more CPU and memory and have a higher chance of aligning through tangles. Lower values are faster but might return an inoptimal or a partial alignment. Use for complex graphs (eg. de Bruijn graphs of mammalian genomes) to limit the runtime in difficult areas. Values recommended to be between 1'000 - 500'000.