APA: A Pairwise Aligner

A pairwise sequence aligner using A*, extending the seed heuristic from AStarix.

Co-authored by @pesho-ivanov and @RagnarGrootKoerkamp.

Links

Twitter: @curious_coding, @peshotrie,
Blog: https://research.curiouscoding.nl

Citation

Ragnar Groot Koerkamp, Pesho Ivanov. “Exact pairwise alignment using A* with seed heuristic and match pruning”. bioRxiv (2022). DOI 2022.09.19.508631

Usage

cargo run --release -- -i <input>.{seq,txt,fasta} [-a csh] [-s]

Edit distance is shown in the ed column. The Cigar string is computed but not written to a file currently (#15).

Visualizations

Show a video of a small alignment:

cargo run --release --features sdl2 -- -i <input> -v [--style detailed]

Save a large alignment to disk:

cargo run --release --features sdl2 -- -i <input> --save --save-path <filename>.bmp --style large

Rust/C library usage

This is TODO (#14).

Feedback/Issues

Any kind of feedback and questions on the paper and code is welcome in the issues here. We’re happy to help out with any kind of problems you may run into.

Disclaimer

While our method works well on random sequences with random errors, it can be very slow on alignments containing long indels. Over $100×$ slower than WFA happens regularly. Use the unpublished --dt --gap-cost flags for possible speed improvements in this case. If A*PA is still slow, feel free to make an issue containing a .seq file and visualization. (It won’t speed up existing code, but may inspire new ideas.)

Here’s an alignment of two sequences of length $500$ with $30\%$ error rate using the chaining seed heuristic with pruning:

Getting started

Installing Rust

Install rustup.
Install Rust Nightly using rustup install nightly
Make nightly the default: rustup default nightly

Usage

Clone the repository.
Run cargo run -- -h for the help below
Check out cargo run -- --help for more flags and features not in the paper.

USAGE:
    astar-pairwise-aligner [OPTIONS] <--input <INPUT>|--length <LENGTH>>

OPTIONS:
    -h, --help               Print help information
    -o, --output <OUTPUT>    Where to write optional statistics
    -s, --silent             Print less. Pass twice for summary line only

INPUT:
    -i, --input <INPUT>              The .seq, .txt, or Fasta file with sequence pairs to align
    -x, --cnt <CNT>                  The number of sequence pairs to generate [default: 1]
    -n, --length <LENGTH>            Length of generated sequences
    -e, --error-rate <ERROR_RATE>    Input error rate
        --seed <SEED>                Seed to initialize RNG for reproducability

ALGORITHM:
    -a, --algorithm <ALGORITHM>    [default: astar] [possible values: nw, dt, astar, nw-lib, nw-lib-simd, edlib, wfa, biwfa]

HEURISTIC:
    -H, --heuristic <HEURISTIC>    [default: sh] [possible values: dijkstra, sh, csh]
    -k <k>                         Seed length [default: 15]
    -r <r>                         Seed potential [default: 2]

VISUALIZER:
    -v, --visualize [<WHEN>...]    Run the interactive visualizer. See --help for controls. [default: all] [possible values: none, last, all, layers]
        --style <STYLE>            Visualizer style [default: default] [possible values: default, large, detailed]
        --save [<WHEN>...]         Which frames to save [possible values: none, last, all, layers]
        --save-path <PATH>         Where to save

Input can be a file (--input file.{fasta,txt,seq}) or randomly generated (--length <n> --error-rate <e>).

The heuristic can be changed between seed heuristic (-H sh) and chaining seed heuristic (-H csh). Parameters -r and -k can be specified if non-default values are needed.

Notes

There is no option yet output alignments. Edit distances are shown in the terminal output and --output file in the ed column towards the end.
I’ll add it eventually, but make an issue if needed; it’s not much work.
There is no standalone library/API yet. I’ll add something soon (#1).

Examples

To align all consecutive pairs in a file:

cargo run --release -- -i <path/to/file.{fasta,fa,txt,seq}>

To run on $100$ random sequences of length $10^5$ with error rate $5\%$:

cargo run --release -- -x 100 -n 100000 -e 0.05

To generate a .seq dataset:

cargo run --release --bin generate_dataset -- -x 1000 -n 10000 -e 0.05 /tmp/random-sequences.seq

Work-in-progress features

Pass --dt to run diagonal-transition based A*. This can give up to $5$ times speedup.
Pass --gap-cost to improve the chaining seed heuristic with gap costs. This improves runtime when the alignment contains long indels.
Pass --kmin <kmin>, --kmax <kmax>, and --max-matches <num> to use variable length seeds with at most the given number of matches and between the given lengths.
Pass --skip-prune <N> to skip pruning every N‘th match that would otherwise be pruned. This can speed up pruning when there are a lot of matches.

Evals

The results in our preprint are entirely reproducible using the makefile:

Synthetic data (#evals-sythetic tag)

Clone WFA and Edlib repos using make wfa and make edlib from this directory.
Run the synthetic evals (evals/Snakefile, 5h) using make evals to write data tables to evals/table/. Data used in the paper is already committed.
Write plots to evals/results/ using make results or evals/results.py directly. An interactive notebook is at evals/evals.ipynb.

Human data (#evals-human tag)

Download the datasets from the sequence data release and unzip the files to evals/human/{chm13,na12878}/seq<id>.seq.
The CHM13 set was created using the steps in evals/human/Snakefile.
Run make evals-human (1-2h) and make results.

Visualizations

Visualizations require the sdl2 feature flag to be enabled, either via cargo run --features sdl2 or by enabling then as default in Cargo.toml.

Reimplementations of the following algorithms can be visualized:

Needleman-Wunsch (nw)

Pass --exp-search to enable exponential search on band, as in Edlib.

Diagonal-Transition (dt)

Pass --dc to enable divide & conquer, as in BiWFA.

A* (astar)

Choose heuristic with -H {dijkstra,sh,csh}.
Pass parameters with -r {1,2} -k <k>.
Pass --dt to enable diagonal transition based A*.
Pass --gap-cost for an improved version of the chaining seed heuristic.

Visualizer options:

-v {all,last,layers}: Visualize either all frames, only the last frame, or one frame per layer (ie NW column, DT wavefront, or A* value of $f$).
--save {all,last,layers}: Save the corresponding frame as .bmp in the directory given by --save-path. These can be manually turned into gifs.
--style {default,detailed,large}: Choose between different visualizer presets. detailed includes the heuristic, and large scales things down a factor 100 to render long alignments. (Use -v last in this case.)
--cell-size <size>: Override the cell size (number of pixels per cell).
--downscaler <scale>: Override the number of states drawn per cell.

Sample videos corresponding to figure 1 of the paper are below. Due to different visualization strategies (per layer, per cell) timings are not at all comparable.

Dijkstra	Ukkonen’s exponential search (Edlib)
Diagonal transition (WFA)	DT + Divide & Conquer (BiWFA)
A* with CSH and pruning (A*PA)

And here is a video of figure 3 of our preprint:

Tests & Benchmarks

Tests: Code is tested for correctness in various tests (tests/, src/aligners/tests/) against library implementation of edit distance, and against Edlib and WFA when the edlib and wfa feature flags are set.
Benchmarks: All code is benchmarked on GitHub Actions CI. Performance history of benchmarks is here.

License

MPL-2.0

wheaton5/astar-pairwise-aligner

A*PA: A* Pairwise Aligner