A pairwise sequence aligner using A*, extending the seed heuristic from AStarix.
Co-authored by @pesho-ivanov and @RagnarGrootKoerkamp.
- Links
-
- Twitter: @curious_coding, @peshotrie,
- Blog: https://research.curiouscoding.nl
- Citation
- Ragnar Groot Koerkamp, Pesho Ivanov. “Exact pairwise alignment using A* with seed heuristic and match pruning”. bioRxiv (2022). DOI 2022.09.19.508631
- Usage
-
cargo run --release -- -i <input>.{seq,txt,fasta} [-a csh] [-s]
Edit distance is shown in the
ed
column. The Cigar string is computed but not written to a file currently (#15). - Visualizations
-
Show a video of a small alignment:
cargo run --release --features sdl2 -- -i <input> -v [--style detailed]
Save a large alignment to disk:
cargo run --release --features sdl2 -- -i <input> --save --save-path <filename>.bmp --style large
- Rust/C library usage
- This is TODO (#14).
- Feedback/Issues
- Any kind of feedback and questions on the paper and code is welcome in the issues here. We’re happy to help out with any kind of problems you may run into.
- Disclaimer
-
While our method works well on random sequences with random errors, it can be
very slow on alignments containing long indels. Over $100×$ slower than
WFA happens regularly. Use the unpublished
--dt --gap-cost
flags for possible speed improvements in this case. If A*PA is still slow, feel free to make an issue containing a.seq
file and visualization. (It won’t speed up existing code, but may inspire new ideas.)
Here’s an alignment of two sequences of length
- Install rustup.
- Install Rust Nightly using
rustup install nightly
- Make nightly the default:
rustup default nightly
- Clone the repository.
- Run
cargo run -- -h
for the help below - Check out
cargo run -- --help
for more flags and features not in the paper.
USAGE:
astar-pairwise-aligner [OPTIONS] <--input <INPUT>|--length <LENGTH>>
OPTIONS:
-h, --help Print help information
-o, --output <OUTPUT> Where to write optional statistics
-s, --silent Print less. Pass twice for summary line only
INPUT:
-i, --input <INPUT> The .seq, .txt, or Fasta file with sequence pairs to align
-x, --cnt <CNT> The number of sequence pairs to generate [default: 1]
-n, --length <LENGTH> Length of generated sequences
-e, --error-rate <ERROR_RATE> Input error rate
--seed <SEED> Seed to initialize RNG for reproducability
ALGORITHM:
-a, --algorithm <ALGORITHM> [default: astar] [possible values: nw, dt, astar, nw-lib, nw-lib-simd, edlib, wfa, biwfa]
HEURISTIC:
-H, --heuristic <HEURISTIC> [default: sh] [possible values: dijkstra, sh, csh]
-k <k> Seed length [default: 15]
-r <r> Seed potential [default: 2]
VISUALIZER:
-v, --visualize [<WHEN>...] Run the interactive visualizer. See --help for controls. [default: all] [possible values: none, last, all, layers]
--style <STYLE> Visualizer style [default: default] [possible values: default, large, detailed]
--save [<WHEN>...] Which frames to save [possible values: none, last, all, layers]
--save-path <PATH> Where to save
Input can be a file (--input file.{fasta,txt,seq}
) or randomly generated (--length <n> --error-rate <e>
).
The heuristic can be changed between seed heuristic (-H sh
) and chaining
seed heuristic (-H csh
).
Parameters -r
and -k
can be specified if non-default values are needed.
- There is no option yet output alignments. Edit distances are shown
in the terminal output and
--output
file in theed
column towards the end.I’ll add it eventually, but make an issue if needed; it’s not much work.
- There is no standalone library/API yet. I’ll add something soon (#1).
To align all consecutive pairs in a file:
cargo run --release -- -i <path/to/file.{fasta,fa,txt,seq}>
To run on
cargo run --release -- -x 100 -n 100000 -e 0.05
To generate a .seq
dataset:
cargo run --release --bin generate_dataset -- -x 1000 -n 10000 -e 0.05 /tmp/random-sequences.seq
- Pass
--dt
to run diagonal-transition based A*. This can give up to$5$ times speedup. - Pass
--gap-cost
to improve the chaining seed heuristic with gap costs. This improves runtime when the alignment contains long indels. - Pass
--kmin <kmin>
,--kmax <kmax>
, and--max-matches <num>
to use variable length seeds with at most the given number of matches and between the given lengths. - Pass
--skip-prune <N>
to skip pruning everyN
‘th match that would otherwise be pruned. This can speed up pruning when there are a lot of matches.
The results in our preprint are entirely reproducible using the makefile:
- Synthetic data (
#evals-sythetic
tag) -
- Clone WFA and Edlib repos using
make wfa
andmake edlib
from this directory. - Run the synthetic evals (evals/Snakefile,
5h
) usingmake evals
to write data tables to evals/table/. Data used in the paper is already committed. - Write plots to evals/results/ using
make results
or evals/results.py directly. An interactive notebook is at evals/evals.ipynb.
- Clone WFA and Edlib repos using
- Human data (
#evals-human
tag) -
- Download the datasets from the sequence data release and unzip the
files to
evals/human/{chm13,na12878}/seq<id>.seq
.The CHM13 set was created using the steps in evals/human/Snakefile.
- Run
make evals-human
(1-2h
) andmake results
.
- Download the datasets from the sequence data release and unzip the
files to
Visualizations require the sdl2
feature flag to be enabled, either via
cargo run --features sdl2
or by enabling then as default in Cargo.toml.
Reimplementations of the following algorithms can be visualized:
- Needleman-Wunsch (
nw
) -
- Pass
--exp-search
to enable exponential search on band, as in Edlib.
- Pass
- Diagonal-Transition (
dt
) -
- Pass
--dc
to enable divide & conquer, as in BiWFA.
- Pass
- A* (
astar
) -
- Choose heuristic with
-H {dijkstra,sh,csh}
. - Pass parameters with
-r {1,2} -k <k>
. - Pass
--dt
to enable diagonal transition based A*. - Pass
--gap-cost
for an improved version of the chaining seed heuristic.
- Choose heuristic with
Visualizer options:
-v {all,last,layers}
- Visualize either all frames, only the last frame, or one frame per layer (ie NW column, DT wavefront, or A* value of $f$).
--save {all,last,layers}
-
Save the corresponding frame as
.bmp
in the directory given by--save-path
. These can be manually turned into gifs. --style {default,detailed,large}
- Choose between different visualizer presets.
detailed
includes the heuristic, andlarge
scales things down a factor100
to render long alignments. (Use-v last
in this case.) --cell-size <size>
- Override the cell size (number of pixels per cell).
--downscaler <scale>
- Override the number of states drawn per cell.
Sample videos corresponding to figure 1 of the paper are below. Due to different visualization strategies (per layer, per cell) timings are not at all comparable.
Dijkstra | Ukkonen’s exponential search (Edlib) |
Diagonal transition (WFA) | DT + Divide & Conquer (BiWFA) |
A* with CSH and pruning (A*PA) |
And here is a video of figure 3 of our preprint:
- Tests
-
Code is tested for correctness in various tests (tests/, src/aligners/tests/)
against library implementation of edit distance, and against Edlib and WFA
when the
edlib
andwfa
feature flags are set. - Benchmarks
- All code is benchmarked on GitHub Actions CI. Performance history of benchmarks is here.
MPL-2.0