Edist

Julia implementation of the Needleman-Wunsch pairwise sequence alignment algorithm, along with Hirschberg's space-efficient divide-and-conquer variant of the algorithm and a heuristic implementation that approximates the alignment score of two sequences.
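For readers new to the algorithm, the following is a minimal sketch of the Needleman-Wunsch score recurrence. It is not the code in src/Full.jl: the function name nw_score and the scoring values (0 for a match, -1 for a mismatch, -1 per gap) are assumptions chosen for illustration only.

```julia
# Illustrative sketch only; nw_score and its default scores are hypothetical,
# not part of the Edist API.
function nw_score(a::AbstractString, b::AbstractString;
                  match_score::Int = 0, mismatch_score::Int = -1, gap_score::Int = -1)
    s, t = collect(a), collect(b)
    n, m = length(s), length(t)

    # Row 0 of the DP table: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap_score for j in 0:m]
    curr = similar(prev)

    for i in 1:n
        curr[1] = i * gap_score                  # prefix of `a` against empty `b`
        for j in 1:m
            diag = prev[j] + (s[i] == t[j] ? match_score : mismatch_score)
            up   = prev[j + 1] + gap_score       # s[i] aligned to a gap
            left = curr[j] + gap_score           # t[j] aligned to a gap
            curr[j + 1] = max(diag, up, left)
        end
        prev, curr = curr, prev                  # reuse the two row buffers
    end
    return prev[m + 1]
end
```

With these assumed scores, nw_score("CACTAG", "ATCA") evaluates to -4. Only two rows are kept, so the score needs memory linear in the length of the second sequence; recovering the alignment itself requires either the full table or a divide-and-conquer scheme such as Hirschberg's.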

Source Structure

The project can be loaded into the Julia environment by running

julia --project=.

inside the project root directory. The source code can then be precompiled and brought into the global namespace with

using Edist
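The same setup can also be done from a Julia session that is already running. This is a sketch that assumes the session's working directory is the project root and that the dependencies recorded in Manifest.toml have not been installed yet.

```julia
using Pkg

Pkg.activate(".")      # use the Project.toml in the current directory
Pkg.instantiate()      # install the dependencies recorded in Manifest.toml

using Edist            # load (and precompile, if needed) the package
```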

The project has three modules, Full, Hirschberg, and Bounded, corresponding to the full dynamic programming implementation, the Hirschberg divide-and-conquer implementation, and the spatially bounded heuristic, respectively. These internal implementations can mostly be ignored unless specific parameters need tuning.
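One common way to bound the dynamic program spatially is to fill in only the cells within a fixed distance of the main diagonal. The sketch below illustrates that idea; whether Bounded works exactly this way is an assumption, and the name banded_score, the band parameter, and the scoring values are hypothetical.

```julia
# Illustrative banded-DP sketch; not the code in src/Bounded.jl.
function banded_score(a::AbstractString, b::AbstractString; band::Int = 2,
                      match_score::Int = 0, mismatch_score::Int = -1, gap_score::Int = -1)
    s, t = collect(a), collect(b)
    n, m = length(s), length(t)

    # Widen the band if needed so that the final cell (n, m) lies inside it.
    k = max(band, abs(n - m))
    NEG = typemin(Int) ÷ 2            # marks cells outside the band as unreachable

    prev = fill(NEG, m + 1)
    for j in 0:min(m, k)
        prev[j + 1] = j * gap_score
    end
    curr = fill(NEG, m + 1)

    for i in 1:n
        # For clarity the whole row is reset; a tuned version would store only the band.
        fill!(curr, NEG)
        if i <= k
            curr[1] = i * gap_score
        end
        # Only cells with |i - j| <= k are computed.
        for j in max(1, i - k):min(m, i + k)
            diag = prev[j] + (s[i] == t[j] ? match_score : mismatch_score)
            up   = prev[j + 1] + gap_score
            left = curr[j] + gap_score
            curr[j + 1] = max(diag, up, left)
        end
        prev, curr = curr, prev
    end
    return prev[m + 1]
end
```

The result is a lower bound on the unrestricted score: banded_score("ABCDEFG", "BCDEFGA"; band = 0) gives -7 because only the main diagonal is explored, while band = 1 already recovers the optimal -2 under the assumed scores.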

The main functionality is exposed through the align and score functions, which wrap the submodules so that alignment and scoring can be called in an implementation-agnostic way. Both take a module name as the first argument, followed by two strings, and return the alignment or score produced by the implementation that the module name specifies, e.g.

julia> align(Bounded, "CACTAG", "ATCA")
(score = -4, seq_alignment = "CACTAG", query_alignment = "-A-TCA", memory_used = 376)

score works the same way but returns only the score.
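The printed form of the align result above looks like a Julia named tuple, so its fields can presumably be read by name. The sketch below assumes that return type and, to stay runnable without the package, rebuilds the value by hand from the example output instead of calling align.

```julia
# Value copied from the example output above; assumed to be a NamedTuple.
result = (score = -4, seq_alignment = "CACTAG",
          query_alignment = "-A-TCA", memory_used = 376)

println("score: ", result.score)
println(result.seq_alignment)
println(result.query_alignment)

# Count gap characters in each aligned string.
seq_gaps   = count(==('-'), result.seq_alignment)
query_gaps = count(==('-'), result.query_alignment)

# Count aligned columns where both characters are present but differ.
mismatches = count(((x, y),) -> x != '-' && y != '-' && x != y,
                   zip(result.seq_alignment, result.query_alignment))

println("gaps: ", seq_gaps + query_gaps, ", mismatches: ", mismatches)
```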

Directory Structure

.
├── data
│   ├── graphics
│   └── TP53_cross_species.fasta
├── docs
├── Manifest.toml
├── nbs
│   └── Analysis.ipynb
├── Project.toml
├── README.md
├── src
│   ├── Bounded.jl
│   ├── Edist.jl
│   ├── Full.jl
│   └── Hirschberg.jl
└── test
  • data contains any data sources for the code, in this case a FASTA file containing coding sequences for the TP53 protein across species
  • docs contains LaTeX source and/or PDF slide decks and papers documenting the research and its presentation
  • nbs contains Jupyter notebooks for analysis of the project
  • src contains the project source code