Multiple sequence alignment tool that builds on the concept of seed-and-extend algorithms

Primary language: C++

MEMSA - A MEM Extracting Multiple Sequence Aligner

MEMSA is a multiple sequence alignment (MSA) tool that identifies maximal exact matches (MEMs) before applying a traditional MSA algorithm, in order to speed up the alignment process. It was developed to investigate the effects of this heuristic on computation time and alignment accuracy, and it demonstrates that the preprocessing step can have a positive impact on the alignment of genomic sequences.
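To make the MEM concept concrete, here is a minimal brute-force sketch that enumerates maximal exact matches between two sequences. It is only an illustration: MEMSA delegates MEM extraction to slaMEM, which uses a far more efficient index-based approach, and the names `Mem` and `findMems` are hypothetical and do not come from the MEMSA source.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One exact match: ref[refPos, refPos + len) equals query[queryPos, queryPos + len).
// Hypothetical type for illustration only.
struct Mem {
    std::size_t refPos;
    std::size_t queryPos;
    std::size_t len;
};

// Brute-force enumeration of maximal exact matches of at least minLen characters.
// A match is maximal when it cannot be extended to the left or to the right.
std::vector<Mem> findMems(const std::string& ref, const std::string& query,
                          std::size_t minLen) {
    std::vector<Mem> mems;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        for (std::size_t j = 0; j < query.size(); ++j) {
            if (ref[i] != query[j]) continue;
            // Skip starts that could be extended to the left: they belong to a
            // longer match that is reported from an earlier start position.
            if (i > 0 && j > 0 && ref[i - 1] == query[j - 1]) continue;
            // Extend to the right as far as the characters agree.
            std::size_t len = 0;
            while (i + len < ref.size() && j + len < query.size() &&
                   ref[i + len] == query[j + len]) {
                ++len;
            }
            if (len >= minLen) mems.push_back({i, j, len});
        }
    }
    return mems;
}
```

For example, `findMems("GGACGTAA", "TTACGTCC", 4)` reports the single match `ACGT` at position 2 in both sequences, while a minimum length of 5 yields no match at all.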

Requirements

This tool was developed for macOS and Linux. To build it, the gcc compiler needs to be installed.

Manual

To run the tool, put a single sequence in the reference FASTA file and all other sequences to be aligned in the input FASTA file. The reference sequence can be picked arbitrarily from the dataset, as the choice of the reference does not affect the alignment. The generated alignment is written to the output file.
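As an illustration of the expected input layout, the following sketch reads FASTA records from a stream. It is not taken from the MEMSA source; `FastaRecord`, `readFasta` and `readFastaText` are hypothetical names, and the parser handles only the basic format (a header line starting with `>` followed by one or more sequence lines).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for illustration only.
struct FastaRecord {
    std::string header;    // text after the leading '>'
    std::string sequence;  // sequence lines concatenated
};

// Reads all records from a FASTA stream.
std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back({line.substr(1), ""});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
        // Sequence lines that appear before the first header are ignored.
    }
    return records;
}

// Convenience wrapper for in-memory text.
std::vector<FastaRecord> readFastaText(const std::string& text) {
    std::istringstream in(text);
    return readFasta(in);
}
```

With such a reader, a reference file would be expected to yield exactly one record, and the input file one record for each remaining sequence.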

Install

./install.sh

The installation script downloads the required dependencies slaMEM and MAFFT and builds an executable from the source code.

Usage

./memsa (<options>)

To run MEMSA on the provided example files with the default parameters, run ./memsa without any options.

Options:

  • -s : minimum seed length (default=20)
  • -g : maximum merge gap (default=1)
  • -r : reference file name (default="reference.fa")
  • -i : input file name (default="input.fa")
  • -o : output file name (default="alignment.fa")
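The option set above maps naturally onto POSIX `getopt`, which is available on both macOS and Linux. The sketch below shows one way such parsing could look; it is an assumption for illustration and not the actual MEMSA implementation. The `Options` struct and `parseOptions` function are hypothetical names; only the flags and default values are taken from the list above.

```cpp
#include <cstdlib>
#include <string>
#include <unistd.h>  // POSIX getopt, optarg

// Hypothetical container for the documented options and their defaults.
struct Options {
    int minSeedLength = 20;                      // -s
    int maxMergeGap = 1;                         // -g
    std::string referenceFile = "reference.fa";  // -r
    std::string inputFile = "input.fa";          // -i
    std::string outputFile = "alignment.fa";     // -o
};

// Fills opts from the command line; options that are not given keep their defaults.
// Returns false on an unknown option or a missing option argument.
bool parseOptions(int argc, char* const argv[], Options& opts) {
    int c;
    while ((c = getopt(argc, argv, "s:g:r:i:o:")) != -1) {
        switch (c) {
            // std::atoi performs no validation; a real tool should check the input.
            case 's': opts.minSeedLength = std::atoi(optarg); break;
            case 'g': opts.maxMergeGap = std::atoi(optarg); break;
            case 'r': opts.referenceFile = optarg; break;
            case 'i': opts.inputFile = optarg; break;
            case 'o': opts.outputFile = optarg; break;
            default: return false;
        }
    }
    return true;
}
```

Because every field carries its documented default, calling the program without options leaves `Options` in the state described above.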

Examples:

./memsa -s 5 -g 0
./memsa -r ref.fa -i sequences.fa -o result.fa

For the (extremely simple) example files provided, a minimum seed length -s of 5 to 8 yields exactly one seed that is common to all sequences. For smaller seed lengths, the candidate seeds are not consistent (duplicated and/or out of order), whereas for larger seed lengths, no seed that is present in all sequences is found.
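The notion of consistency mentioned above can be sketched as a check on the seeds found between the reference and one other sequence. The exact criteria MEMSA applies may differ; this sketch assumes that "duplicate" means the same reference position is matched more than once and that "out of order" means the seeds do not appear in the same order in both sequences. `Seed` and `isConsistent` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical seed type: a match starting at refPos in the reference
// and at queryPos in another sequence.
struct Seed {
    std::size_t refPos;
    std::size_t queryPos;
    std::size_t len;
};

// Returns true if no reference position is matched twice and the seeds occur
// in the same order in the reference and in the query sequence.
bool isConsistent(std::vector<Seed> seeds) {
    // Order seeds by their position in the reference.
    std::sort(seeds.begin(), seeds.end(), [](const Seed& a, const Seed& b) {
        if (a.refPos != b.refPos) return a.refPos < b.refPos;
        return a.queryPos < b.queryPos;
    });
    for (std::size_t k = 1; k < seeds.size(); ++k) {
        // Duplicate: the same reference seed matches several query positions.
        if (seeds[k].refPos == seeds[k - 1].refPos) return false;
        // Out of order: query positions must increase along the reference.
        if (seeds[k].queryPos <= seeds[k - 1].queryPos) return false;
    }
    return true;
}
```

Short seeds tend to fail such a check because short substrings recur by chance, which matches the behaviour described for seed lengths below 5 in the example files.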