This project aims to improve part of HAlign 2.0, a tool for fast and accurate alignment of multiple nucleotide sequences.
HAlign is a cross-platform program that performs multiple sequence alignment based on the center star strategy. Here, we present two major updates in HAlign 3 that improve time efficiency and alignment quality: 1) the suffix tree data structure is adapted specifically to nucleotide sequences: the left-child right-sibling representation is replaced by a K-ary tree, which makes common substring searches faster at a small cost in memory usage; 2) a global substring selection algorithm combining directed acyclic graphs with dynamic programming is adopted to screen out unsatisfactory common substrings. These improvements make HAlign 3 a specialized program for aligning ultra-large numbers of similar DNA/RNA sequences, such as closely related viral or prokaryotic genomes. HAlign 3 can be easily installed via Anaconda or the Java release package on macOS, Linux, Windows Subsystem for Linux, and Windows systems, and the source code is available on GitHub (https://github.com/malabz/HAlign-3).
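To illustrate the idea behind the global substring selection step, here is a minimal sketch of one common way to combine a directed acyclic graph with dynamic programming. It is not HAlign 3's actual code. The function name `select_chain` and the input format (one common substring per line, given as start in the center sequence, start in the other sequence, and length) are assumptions made for this example. An edge links two substrings when the first ends before the second begins in both sequences, and the highest-scoring path by total length is kept.

```shell
# select_chain: read "startA startB length" triples from stdin and print the
# collinear, non-overlapping subset with the largest total length.
# Hypothetical helper for illustration only.
select_chain() {
  # Sorting by startA gives a topological order of the implicit DAG.
  sort -k1,1n -k2,2n | awk '
    { a[NR] = $1; b[NR] = $2; len[NR] = $3 }
    END {
      best = 0
      for (j = 1; j <= NR; j++) {
        score[j] = len[j]; prev[j] = 0
        for (i = 1; i < j; i++)
          # Edge i -> j: block i ends before block j starts in both sequences.
          if (a[i] + len[i] <= a[j] && b[i] + len[i] <= b[j] &&
              score[i] + len[j] > score[j]) {
            score[j] = score[i] + len[j]; prev[j] = i
          }
        if (best == 0 || score[j] > score[best]) best = j
      }
      # Walk back from the best end point, then print in forward order.
      n = 0
      for (k = best; k != 0; k = prev[k]) path[++n] = k
      for (k = n; k >= 1; k--) print a[path[k]], b[path[k]], len[path[k]]
    }'
}
```

For example, `printf '0 0 5\n3 10 4\n6 6 3\n10 12 4\n' | select_chain` drops the second block, which conflicts with the order of the others. The quadratic inner loop keeps the sketch short; a production implementation would likely use a faster scheme.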
1. Install WSL for Windows. Instructional video 1 or 2 (Copyright belongs to the original work).
2. Download and install Anaconda. Download Anaconda versions for different systems from here. Instructional video of Anaconda installation 1 or 2 (Copyright belongs to the original work).
3. Install HAlign 3.
#1 Activate one of your conda environments
conda activate base
#2 Add channels to conda
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels malab
#3 Install the required package openjdk=11 for running halign
conda install -c conda-forge openjdk=11
#4 Install halign
conda install -c malab halign
#5 Test halign
halign -h
halign [options] <infile>
positional argument:
infile nucleotide sequences in fasta format
optional arguments:
-o <filename> output aligned file, with option (-s) on, sequence identifiers will not be outputted
-t <integer> multi-thread, with a default setting of half of the cores available
-c <integer> centre sequence index (0-based), (default: the longest sequence)
-Xmx<size> set maximum Java heap size, such as "-Xmx512g" used for the alignment of 1 million SARS-CoV-2 sequences; the default maximum Java heap size varies on different machine, which can be checked by command "java -XX:+PrintFlagsFinal -version | grep MaxHeapSize"
-s output alignments without sequence identifiers, i.e. in plain txt format but with sequence order retained, (default: off)
-h produce help message and exit
-v produce version message and exit
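The default maximum heap size mentioned under `-Xmx` is reported in bytes, which is hard to read at a glance. The sketch below converts it to GiB. The function name `max_heap_gib` is an assumption made for this example, and the parsing assumes the usual `PrintFlagsFinal` layout (type, flag name, `=` or `:=`, value).

```shell
# max_heap_gib: read "java -XX:+PrintFlagsFinal -version" output on stdin
# and print the MaxHeapSize value in GiB. Hypothetical helper.
max_heap_gib() {
  awk '$2 == "MaxHeapSize" {
    for (i = 3; i <= NF; i++)
      if ($i == "=" || $i == ":=") {
        # The field after the assignment sign is the size in bytes.
        printf "%.2f GiB\n", $(i + 1) / (1024 * 1024 * 1024)
        exit
      }
  }'
}
```

Usage: `java -XX:+PrintFlagsFinal -version 2>/dev/null | max_heap_gib`. Matching the flag name exactly avoids picking up similarly named flags that `grep MaxHeapSize` may also return on some JDK versions.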
1. Download test data.
2. Run HAlign 3.
Align the mt_genome.fasta dataset with HAlign 3 using the following settings: 5 threads for parallel alignment, the 7th sequence as the center sequence, and output with sequence identifiers.
# Download dataset
conda install wget # for retrieving files using HTTP, HTTPS and FTP
wget http://lab.malab.cn/%7Etfr/HAlign3_testdata/mt_genome.tar.xz
# Uncompress dataset
tar -Jxf mt_genome.tar.xz
# Run halign
halign -o mt_genome.fasta.aln -t 5 -c 6 mt_genome.fasta
# Check alignment result
conda install seqkit # for manipulating FASTA/Q files
seqkit stat mt_genome.fasta mt_genome.fasta.aln
# file                 format  type  num_seqs     sum_len  min_len   avg_len  max_len
# mt_genome.fasta      FASTA   DNA        672  11,134,166   16,555  16,568.7   16,578
# mt_genome.fasta.aln  FASTA   DNA        672  11,172,000   16,625    16,625   16,625
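In the `seqkit stat` summary above, the aligned file has identical `min_len` and `max_len`, which is what a valid alignment should show. If seqkit is not available, the same check can be sketched with awk. The function name `check_alignment` is an assumption made for this example.

```shell
# check_alignment: succeed if every sequence in a FASTA file has the same
# length (gaps included), fail otherwise. Hypothetical helper.
check_alignment() {
  awk '
    # Count the sequence just finished and remember its length.
    function record() {
      if (seen) { total++; if (!(n in widths)) { widths[n] = 1; distinct++ } }
    }
    /^>/ { record(); seen = 1; n = 0; next }
    { sub(/\r$/, ""); n += length($0) }   # tolerate wrapped lines and CRLF
    END {
      record()
      if (distinct == 1) {
        for (w in widths) print total " sequences, width " w
        exit 0
      }
      print "found " (distinct + 0) " distinct lengths"
      exit 1
    }' "$1"
}
```

Usage: `check_alignment mt_genome.fasta.aln`. If the result matches the summary above, this should report 672 sequences with a width of 16625.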
1. Download and install JDK 11 for your system from here.
2. Download HAlign 3 from releases.
java -X[options] -jar HAlign-3.0.0_rc1.jar [options] <infile>
positional argument:
infile nucleotide sequences in fasta format
optional arguments:
-o <filename> output aligned file, with option (-s) on, sequence identifiers will not be outputted
-t <integer> multi-thread, with a default setting of half of the cores available
-c <integer> centre sequence index (0-based), (default: the longest sequence)
-Xmx<size> set maximum Java heap size, such as "-Xmx512g" used for the alignment of 1 million SARS-CoV-2 sequences; the default maximum Java heap size varies on different machine, which can be checked by command "java -XX:+PrintFlagsFinal -version | grep MaxHeapSize"
-s output alignments without sequence identifiers, i.e. in plain txt format but with sequence order retained, (default: off)
-h produce help message and exit
-v produce version message and exit
1. Download test data. On Windows, uncompress the dataset with WinRAR.
2. Run HAlign 3.
Align the mt_genome.fasta dataset with HAlign 3 using the following settings: 5 threads for parallel alignment, the 7th sequence as the center sequence, and output without sequence identifiers.
# Download dataset
wget http://lab.malab.cn/%7Etfr/HAlign3_testdata/mt_genome.tar.xz
# Uncompress dataset
tar -Jxf mt_genome.tar.xz
# Run halign
java -jar HAlign-3.0.0_rc1.jar -o mt_genome.fasta.aln -t 5 -c 6 -s mt_genome.fasta
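Because `-s` drops the identifiers but retains the sequence order, the identifiers can in principle be restored from the input file afterwards. The sketch below assumes that the `-s` output holds exactly one aligned sequence per line, which the option description does not confirm. The function name `reattach_ids` is also an assumption made for this example.

```shell
# reattach_ids: interleave the header lines of the original FASTA with the
# lines of an identifier-free alignment. Hypothetical helper.
# $1: original FASTA file, $2: alignment written with -s
reattach_ids() {
  # paste with a newline delimiter alternates one line from each input.
  grep '^>' "$1" | paste -d '\n' - "$2"
}
```

Usage: `reattach_ids mt_genome.fasta mt_genome.fasta.aln > mt_genome.with_ids.aln`. It is worth confirming that both inputs contain the same number of records before relying on the result.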
HAlign 3 is free software, licensed under the MIT License.
HAlign 3 is supported by ZOU's Lab. If you have any questions or suggestions, please feel free to contact us on the issues page. You are also welcome to send a copy to Furong.TANG@uestc.edu.cn or Furong.TANG@hotmail.com so that we can answer you as soon as possible.
If you use this software, please cite:
HAlign 3: Fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences. Molecular Biology and Evolution (2022)
- 28-07-2020
pairwise-alignment scoring rules adjusted
- 11-07-2020
optimizations
- 16-12-2019
refactoring and a new feature added
- 15-12-2019
SuffixTree.java rewritten to improve performance
- 20-10-2019
initialisation
- JDK 11
- Maven