

Primary language: C++. License: MIT.

ENANO FASTQ

An encoder for nanopore FASTQ files

Description

ENANO is a FASTQ lossless compression algorithm especially designed for nanopore sequencing FASTQ files. We tested ENANO and current state-of-the-art compressors on several publicly available nanopore datasets. The results show that our algorithm consistently achieves the best compression performance on every nanopore dataset, while being computationally efficient in terms of speed and memory requirements when compared to existing alternatives.

Install with Conda

To install directly from source, follow the instructions in the next section.

ENANO is available through conda on the bioconda channel. See this page for conda installation instructions. Once conda is installed, run the following to install enano:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install enano

Note that if enano is installed this way, it should be invoked with the command enano rather than ./enano. The bioconda help page shows the commands if you wish to install enano in an environment.

Install from source code

Download repository

git clone https://github.com/guilledufort/EnanoFASTQ.git

Requirements

  1. g++ ( >= 4.8.1)
  2. OpenMP library

Install

The following instructions will create the enano executable in the directory enano. To compile enano you need to have the g++ compiler and the OpenMP library for multithreading.

On Linux, g++ is usually installed by default. If it is not, run the following on Ubuntu (on CentOS, which does not use apt, run sudo yum install gcc-c++ instead):

sudo apt update
sudo apt-get install g++

On macOS, install the GCC compiler, since Clang has issues with the OpenMP library:

  • Install HomeBrew (https://brew.sh/)
  • Install GCC (this step will be faster if Xcode command line tools are already installed using xcode-select --install):
brew update
brew install gcc@9

The g++ installer also installs the OpenMP library, so no further steps are needed. To check if the g++ compiler is properly installed in your system run:

On Linux:

g++ --version

On macOS:

g++-9 --version

The output should show the name and version of the installed compiler.
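The version requirement listed above (g++ >= 4.8.1) can also be checked automatically. The following is a minimal sketch, not part of ENANO: version_ge is a hypothetical helper, and it assumes a sort command that supports the -V (version sort) option, as GNU coreutils does.

```shell
#!/bin/sh
# Sketch: check that the installed compiler meets the g++ >= 4.8.1 requirement.
# version_ge is a hypothetical helper, not something shipped with ENANO.

REQUIRED="4.8.1"

# Succeeds when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Try g++-9 first (Homebrew GCC on macOS), then plain g++.
for cxx in g++-9 g++; do
    if command -v "$cxx" >/dev/null 2>&1; then
        # Depending on how GCC was configured, -dumpversion may print only
        # the major version (for example "9"); that still compares correctly.
        found="$("$cxx" -dumpversion)"
        if version_ge "$found" "$REQUIRED"; then
            echo "$cxx $found satisfies >= $REQUIRED"
        else
            echo "$cxx $found is older than $REQUIRED"
        fi
        break
    fi
done
```

If neither compiler is found, the script prints nothing.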

To compile enano run:

cd EnanoFASTQ/enano
make

USAGE

Run the enano executable /PATH/TO/enano (or just enano if installed with conda) with the options below:

To compress:
  enano [options] [input_file [output_file]]

    -c             To use MAX COMPRESSION MODE. Default is FAST MODE.

    -k <length>    Basecall sequence context length. Default is 7 (max 13).

    -l <length>    Length of the DNA neighborhood sequence used in the quality score context. Default is 6.

    -t <num>       Maximum number of threads the compressor may use. Default is 8.

To decompress:
  enano -d [options] foo.enano foo.fastq

    -t <num>       Maximum number of threads the decompressor may use. Default is 8.
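When compression is scripted, it can help to check the documented limit on -k before launching a long run. The sketch below assembles a compression command from the options listed above. enano_compress_cmd is a hypothetical helper, not part of ENANO. It only prints the command, and the defaults and the upper bound of 13 are taken from the usage text.

```shell
# Hypothetical helper: print an enano compression command after checking
# the documented limit on -k (max 13). Defaults mirror the usage text.
# Arguments: input output [k] [l] [threads]
enano_compress_cmd() {
    input="$1"; output="$2"
    k="${3:-7}"; l="${4:-6}"; threads="${5:-8}"

    # Reject anything that is not a plain non-negative integer.
    case "$k" in
        ''|*[!0-9]*) echo "error: -k must be an integer" >&2; return 1 ;;
    esac
    if [ "$k" -gt 13 ]; then
        echo "error: -k must be at most 13" >&2
        return 1
    fi

    echo "enano -k $k -l $l -t $threads $input $output"
}
```

For example, enano_compress_cmd reads.fastq reads.enano 8 5 4 prints enano -k 8 -l 5 -t 4 reads.fastq reads.enano, while a value of 14 for k is rejected with an error message.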

Datasets information

To test our compressor we ran experiments on the following datasets. Full details of the datasets are given in our publication.

Dataset  Num. of files  Size (GB)  Description                                 Link
sor*     4              124.071    Sorghum bicolor Tx430                       https://www.nature.com/articles/s41467-018-07271-1#data-availability
bra*     18             43.014     Doubled haploid canola (Brassica napus L.)  https://www.nature.com/articles/s41598-019-45131-0#data-availability
lun      13             15.239     Human lung bacterial metagenomic            https://www.nature.com/articles/s41587-019-0156-5#data-availability
joi      9              4.672      Infected orthopaedic devices metagenomic    https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5094-y
vir*     10             4.375      Direct RNA sequencing (HSV-1)               https://www.nature.com/articles/s41467-019-08734-9#data-availability
hs1      1              249.791    Human GM12878 Utah/Ceph cell line           https://github.com/nanopore-wgs-consortium/NA12878
hs2^     50             193.920    Human GM12878 Utah/Ceph cell line           https://www.nature.com/articles/s41467-019-09637-5#data-availability
npd*     336            113.440    Multiple organisms                          https://github.com/guidufort/DualFqz

*Datasets that require the SRA toolkit to be downloaded.

^We only used the first 50 files of the dataset.

Downloading the datasets

To download a dataset, run the download_script.sh of that dataset. For example, to download sor run:

cd EnanoFASTQ
dataset/sor/download_script.sh
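To fetch several datasets in one go, the same command can be wrapped in a loop. The sketch below is not part of the repository: download_datasets is a hypothetical helper, and it assumes that every dataset follows the dataset/<name>/download_script.sh layout shown for sor and that the scripts are executable.

```shell
# Hypothetical helper: run the download script of each named dataset.
# Assumes the dataset/<name>/download_script.sh layout shown for sor.
# Arguments: repository_root dataset_name...
download_datasets() {
    root="$1"; shift
    status=0
    for name in "$@"; do
        script="dataset/$name/download_script.sh"
        if [ -x "$root/$script" ]; then
            echo "Downloading $name"
            # Run from the repository root, as in the single-dataset example.
            (cd "$root" && "./$script") || status=1
        else
            echo "Skipping $name: $script not found or not executable" >&2
            status=1
        fi
    done
    # Non-zero if any dataset was skipped or its script failed.
    return "$status"
}
```

For example, download_datasets EnanoFASTQ sor joi would run the scripts for sor and joi in turn and return a non-zero status if either one is missing or fails.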

The scripts use the command wget to perform the download. To install wget on macOS run:

brew install wget

To install wget on Ubuntu run the following (on CentOS, which does not use apt, run sudo yum install wget instead):

sudo apt-get install wget

Some datasets can only be downloaded with the SRA toolkit (2.9.6-1 release). To install it, either follow the instructions at https://ncbi.github.io/sra-tools/install_config.html and place the toolkit's root folder under the EnanoFASTQ directory, or run one of the scripts we provide. There is a separate script for each OS, so choose the one that matches yours. For example, to install the SRA toolkit on macOS run:

cd EnanoFASTQ
./install_SRA_mac.sh

Examples

If installed using conda, use the command enano instead of enano/enano.

Compress using ENANO

To run the compressor with 4 threads on the example file:

cd EnanoFASTQ
enano/enano -k 8 -l 5 -t 4 example/SAMPLE.fastq example/SAMPLE.enano

Decompress using ENANO

To decompress the example compressed file with 8 threads:

cd EnanoFASTQ
enano/enano -d -t 8 example/SAMPLE.enano example/SAMPLE_dec.fastq

Check if decoding is successful

Compare the original file with the decompressed one. If decoding was successful, the following command prints nothing:

cmp example/SAMPLE.fastq example/SAMPLE_dec.fastq
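In a script, the exit status of cmp is easier to act on than its empty output. The following is a minimal sketch of such a check. check_roundtrip is a hypothetical helper, not part of ENANO, and it uses cmp -s, which suppresses output and reports the result only through the exit status.

```shell
# Hypothetical helper: report whether a decompressed file is byte-identical
# to the original. Arguments: original_file decompressed_file
check_roundtrip() {
    if cmp -s "$1" "$2"; then
        echo "OK: $2 matches $1"
    else
        echo "MISMATCH: $2 differs from $1" >&2
        return 1
    fi
}
```

After the decompression example above, check_roundtrip example/SAMPLE.fastq example/SAMPLE_dec.fastq would print an OK line on success and return a non-zero status otherwise.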

Credits

The methods used to encode the read names, maintain the model frequency counters, and parse the reads are those proposed by James Bonefield in FQZComp, with some modifications. The range coder is derived from Eugene Shelwien's.