Note: an improved FASTQ compressor, SPRING, is available at https://github.com/shubhamchandak94/SPRING.
HARC (HAsh-based Read Compressor) is a tool for compressing genomic reads in FASTQ format. It compresses only the read sequences, achieving near-optimal compression ratios and fast decompression. HARC supports up to 4.29 billion fixed-length reads of length at most 256, and needs around 50 bytes of RAM per read during compression for read length 100 (for example, compressing 100 million reads of length 100 requires roughly 5 GB of RAM). HARC requires g++ with C++11 support and runs on Linux; p7zip should already be installed.
To download and install HARC, run
git clone https://github.com/shubhamchandak94/HARC.git
cd HARC
./install.sh
Compression:
./harc -c FASTQ_file [-p] [-t num_threads] [-q]
-p = preserve the order of reads (the compression ratio is 2-4x worse if the order is preserved)
-t num_threads = number of threads (default: 8)
-q = write quality values and read IDs to .quality and .id files, respectively; the quality values and read IDs are reordered to match the new read order if -p is not specified
Decompression:
./harc -d HARC_file [-p] [-t num_threads] [-m max_memory]
-p = restore the reads to their original order (slower); only applicable if -p was used during compression
-t num_threads = number of threads (default: 8)
-m max_memory = maximum memory in GB for the order-restoration step during decompression with -p (minimum 3 GB for 8 threads), e.g., -m 10 for a 10 GB cap; default: 7 GB (note: less than 3 GB is needed if -p is not specified)
To see the help message, run
./harc -h
To compress file.fastq located in the HARC home directory, without preserving order and using the default 8 threads, run
./harc -c file.fastq
The compressed file is located in the same directory and is named file.harc.
To decompress file.harc to file.dna.d, which contains the reads (not in the original order), using 4 threads, run
./harc -d file.harc -t 4
To compress file.fastq while preserving the read order, run
./harc -c file.fastq -p
Now file.harc also contains the order information, and it can be decompressed with or without the -p flag. If the -p flag is not used, decompression is faster but the reads are not restored to their original order. To decompress and restore the original order, run
./harc -d file.harc -p
Decompression with the -p flag also lets the user cap the memory used during the order-restoration step via the -m flag, e.g., to use at most 10 GB of memory for that step while decompressing, run
./harc -d file.harc -p -m 10
Note, however, that the BSC decompression step uses 350 MB per thread (both with and without -p) irrespective of the -m switch; with the default 8 threads, this alone amounts to about 2.8 GB. This component can be controlled by reducing the number of threads. Also, the maximum number of threads HARC can use during decompression is limited by the number of threads used during compression.
To preserve the quality values and read identifiers, use the -q switch during compression, e.g.,
./harc -c file.fastq -q
This will write the quality values to file.quality and the read identifiers to file.id in the same directory as file.fastq. Since the command was run without the -p flag, the quality values and IDs are reordered to match the new order of the reads. The quality value compressor QVZ (available at https://github.com/mikelhernaez/qvz) can then be used to compress file.quality directly.
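If a FASTQ file needs to be reassembled from the decompressed reads and the reordered .quality and .id files, standard Unix tools suffice. The following is a minimal sketch (not part of HARC); it assumes file.id stores the full @-prefixed identifier lines, that the decompressed read order in file.dna.d matches the reordered .quality and .id files, that none of the files contain tab characters, and file_reordered.fastq is just an illustrative output name:
# interleave ID, read, '+' and quality lines back into FASTQ records
paste file.id file.dna.d file.quality | awk -F'\t' '{print $1; print $2; print "+"; print $3}' > file_reordered.fastq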
To download and prepare the SRR554369 dataset, run
wget -b ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR554/SRR554369/SRR554369_1.fastq.gz
wget -b ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR554/SRR554369/SRR554369_2.fastq.gz
gunzip SRR554369_1.fastq.gz SRR554369_2.fastq.gz
cat SRR554369_1.fastq SRR554369_2.fastq > SRR554369.fastq
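As a quick end-to-end check on this dataset, the compression and decompression commands described above can be run directly (this assumes SRR554369.fastq was placed in the HARC home directory; the output names follow the file.fastq to file.harc to file.dna.d pattern shown earlier):
./harc -c SRR554369.fastq
./harc -d SRR554369.harc -t 4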
For some datasets (e.g., SRR327342 and SRR870667), the two FASTQ files may contain reads of different lengths. To use HARC on such datasets, compress the two FASTQ files separately, as shown below.
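For example, using placeholder file names for such a paired dataset, each file would be compressed on its own:
./harc -c SRR327342_1.fastq
./harc -c SRR327342_2.fastq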
To download and prepare the MH0001 gut metagenome dataset, run
wget -b http://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/MH0001/081026/MH0001_081026_clean.1.fq.gz
wget -b http://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/MH0001/081026/MH0001_081026_clean.2.fq.gz
gunzip MH0001_081026_clean.1.fq.gz MH0001_081026_clean.2.fq.gz
cat MH0001_081026_clean.1.fq MH0001_081026_clean.2.fq > MH0001_081026_clean.fq
To download the human reference genome (used below for generating simulated reads), run
wget -b ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
Inside this file, the different chromosomes are demarcated by their FASTA headers, so an individual chromosome (e.g., chromosome 22) can be extracted before generating reads.
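One possible way to extract chromosome 22 for the read simulation below is sketched here; it assumes samtools is installed and that chromosome 22 is named '22' in this reference (as in the 1000 Genomes b37 FASTA):
gunzip human_g1k_v37.fasta.gz
# index the reference and write the chromosome 22 sequence to its own FASTA file
samtools faidx human_g1k_v37.fasta 22 > PATH/chrom22.fasta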
To build the read simulator included in util/gen_fastq_noRC, run
cd util/gen_fastq_noRC
make
To generate error-free reads (without reverse complementation), e.g., 35M reads of length 100 from chromosome 22, run
./gen_fastq_noRC 35000000 100 PATH/chrom22.fasta PATH/chrom22_reads.fastq
To instead generate reads with a 1% uniform substitution rate (each base is equally likely to be changed to any of the 4 other possibilities, e.g., A to C, T, G or N), again without reverse complementation, run
./gen_fastq_noRC 35000000 100 PATH/chrom22.fasta PATH/chrom22_reads.fastq -e
Each record in the generated FASTQ file has the form:
@seq id
read
+
quality score
The directory util/ contains quality_counts.cpp and noise_entropy.py, which can be used to compute the noise entropy upper bound using the method described in the Supplementary Data (https://github.com/shubhamchandak94/HARC/blob/master/supplementary-data.pdf). To use these:
- Write the quality values (every fourth line of the FASTQ file) to a separate file, e.g., by using
sed -n '4~4p' file.fastq > file.quality
- Modify the read length in the header of quality_counts.cpp and compile it by running
g++ util/quality_counts.cpp -O3 -march=native -std=c++11 -o util/quality_counts.out
- Generate the quality counts by running
./util/quality_counts.out file.quality file.quality_counts
- Update the read length and input file in the header of noise_entropy.py and run
python util/noise_entropy.py
to get the noise entropies for models of different orders.
For comparison with other read compressors, ORCOM can be installed and run as follows:
git clone https://github.com/lrog/orcom.git
cd orcom
make boost
cd bin
./orcom_bin e -iPATH/SRR065390.fastq -oPATH/SRR065390.bin
./orcom_pack e -iPATH/SRR065390.bin -oPATH/SRR065390.orcom
./orcom_pack d -iPATH/SRR065390.orcom -oPATH/SRR065390.dna
The resulting .dna file contains one read per line:
read1
read2
..
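As a quick sanity check (not part of ORCOM), the line count of the .dna file should equal the number of reads in the original FASTQ, i.e., the FASTQ line count divided by 4; both commands below should print the same number:
wc -l < PATH/SRR065390.dna
echo $(( $(wc -l < PATH/SRR065390.fastq) / 4 ))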
Similarly, Leon can be installed and run as follows:
git clone https://github.com/GATB/leon.git
cd leon
sh INSTALL
To compress only the read sequences:
./leon -file PATH_TO_FASTQ -c -seq-only -nb-cores NUM_THR
To decompress:
./leon -file PATH_TO_LEON -d -nb-cores NUM_THR
The decompressed .fasta.d file is of the form:
>1
read1
>2
read2
..
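To get the Leon output into the same one-read-per-line form as the .dna files above, the '>' header lines can be stripped with grep; PATH_TO_FASTA_D below is a placeholder for the .fasta.d file produced by Leon, and the output name is illustrative:
grep -v '^>' PATH_TO_FASTA_D > PATH/SRR065390_leon.dna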