yalff: A C++ repository from yhhshb

Publications

Better quality score compression through sequence-based quality smoothing. Presented at BITS Turin 2018.
Indexing k-mers in linear-space for quality value compression. Presented at BIOINFORMATICS 2019.

Introduction

YALFF (Yet Another Lossy FASTQ Filter) is a smoother for FASTQ files that uses an FM-Index to store its k-mer database. Because the dictionary of k-mers can be linearized into contigs, the compressed index requires far less memory than the dictionaries used by other tools such as Quartz.
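
As a rough illustration of why linearizing the dictionary saves space, the toy sketch below compares storing every k-mer of a short sequence separately with storing the single contig that contains them all. It is not YALFF's index: the sequence, the value of k and the helper name kmers are made up for this example, and no FM-Index is involved.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Hypothetical toy contig; a real dictionary would contain millions of k-mers.
contig = "ACGTTGCAAGGCTTAACGGA"
k = 8

distinct = set(kmers(contig, k))

# Characters needed when every k-mer is stored as its own string.
separate_chars = len(distinct) * k
# Characters needed when the same k-mers are stored as one contig.
contig_chars = len(contig)

print(separate_chars, contig_chars)
```

Every k-mer is still recoverable as a substring of the contig, so a substring index built over contigs can answer the same membership queries.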

The actual compression is performed by standard lossless compressors such as gzip, bzip2 and xz. Smoothing increases their compression ratio by lowering the entropy of the quality stream: most quality values are replaced with a single fixed value, while the algorithm guarantees that the qualities most relevant to downstream analysis are left untouched.
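
The effect of smoothing on a lossless compressor can be shown with a small experiment. The sketch below is only an assumption-laden illustration: the quality string is synthetic, the choice of which positions to keep (one in every 20) is arbitrary and has nothing to do with how YALFF selects them, and Python's zlib module stands in for gzip (both use DEFLATE).

```python
import random
import zlib


def smooth(qualities, keep, replacement="I"):
    """Replace every quality whose index is not in `keep` with `replacement`."""
    return "".join(
        q if i in keep else replacement for i, q in enumerate(qualities)
    )


random.seed(0)

# Synthetic quality string: Sanger characters drawn uniformly from '#' to 'I'.
alphabet = [chr(c) for c in range(ord("#"), ord("I") + 1)]
raw = "".join(random.choice(alphabet) for _ in range(10_000))

# Arbitrary stand-in for "relevant" positions: keep one value out of 20.
keep = set(range(0, len(raw), 20))
smoothed = smooth(raw, keep)

raw_size = len(zlib.compress(raw.encode()))
smoothed_size = len(zlib.compress(smoothed.encode()))

print("compressed size before smoothing:", raw_size)
print("compressed size after smoothing: ", smoothed_size)
```

The smoothed string has the same length as the original, so read and quality lines stay aligned; only its compressed size shrinks.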

Smoothing by using a reference

The easiest way to obtain a reliable set of k-mers is to index an already available reference genome. This option is recommended when no list of known SNPs is available, or when reassembling all the k-mers coming from real datasets would be too expensive to carry out.
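
The general idea of reference-based smoothing can be sketched as follows. This is a deliberately simplified model, not YALFF's algorithm: a Python set replaces the FM-Index, only exact k-mer matches are considered (no mismatches, thresholds or skipping), and the names build_kmer_set and smooth_read as well as the toy sequences are hypothetical.

```python
def build_kmer_set(reference, k):
    """Collect every k-mer of the reference (stand-in for a real index)."""
    return {reference[i:i + k] for i in range(len(reference) - k + 1)}


def smooth_read(bases, qualities, kmer_set, k, replacement="I"):
    """Replace the qualities of all bases covered by a k-mer found in kmer_set."""
    covered = [False] * len(bases)
    for i in range(len(bases) - k + 1):
        if bases[i:i + k] in kmer_set:
            for j in range(i, i + k):
                covered[j] = True
    return "".join(
        replacement if c else q for c, q in zip(covered, qualities)
    )


# Toy reference and a read that differs from it in its second-to-last base.
reference = "ACGTACGGTCAGTTCA"
kmer_set = build_kmer_set(reference, 4)

result = smooth_read("ACGGTCTG", "5:8<7$9;", kmer_set, 4)
print(result)  # IIIIII9;
```

In this toy run the bases supported by reference k-mers receive the replacement value, while the qualities around the mismatch are left as they were.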

SNPs-Aware solution

If a set of SNPs is known, the mutated k-mers can be added to the reference in order to smooth a larger number of values and to signal the correctness of a base to downstream analysis. All the reassembled indices used in our studies can be found at:

The results produced by this type of index are better than those of the standard reference in terms of overall compression, Precision and Recall, similarly to what happens in GeneCodeq. It is also possible to reassemble a k-mer dictionary with the assembler developed for the ProPhyle package. For this purpose the scripts folder contains the utility print_mitdb.c, which prints the k-mers of the dictionaries generated by Quartz and LAVA so that they can be redirected to the assembler.
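
To make the notion of "mutated k-mers" concrete, the sketch below enumerates the k-mers that overlap a single SNP once the alternative base is substituted into the reference. It is an illustration under stated assumptions only: the function name mutated_kmers, the 0-based position convention and the toy sequence are hypothetical, and the sketch says nothing about how Quartz, LAVA or YALFF actually build their dictionaries.

```python
def mutated_kmers(reference, pos, alt, k):
    """Return the k-mers overlapping `pos` after replacing its base with `alt`.

    `pos` is a 0-based index into `reference` (an assumption of this sketch).
    """
    mutated = reference[:pos] + alt + reference[pos + 1:]
    # First and last k-mer start positions that still cover `pos`.
    start = max(0, pos - k + 1)
    end = min(pos, len(mutated) - k)
    return [mutated[i:i + k] for i in range(start, end + 1)]


# Toy example: an A>G substitution at index 4 of a 10-base reference.
extra = mutated_kmers("ACGTACGGTC", 4, "G", 4)
print(extra)  # ['CGTG', 'GTGC', 'TGCG', 'GCGG']
```

Adding these extra k-mers to the dictionary lets reads carrying the known variant match the index instead of being treated as errors.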

The whole procedure can be described as follows:

1. Download the Quartz dictionary from the original links 1 and 2.
2. Clone the LAVA repository, compile it and download hg17 and the associated SNP lists (dbSNPs, affy).
3. Run LAVA on its files to construct its reference indices.
4. Use print_mitdb.c to stream the contents of the 4 dictionaries (the unswapped one for Quartz and the 3 for LAVA) to the Prophasm assembler.
5. Index the FASTA file produced by the assembler.

The commands described here can be found in the scripts folder.

Installation

YALFF links statically to bwa to use its FM-Index implementation and shared memory capabilities. This has the implicit advantage of not requiring a separate construction step if an index is already available. The zlib library is the only system-wide dependency required.

git clone --recursive https://github.com/yhhshb/yalff.git
cd yalff
make

The current stable version can be found at: but it includes neither a copy of bwa nor a copy of the CTPL library, both of which must be downloaded and added separately.

Evaluation

Precision, Recall and F-Measure are computed by aligning the smoothed dataset to the reference and comparing the quality of the resulting alignment to a standard ground truth. The dataset used for comparison in our study is the Platinum genome NA12878, together with the related VCF files downloaded from the Illumina FTP server (which is open access: if it asks for a password, just continue).

A script is available containing the whole pipeline used for evaluation.
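
For readers unfamiliar with the three measures, the sketch below shows how they are commonly derived from a set of called variants and a ground-truth set. It only illustrates the standard formulas: representing variants as (chromosome, position, ref, alt) tuples, the function name precision_recall_f and the toy data are assumptions of this example, and the actual evaluation is the one implemented in the script mentioned above.

```python
def precision_recall_f(called, truth):
    """Compute precision, recall and F-measure for two sets of variants."""
    tp = len(called & truth)   # variants called and present in the truth set
    fp = len(called - truth)   # variants called but absent from the truth set
    fn = len(truth - called)   # true variants that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up toy data: 5 true variants, 4 calls, 3 of which are correct.
truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 77, "T", "C"),
    ("chr3", 12, "A", "T"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr3", 99, "G", "C"),
}

precision, recall, f_measure = precision_recall_f(called, truth)
print(precision, recall, f_measure)
```

Running the same comparison on the original and on the smoothed dataset shows how much downstream accuracy is affected by smoothing.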

Usage

YALFF preserves the order of the reads in its input, so there is no need for a paired-end mode: the two files of a paired-end dataset can be smoothed separately and their reads will still match.

Smoothing a fastq file:

  cat file.fastq | ./yalff -d index.fa > file_smoothed.fastq

Smoothing a gzipped fastq file:

  zcat file.fastq.gz | ./yalff -d index.fa | gzip > file_smoothed.fastq.gz

Smoothing using an index loaded in shared memory:

  bwa shm index.fa
  zcat file.fastq.gz | ./yalff -shm index.fa | gzip > file_smoothed.fastq.gz
  bwa shm -d

All the options available are:

  -d STR	 Reference file in fasta format.

  -k NUM	 k-mer length. [32] (must be less than 255)

  -m NUM	 Number of mismatches allowed per k-mer. [1]

  -c NUM	 Chunk size. The number of reads read at once on each iteration. [10000]

  -b CHAR	 Sanger threshold for a quality score to be considered. [$]

  -g CHAR	 Sanger threshold for a quality score to be considered correct independently from the dictionary. [I]

  -s NUM	 Number of bases to skip after each k-mer. A value of 0 checks all the k-mers. [0]

  -q CHAR	 Sanger value used as replacement during smoothing. [I]

  -e CHAR	 Sanger value used as a possible replacement when a k-mer aligns badly. [j]

  -t NUM	 Number of threads available. [Hardware concurrency - 1]

  -sst NUM	 Smoothing algorithm. 0 checks all and only the k-mers considered. 1 applies a seed-and-extend search if a k-mer has no mismatches. [0]

  -shm STR	 Reference file loaded into shared memory.

  -h	 Show this help.
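
Several options take a quality value as a Sanger character, that is, a Phred score encoded as an ASCII character with an offset of 33. The sketch below decodes the default characters listed above; the helper names sanger_to_phred and phred_to_error_probability are chosen for this example only.

```python
def sanger_to_phred(char):
    """Decode a Sanger (Phred+33) quality character into its Phred score."""
    return ord(char) - 33


def phred_to_error_probability(score):
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Default characters of the -b, -g/-q and -e options.
defaults = {"$": sanger_to_phred("$"),
            "I": sanger_to_phred("I"),
            "j": sanger_to_phred("j")}
print(defaults)  # {'$': 3, 'I': 40, 'j': 73}
```

For instance, the default replacement value I corresponds to a Phred score of 40, i.e. an estimated error probability of about 1 in 10,000.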

Availability

YALFF is licensed under the MIT licence. Be aware that the final executable falls under the GPL, because it is linked at object level with bwa.