Faucet

This is the codebase for Faucet, described in our manuscript (https://academic.oup.com/bioinformatics/article/34/1/147/4004871) by Roye Rozov, Gil Goldshlager, Eran Halperin, and Ron Shamir.

You can download Faucet as a zip archive or clone it with the command below. If you download the zip, unzip it before following the instructions below, skipping the 'git clone' line.

Getting Faucet

git clone https://github.com/rozovr/Faucet.git
cd Faucet/src
make depend
make    

Running Faucet (locally)

Example usage:

./faucet -read_load_file interlaced_reads.fq \
         -read_scan_file interlaced_reads.fq \
         -size_kmer 31 \
         -max_read_length 100 \
         -estimated_kmers 1000000000 \
         -singletons 200000000 \
         -file_prefix faucet_outputs \
         --fastq \
         --paired_ends

The command above takes as input the file interlaced_reads.fq, a FASTQ file in which entries alternate between mates 1 and 2 of a paired-end library. Faucet does not accept separate mate files, but it does accept FASTA input and files consisting of read sequences alone.
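Because Faucet expects a single interlaced file, a paired-end library stored as two mate files has to be merged first. The following is a minimal sketch, not part of Faucet, assuming both inputs are uncompressed FASTQ files with exactly four lines per record and mates listed in the same order. The function name interleave_fastq and the file names are hypothetical.

```shell
# Hypothetical helper (not part of Faucet): interleave two mate files.
# Assumes 4-line FASTQ records and identical record order in both files.
interleave_fastq() {
    mate1=$1
    mate2=$2
    awk -v mate2="$mate2" '
        { print }                          # copy the current mate-1 line
        NR % 4 == 0 {                      # a full mate-1 record was printed
            for (i = 0; i < 4; i++) {      # follow it with one mate-2 record
                if ((getline line < mate2) <= 0) {
                    print "mate 2 file ended early" > "/dev/stderr"
                    exit 1
                }
                print line
            }
        }
    ' "$mate1"
}
```

It could be used as: interleave_fastq reads_1.fq reads_2.fq > interlaced_reads.fq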

Streaming from a remote source

A demonstration streaming reads from a remote server is provided in the script src/stream_data_from_urls_list.sh

You can run it with:

./stream_data_from_urls_list.sh out wget_urls 1596741569 12045222

where wget_urls is a file with URLs downloaded from ENA, 1596741569 is the estimated number of unique k-mers (F0), and 12045222 is the estimated number of singleton k-mers (f1).

Requirements

Faucet is implemented in C++11 and therefore requires a compiler that supports that standard. It has been tested only on Linux so far.

Detailed usage

Usage: ./faucet -read_load_file <filename> -read_scan_file <filename> -size_kmer <k> -max_read_length <length> -estimated_kmers <num_kmers> -singletons <num_kmers> -file_prefix <prefix>

Optional arguments: --fastq -max_spacer_dist -fp rate -j --two_hash -bloom_file -junctions_file --paired_ends --no_cleaning

Required arguments:

-read_load_file <filename>, a file name string
-read_scan_file <filename>, a file name string
-size_kmer <k>, an odd integer <= 31
-max_read_length <length>, the longest read length in the data (relevant, e.g., if the reads were trimmed to different lengths)
-estimated_kmers <num_kmers>, the estimated number of unique k-mers (F0)
-singletons <num_kmers>, the estimated number of singleton k-mers (f1)
-file_prefix <prefix>, the desired prefix string or directory path for output files

We recommend using ntCard to estimate the number of unique k-mers (F0) and singleton k-mers (f1) in the dataset.
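As a rough sketch of how those two numbers might be pulled out of ntCard's output, the helper below reads a histogram file that ntCard has already written. It assumes each line is a whitespace-separated "label value" pair, with a line labelled F0 holding the distinct k-mer estimate and a line labelled 1 holding the number of k-mers seen exactly once. That layout, the function name ntcard_counts, and the file name in the usage note are assumptions to check against the documentation of your ntCard version.

```shell
# Hypothetical helper: print "<F0> <f1>" from an ntCard-style histogram.
# Assumed layout (verify for your ntCard version): one "label value" pair per
# line, where label "F0" is the distinct k-mer estimate and label "1" is the
# count of k-mers occurring exactly once.
ntcard_counts() {
    awk '
        $1 == "F0" { f0 = $2 }
        $1 == "1"  { f1 = $2 }
        END {
            if (f0 == "" || f1 == "") exit 1   # expected rows not found
            print f0, f1
        }
    ' "$1"
}
```

Running ntcard_counts on a histogram file (for example a hypothetical reads_k31.hist) prints the two values on one line, which can then be passed to -estimated_kmers and -singletons respectively.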

License

  • Low-level code for handling binary-encoded k-mers and strings, and for Bloom filters, is derived from the original minia implementation (http://minia.genouest.org/). These components, mostly unmodified, are distributed under the GPL 3.0 license.

  • Code original to Faucet is distributed under the BSD 3-Clause license.