/FAME

Primary LanguageC++GNU General Public License v3.0GPL-3.0

FAME

Fast and Accurate MEthylation Aligner for large mammalian genomes. Carries out alignment and methylation calling for CpGs of Whole Genome Bisulfite Sequencing (WGBS) reads in one go without the need of intermediate alignment or buffer files. The algorithm is working on the full alphabet (A,C,G,T), resolving the asymmetric mapping problem* correctly. Reference genomes are expected to be .fasta files and reads in .fastq (or .fastq.gz) format.

The code is written in C++ and parallelized using OpenMP and is licensed under GPL3.

*In WGBS experiments, unmethylated Cytosines are converted to Thymines in Reads, thus it is necessary to allow Read Cytosines to map to Reference Thymines, but not vice versa. This is termed asymmetric mapping problem and is commonly tackled by working only on the reduced alphabet (A,T,G), which results in false matchings of reference Ts to read Cs.

0) News

This tool is still under development. We will update the news section once we published the algorithm and a stable version of the code is available.

1) Getting Started

The following steps explain how to install and setup FAME .

A) Dependencies

FAME is dependent on the following libraries, which are all shipped with this repository. So no additional work is necessary here.

  • ntHash - a fast library for genomic rolling hash functions.
  • sparsehash - an efficient hash map implementation from Google.
  • hopscotch-map - an efficient hash map implementation by Tessil.
  • gzstream - a C++ stream interface for working with gzip files.

To compile the program, a recent version of the GNU Compiler Collection (gcc) or LLVM Clang (clang) is required. The gzstream library is dependent on zlib, which is usually installed on Linux systems.

B) Installation

To retrieve the code base, just clone this repository:

git clone https://github.com/FischerJo/Metal

To install FAME on your machine, just use the shipped Makefile by typing

make

in the top level directory of the cloned repository.

C) Simple example

FAME, as most other tools, uses a sohpisticated index structure of the reference genome to accelerate alignment computation. The tool has two separate functions, the construction of the index and the alignment of the reads. After construction of the index, it is stored in an efficient binary format and can thus be reused for different experiments.

To construct an index with the parameters specified in CONST.h (see 2.A):

./FAME --genome /Path/To/Genome/genome.fasta --store_index /Path/To/produced_index

where genome.fasta is a genome stored in fasta format and produced_index is the name of the output file holding the index after execution. For a full set of options see 2.A and 2.B.

To align reads to an index call:

./FAME -r /Path/To/reads.fastq --load_index /Path/To/produced_index -o /Path/To/output

with /Path/To/output being the output file path for the CpG report. For a full set of options we refer to section 2.B.

2) Manual

A) Parameter settings

All parameters are set in the file CONST.h. If you change any of those parameters, make sure to rebuild the program:

make clean
make

An index is dependet on the parameters, that is, if you call the program with an index that was built with different parameters, it will throw an error.

Here is a list of the external parameters:

Parameter Definition Recommended value Location (line number)
READLEN (Maximum) Length of reads queried to the index 100 35
CORENUM Number of threads spawned by the program. Should be number of free cores on the system. 16 38
MINPDIST Minimum distance between a read pair in paired end mode. Measured from end to first read to beginning of second read. 20 41
MAXPDIST Maximum distance between a read pair in paired end mode. Measured from end to first read to beginning of second read. 400 42
CHROMNUM Number of chromosomes of reference organism. 24 45

Here is a list of some important internal parameters, we strongly recommend NOT to change them:

Parameter Definition Recommended value Location (line number)
SEED The gapped q-gram (aka seed) to use as array of bits [see below] 57
SEEDBITS The gapped q-gram as bitstring 0b11011110111111011111111111111101 58
CHUNKSIZE Number of reads (or read pairs) read to buffer. 300000 69
KMERLEN k, the length of a k-mer for the index. This is a very sensitive parameter. Must be the length of the seed. 32 73
QTHRESH Minimum number of k-mer matches required for match verification 5 80
WINLEN Window length for the index data structure. 2048 90
MISCOUNT Number of errors considered for k-mer filters. 2 94
ADDMIS Number of errors additionally (to MISCOUNT) allowed in alignment 4 97
KMERCUTOFF Controls hash collisions in index. Low value means more lossy but faster filter. Not considered if --no_loss flag is set during index construction. 1500 100
KMERDIST Controls pruning after matching a read to a Window. Minimum distance of count of window to prune and count of matched window. 10 104
SKIPMOD Hash only every SKIPMODth k-mer of reference. 2 107

B) FAME command line arguments

The following is a description of all arguments accepted by FAME. Examples on how to use FAME in the command line are given in 2.D.

Flag Argument Description
--genome Filepath Forces the tool to build an index for the specified .fasta reference file
--gzip_reads None Treats the read files passed to -r or -r1 and -r2 as gzipped files.
-h None Lists all available options with a description.
--help None see -h
--human_opt None Uses optimizations for human reference genomes to prune away unlocalized contigs etc
--load_index Filepath Loads the index constructed before. NOTE: Parameters in CONST.h must be the same.
--no_loss None Builds the index with lossless filter (NOT RECOMMENDED).
-o Filepath Base name for output file. Contains CpG methylation levels after processing.
--out_basename Filepath see -o
-r Filepath Forces the tool to query the specified single end read .fastq file to a loaded index.
-r1 Filepath Path to file with first reads of a paired read set. Read format must be .fastq.
-r2 Filepath Path to file with second reads of a paired read set. Read format must be .fastq.
--store_index Filepath Writes output of index construction to filepath (~38GB for human genome). NOTE: Directory must exist.
--unord_reads None Disable optimization to find stranding of reads.

C) Output format

The alignment of a read set to an index produces an output file (see -o command line argument) that contains methylation levels of all CpGs in the reference genome. The file is a tab-separated value file (.tsv) with the following structure:

Chromosome  Position   #Meth_Cs_fwd   #Unmeth_Cs_fwd   #Meth_Cs_rev   #Unmeth_Cs_rev

where #Meth_Cs_fwd means number of methylated Cytosines on forward strand CpG and so on. The position is a (zero based) count of the bases of a chromosome, indicating the position of the C of a CpG. The forward strand is the strand provided in the reference file, the reverse complement strand the strand not provided in the reference file.

D) Extended examples

Suppose you want to match paired reads to a human genome and save it to the file with prefix pairedResults. The reads are in gzip format.

First build the index:

./FAME --genome /Path/To/Genome/genome.fasta --store_index /Path/To/produced_index --human_opt

Align the reads:

./FAME --load_index /Path/To/produced_index -r1 /Path/To/r1.fastq.gz -r2 /Path/To/r2.fastq.gz --gzip_reads -o pairedResults

A file called pairedResults_cpg.tsv is generated containing the CpGs with corresponding methylation counts.

3) Contact

We appreciate any feedback to our tool. Feel free to contact us by mailing to fischer 'at' mpi-inf.mpg.de.

If you found a bug, please contact us with a description of how you called the tool and which data you used.

4) Authors

  • Jonas Fischer - Main Contributor - Max-Planck-Institute for Informatics, Saarbruecken, Germany
  • Marcel H. Schulz Max-Planck-Institute for Informatics, Saarbruecken, Germany

5) License

This project is licensed under the GPL3 - see the LICENSE_GPL_3_0 file for details.

6) References