statmap: A C repository from nboley

Statmap is a mapping utility.

Statmap is a fast alignment tool designed for mapping short reads to a reference 
genome under a probability model. The probability model allows statmap to do many 
things that other mapping tools cannot, including

1) Mapping to uncertain reference genome
2) Estimating mapping probabilities for multi-mappers and then intelligently 
   placing them under an assay specific model
3) Estimating confidence bounds on mapped read density.

Website:      http://encodestatistics.org/statmap/
Contact:      Nathan Boley ( nboley (AT) berkeley.edu )
Mailing List: http://groups.google.com/group/statmap

Installing Statmap
==================

Statmap needs to be built from source, and has several dependencies. In ubuntu,
run

> sudo apt-get install clang r-base-dev python-matplotlib

and then run 

> make clean; make; make check;

from the base directory to build statmap and run the unit tests. The statmap
binary will be copied to the $STATMAP_SRC/bin/. 

Using Statmap
=============

Statmap maps reads in the FASTQ format to a reference genome in a binary genome
format. You can use a pre-built binary genome, or build one from FASTA files
with the build_index utility.

build_index
-----------

utilities/build_index can build a binary genome for a haploid or diploid data
set, or a combination of the two. The arguments for build_index are

    ./build_index indexed_seq_len output_filename genome.fa(s) [diploid.map(s)]

indexed_seq_len is the length of the index probe that Statmap uses to search the
genome. It should be less than or equal to the length of the shortest read you 
plan to map. CAGE experiments need indexes that have probe lengths at least 3 
basepairs shorter than the shortest CAGE sequence, to account for untemplated 
G's.

For a haploid data set, you can simply specify any number of FASTA files. For a
diploid data set, you need to specify 3 files for each chromosome you wish to 
index: the paternal fasta, the maternal fasta, and a MAP file. SNPs are inferred
by comparing the MAP file entries to the actual genome sequence. All of the 
datasets from the 1000 Genomes project, for example, provide these files. You 
may specify input files in any order - build_index infers their relationships 
based on their filenames and FASTA identifiers.

FInally, for diploid, Statmap requires a consistent naming scheme to correctly 
group FASTAs with their associated MAP files, since the MAP file format does 
not contain a sequence specifier at this time. We require that all FASTA files 
and MAP files begin with the name of the chromosome followed by an underscore, 
and then optionally other text. We require that this name match the prefix of 
the identifier in each associated FASTA file.

Input
=====

Statmap **requires** a reference genome and reads to map.

Options are provided in both long and short forms. To see the full list, run
`statmap --help` or `statmap -?`.

Genome
------

Currently, statmap accepts a reference genome as a Statmap binary genome file.
To create the binary genome file, use `utilities/build_index.py`.

-g - reference genome - in binary format

Reads
-----

Currently, statmap ONLY ACCEPTS READS IN THE FASTQ FORMAT

-r : fastq input file for single end reads

OR 

-1 : fastq input file for the first set of read pairs
-2 : fastq input file for the second set of read pairs

In addition, there are several optional input types

-c : Specifies negative control data ( only used with the Chip Seq assay type )

OR

-3 : Specifies pair 1 negative control data ( only Chip Seq )
-4 : Specifies pair 2 negative control data ( only Chip Seq )


Optional arguments
------------------

-o : directory to write output to

The default is statmap_ouput_month_day_date_hour_min.

-s : marginal mapping search type

The two valid options are 'e' (estimated error model) and 'm' (mismatches).
These options determine the method Statmap uses to map reads to the reference.
The default is 'e'.

-p : mapping metaparameter

The meaning of this parameter depends on the marginal mapping search type.

For the estimated error model, this is the % of the input reads that Statmap
will attempt to map. It guides Statmap in adjusting the index search parameters
based on observed data that is bootstrapped from a subset of the input reads.
The default is 0.99.

For the mismatch search type, this is the fraction of bases in each read that 
are allowed to be mismatches in a mapping. For a 100bp read, a setting of 0.1
means that we will find mappings with up to 10 mismatches. The default is 0.1.

-a : The type of the underlying assay.

The two valid options are 'i' ( for chip_seq ) and 'a' ( for cage ). This must
be set in order to use Statmap's iterative mapping functionality, as the kernel
is assay-specific.

-n : the number of samples to take

This is the number of samples to take from the mapping posterior. This defaults
to 0 - Statmap will set up the iterative mapping framework, but will not 
actually iteratively map unless this parameter is set to non-zero. You can 
always do iterative mapping later with sample_mapping.

Empirically, 10 has been a reasonable default.

-t : Number of threads

Defaults to 8 or all available threads, whichever is less.

-f : The fragment length distribution. 

The fragment length density, as a tab delimited file with no header. The first  
entry in each line should be a length, and the second the fraction of fragments
that one expects to be this length. ie

100   0.01
101   0.01
...
199   0.01

Would be uniform in 100-199.  

We need this for single end chip-seq reads to iteratively map, and any paired 
end reads. However, For paired end reads, the default is to use uniquely 
mapping reads to infer the  fragment length distribution. However, for rna-seq, 
this is not necessarily correct - so be careful. 

In addition, there is a script for generating this file in utilities.

Output
======

Statmap creates a directory in which it stores all of the mapping results. The
list of saved output depends on the command line parameters. 

** STATMAP USES LOTS OF DISK SPACE **
If you are trying to map a full human experiment without a terrabyte of free 
storage, you could very well run into problems several days into mapping. DONT 
TRY - DISKS ARE CHEAP. For drosophila, 150GB or so is probably good.
** STATMAP USES LOTS OF DISK SPACE **

The ouput files can be divided into several categories:

WIGGLE:

Wiggle tracks are assay dependent. For instance, CAGE reads returns two tracks -
a forward stranded track and a reverse stranded track, whereas paired chip-seq 
just returns read densities on one strand. 

marginal_mapping_(fwd/bkwd).wig - 

Stores an estimate of the expectation of the posterior read density under the 
uniform prior assumption on the reference genome. That is, if a read maps 
equally well to two positions, we set the 'probability' that it came from each 
to 0.5

starting_samples/sample_X.wig -

When we try and sample from the space of all 'plausible' mappings, the first 
step is to look for a random, but plausible, starting location. This directory 
stores each of these starting locations as a wiggle. They can be useful if one 
wants to  visualize the the read dependence, but they are probably not terribly 
useful for most. However, they are probably worth keeping for the publication 
run because, since the sampling is random, the results are only truely 
reproducible with the random starting locations ( although, of course, if you 
are getting different results from different runs, you are not sampling deeply 
enough. )

starting_samples/meta_info.csv -

Store meta information about each starting sample. For now, this is only the 
likelihood.

samples/sample_X.wig -

These are the relaxed mappings for each starting sample. If one is interested in
exploring the variance of biological conclusions, then the best way ( if 
feasible ) is to make inference about each of these samples, and then weight by 
their respective likelihoods. For instance, if one is interested in assessing 
the variation in called peaks, a good approach might be to run the peak caller 
on each wiggle file and then report confidence as the percentage of times that 
a given peak was called. 

samples/meta_info.csv -

Store meta information about each sample. For now, this is only the likelihood.

bootstrap_samples/all_traces/sampleX/bssampleY.wig

Stores a bootstrap sample from local maximum X, bootstrap sample Y.


============ GENERATE THESE WITH ./utilities/build_aggregates.py ===============

bootstrap_samples/(min/max)_traces

Store wiggles aggregated over each of the bootstrap samples. 

min_read_density.wig - 

Stores the minimum over all samples. This is a very conservative lower bound on 
the read density.

max_read_density.wig - 

Stores the maximum over all samples. This is a very conservative upper bound on 
the read density.

================================================================================

============ GENERATE SAM WITH mapped_reads_into_sam ===========================

** WARNING - SAM files can get very big **

SAM:

mapped_reads.sam - 

Stores all of the mapped reads.

Unfortunately, the sam spec requires that the 'posterior mapping probability' be
stored as a phred score. Often, this does not give us enough precision to make
accurate estiatmes, so I've added two optional columns:

XQ - stores the probability of observing a read ( or read pair ), given that it
     came from the position at that line. 

XP - stores the probability, after relaxation, that a read came from this 
     location given that it came from somewhere in the reference genome.

Furthermore, for paired end reads, statmap only reports reads that can be paired.
( This is, at least, against the spirit of the SAM format, but I dont really see
  the point of retaining unpaired reads ).

================================================================================

OUTPUT THAT YOU PROBABLY DONT CARE ABOUT-

mapped_reads.db - 

A binary file with all of the mapped reads. The SAM stores the same information.

===== Testing Statmap ==========================================================

test.py:

Statmap has been built with a test suite that checks several aspects of
it's operation. The fields and data require python and numpy to run,
and can be found in the 'tests' sub directory.

To run the tests, run 'make check' from the root directory. However,
for finer control over the tests, and to try out statmap, it may be 
worth looking at the tests individually.

There are two global configuration options found at the begining of the file:

P_STATMAP_INPUT - print the output generated by statmap to stdout 

( not the mappings, just the messages )

CLEANUP - remove the test file generated by each test.

If one is interested in testing various options, set CLEANUP to false, run
the test of interest, and then you can re-run statmap on the data files. 

simulate_chipseq.py:

This simulates to explore the variance estimates by simulating a chip-seq 
experiment, with a paralog and plotting the starting location and updated read 
densities in bootstrap_samples.png.
nboley/statmap