MarginPolish

MarginPolish is a graph-based assembly polisher. It iteratively finds multiple probable alignment paths for run-length-encoded reads and uses these to generate a refined sequence. It takes as input a FASTA assembly and an indexed BAM (ONT reads aligned to the assembly), and it produces a polished FASTA assembly.

While MarginPolish serves as a standalone assembly polisher, it is also part of an assembly pipeline which includes an ultrafast nanopore assembler Shasta and a multi-task RNN polisher HELEN. HELEN operates on images generated by MarginPolish.

Overview

MarginPolish takes reads and alignments from the BAM and generates an initial graph describing matches, inserts, and deletes observed in the alignments. It aligns each read to this graph and determines the probabilities for multiple likely alignments. After all reads are aligned, it generates weighted alignment scores for each node in the graph and uses these to determine a most-likely path through the graph. This path becomes the inital graph for the next iteration of the process. Iteration continues until the total weighted likelihood of alignments decreases between iteration steps, or until a configured maximum iteration count is reached.

All of these alignments are done in run-length space, which simplifies and cleans the alignments. This reduces effects from errors in homopolymer runs, which are the primary source of error in nanopore reads. After determining a most-likely run-length-encoded sequence, it expands the run-lengths to generate a final consensus sequence. This expansion is done using a Bayesian model which predicts the most likely true run-length from all run-length observations in the reads aligned to the node.

Execution Flow

MarginPolish first determines where there are read alignments on the initial assembly segments. It uses these positions to determine chunk boundaries, including a configurable overlap between chunks. Each chunk is polished separately (by a single thread) and the final consensus sequence is stored in memory. After all chunks are complete, MarginPolish stitches them together by aligning the overlap between chunks and finding a position to stitch at.

Assembly Workflow

For a detailed description of the end-to-end assembly workflow, see the documentation provided in the HELEN repository.

Installation

Dependencies

If compiling on Ubuntu, this will install all required packages:

apt-get -y install git make gcc g++ autoconf zlib1g-dev libcurl4-openssl-dev libbz2-dev libhdf5-dev

Note that libhdf5-dev is required for HELEN image generation. MarginPolish will work without this package, but will not include functionality for HELEN.

MarginPolish is compiled with cmake. We recommend using the latest cmake version, but 3.7 and higher are supported:

wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && sudo mkdir /opt/cmake && sudo sh cmake-3.14.4-Linux-x86_64.sh --prefix=/opt/cmake --skip-license && sudo ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake
cmake --version

Compilation

Check out the repository and submodules:

git clone https://github.com/UCSC-nanopore-cgl/marginPolish.git
cd marginPolish
git submodule update --init

Make build directory:

mkdir build
cd build

Generate Makefile and run:

cmake ..
make
./marginPolish

Running MarginPolish

marginPolish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options] 

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:
    BAM_FILE is the alignment of reads to the assembly (or reference).
    ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.
    PARAMS is the file with marginPolish parameters.

Default options:
    -h --help                : Print this help screen
    -a --logLevel            : Set the log level [default = info]
    -t --threads             : Set number of concurrent threads [default = 1]
    -o --outputBase          : Name to use for output files [default = 'output']
    -r --region              : If set, will only compute for given chromosomal region.
                                 Format: chr:start_pos-end_pos (chr3:2000-3000).

HELEN feature generation options:
    -f --produceFeatures     : output features for HELEN.
    -F --featureType         : output features of chunks for HELEN.  Valid types:
                                 splitRleWeight:  [default] run lengths split into chunks
                                 nuclAndRlWeight: split into nucleotide and run length (RL across nucleotides)
                                 rleWeight:       weighted likelihood from POA nodes (RLE)
                                 simpleWeight:    weighted likelihood from POA nodes (non-RLE)
    -L --splitRleWeightMaxRL : max run length (for 'splitRleWeight' type only) [default = 10]
    -u --trueReferenceBam    : true reference aligned to ASSEMBLY_FASTA, for HELEN
                               features.  Setting this parameter will include labels
                               in output.

Miscellaneous supplementary output options:
    -i --outputRepeatCounts  : Output base to write out the repeat counts [default = NULL]
    -j --outputPoaTsv        : Output base to write out the poa as TSV file [default = NULL]

Sample Execution

./marginPolish ../tests/NA12878.np.chr3.5kb.bam ../tests/hg19.chr3.9mb.fa ../params/allParams.np.json -o example_output

Data Formats

MarginPolish requires that the input BAM is indexed, as it uses defined chunk regions to multithread the analysis and needs the index to extract these.

MarginPolish can accept read information both in FASTA and FASTQ formats (aligned). If quality scores are present, the base likelihood is factored into alignment weight estimation.

CRAM format is currently unsupported.

Configuration

MarginPolish uses a JSON configuration file which contains model and runtime parameterization for:

Pairwise Alignment
Graph Alignment
Run Length Estimation
Chunking
Read Filtering

We find that the run-length estimation correctness is tied to the basecaller version that was used to generate the reads. We have trained run-length estimation models for Guppy Flip-Flop v2.3.3 and v2.3.5 on human samples, and a model for ecoli based on v2.3.3. These can be found in the params/ directory:

allParams.np.human.guppy-ff-233.json

allParams.np.human.guppy-ff-235.json

allParams.np.ecoli.json

These parameters also include configuration for MarginPolish (see below). There are other parameter files in this directory for testing and experimentation.

HELEN Image Generation

HELEN is a multi-task RNN polisher which operates on images produced by MarginPolish. The images summarize the state of the nodes in the alignment graph before run-length expansion. They include the weights associated with read observations aligned at each node.

If MarginPolish is configured to generate images (the -f option), it will output a single .h5 file for each thread.

MarginPolish produces different image types (used during development) which can be configured with the -F flag, but users must use the default type 'splitRleWeight' for the trained models HELEN provides.

To produce HELEN training images, run marginPolish with the -u flag. This takes an argument of an indexed BAM alignment of the truth sequence to the assembly. MarginPolish will extract the alignments from this BAM for each analyzed chunk. If there is a single alignment at this location with a sequence that approximately matches the chunk's size, it is used to label the images for both nucleotide and run-length.

Miscellaneous Output

The -i flag will output a TSV file for each chunk describing the observed run lengths at each node in the final alignment. This can be used to train a Bayesian model which can predict run lengths. The -j flag will output a representation of the POA for each chunk.

Resource Requirements

While comprehensive resource usage profiling has not been done yet, we find that memory usage scales linearly with thread count, read depth, and chunk size. For this reason, our default parameters downsample read depth to 50 or 64 and restrict chunk size to 1000 bases.

With these parameters, we find that 2GB of memory per thread is sufficient to run MarginPolish on genome-scale assemblies and alignment.

Across 13 whole-genome runs, we averaged roughly 350 CPU hours per gigabase of assembled sequence.

Docker

There is a MarginPolish Docker container provided at docker.io/kishwars/margin_polish. To run, the directory with your input files must be mounted onto the /data directory in the container. An example run:

docker run -it --rm --user=`id -u`:`id -g` --cpus="16" -v </your/current/directory>:/data kishwars/margin_polish:latest --help
docker run -it --rm --user=`id -u`:`id -g` --cpus="16" -v </your/current/directory>:/data kishwars/margin_polish:latest input.bam input.fasta /opt/MarginPolish/params/<model_name.json> -t 32 -o /data/ -f

The entrypoint is to a script which redirects output to a file called marginPolish.log. This also contains output from the time program documenting runtime and max memory usage.

The image can be built by navigating to the docker/ directory and running make. Alternatively, the image can be acquired by running

docker pull kishwars/margin_polish:latest

Miscellaneous

MarginPhase

MarginPhase is a program for simultaneous haplotyping and genotyping with long reads. It relies on much of the same infrastructure as MarginPolish; the code to run it lives in this repository, although the compilation of the executable is currently disabled. We refer you to the MarginPhase repo (from which this is forked) at:

https://github.com/benedictpaten/marginPhase

It is our intention to eventually integrate these two applications to produce polished haplotypes instead of a single consensus.

Tests

After building, run the 'allTests' executable in your build directory. This runs every test. You can comment out ones you don't want to run in tests/allTests.c

UCSC-nanopore-cgl/MarginPolish