HBAR-DTK: A Python repository from lhon

What is HBAR-DTK?
==================

HBAR-DTK stands for "Hierarchical-Based AssembleR Development ToolKit". The
code is derived from the prototype used to develop the "Hierarchical Genome
Assembly Process (HGAP)" that PacBio(R) provides to scientists who use the
PacBio RS(R) for genome assembly. To run HGAP from PacBio's official release,
check
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP
for the most up-to-date instructions.

HBAR-DTK provides a small number of Python scripts for testing and developing
hierarchical assembly algorithms. While it may be useful for people who want
to try PacBio(R) data for assemblies at different scales, it is not meant for
end users who want a "push-button" bioinformatics workflow for genome
assembly.

The major scripts and their functions are:

HBAR_WF.py:

A pypeflow-driven workflow script that processes data and submits jobs
to an SGE cluster for every step of the hierarchical genome assembly
process.

filterM4Query.py:

A short script for filtering and sorting the alignment output from the
blasr aligner so that the best hits of each read to the seeds are
identified across multiple chunks of the output files. This code is
called by HBAR_WF.py. It is not meant to be called directly from the
command line.
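
The best-hit selection described above can be sketched as follows. This is
a minimal illustration, not the code in filterM4Query.py. It assumes
whitespace-separated alignment records with the query (read) name, target
(seed) name, and score in the first three columns, and it assumes that a
lower score means a better alignment. The function name ``best_hits`` and
the ``max_hits`` parameter are hypothetical.

```python
import heapq
from collections import defaultdict


def best_hits(chunks, max_hits=1):
    """Return the best `max_hits` alignments per query across all chunks.

    `chunks` is an iterable of iterables of text lines (for example, open
    file handles). Assumed columns: query name, target name, score; a lower
    score is assumed to be better.
    """
    hits = defaultdict(list)
    for chunk in chunks:
        for line in chunk:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            query, target, score = fields[0], fields[1], int(fields[2])
            hits[query].append((score, target))
    # Keep only the lowest-scoring (assumed best) hits for each query.
    return {q: heapq.nsmallest(max_hits, h) for q, h in hits.items()}


# Made-up records standing in for two chunks of aligner output.
chunk_a = ["read1 seed1 -500", "read2 seed1 -300"]
chunk_b = ["read1 seed2 -900", "read2 seed3 -100"]
print(best_hits([chunk_a, chunk_b]))
```

Collecting all hits per read before ranking keeps the sketch short. For
large inputs, a bounded heap per read would hold memory use down.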

generate_preassemble_reads.py:

A script that takes the output from the blasr aligner and the read
sequence data and generates preassembled reads that can be fed into an
assembler. This code is called by HBAR_WF.py. It is not meant to be
called directly from the command line.

Some convenience scripts:

tig-sense.py:

A hack that uses pbdagcon to get the consensus of all contigs from the
Celera(R) Assembler unitig stage.

tig-sense_p.py:

A multiprocessing version of tig-sense.py. It can use more CPU cores for
the consensus tasks.

CA_best_edge_to_GML.py:

A simple script that converts Celera(R) Assembler's "best.edges" file to
a GML file, which can be loaded into Gephi to check the topology of the
best overlap graph.

Usage: python CA_best_edge_to_GML.py asm.gkp_store asm.tigStore best.edge output.gml
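
As a rough illustration of the output side of this conversion, the sketch
below writes a directed edge list as GML text. It does not parse
"best.edges" or the Celera(R) Assembler stores. The function name
``edges_to_gml`` and the node names in the example are hypothetical, and
the real script may emit additional node or edge attributes.

```python
def edges_to_gml(edges):
    """Render a list of directed (source, target) name pairs as GML text."""
    # Assign a numeric id to each node name in order of first appearance.
    node_ids = {}
    for source, target in edges:
        for name in (source, target):
            node_ids.setdefault(name, len(node_ids))

    lines = ["graph [", "  directed 1"]
    for name, node_id in node_ids.items():
        lines.append('  node [ id %d label "%s" ]' % (node_id, name))
    for source, target in edges:
        lines.append("  edge [ source %d target %d ]"
                     % (node_ids[source], node_ids[target]))
    lines.append("]")
    return "\n".join(lines) + "\n"


# Hypothetical node names; a real run would derive them from best.edges.
example = [("frag1", "frag2"), ("frag2", "frag3"), ("frag3", "frag1")]
print(edges_to_gml(example))
```

A contig whose best overlap graph closes on itself, as in this three-node
example, shows up as a cycle when the file is loaded into a graph viewer.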

circulization.py:

The script provides an ad hoc way to circularize a circular genome. The
logic used here is: (1) check whether the ends of a contig overlap;
(2) if so, trim both ends according to the overlap alignment. While
this is useful in some cases, IT IS NOT REALLY A RELIABLE WAY TO
CIRCULARIZE A GENOME. The proper way is to inspect the overlap graph
and confirm that the contig actually has a circular overlap graph
before trimming the ends. Overlapping ends can be caused by repeats.
Please be aware of the subtle problems that repeats can cause when
assembling a genome. One can use the ``CA_best_edge_to_GML.py`` script
to generate a GML file and check visually that a contig can indeed be
circularized unambiguously.

Usage: python circulization.py initial_contigs.fastq 20000 /tmp circularized_contigs.fastq

where "20000" is the length to align at the ends of each contig and
"/tmp" is the location for the temporary output used to run ``blasr``.
ContaminationTrimmer.py:

A script contributed by Brett Bowman for trimming out vector sequences
or contamination using blasr. It is useful for pooled fosmid or BAC
sequence assembly. It is written in a different coding style from the
other scripts.

Check ``ContaminationTrimmer.py -h`` for usage.

Most of this code was written on the fly as it became necessary. More handy
scripts may be added to the repository in the near future if they help make
the manual HGAP process easier and more scalable. Some of the code may need
significant refactoring. While we use the code constantly and get good
results in our own computing environment, we have limited experience with
and testing on more diverse computing environments. Everyone is welcome to
download the code and modify it to install on their own system if it helps
automate some of these bioinformatics tasks. PacBio does not provide
official support for these scripts at this moment. However, bug reports and
suggestions for improvement are welcome.

More detailed installation notes and usage instructions can be found in the file HBAR_README.rst.
