/HBAR-DTK

Hierarchical based assembler development toolkit

Primary LanguagePython

What is HBAR-DTK?

Last Modifide : June 11, 2013

HBAR-DTK stands for "Hierarchical-Based AssembleR Development ToolKit". The code is derived from the prototype process for developing the "Hierachical Genome Assembly Process (HGAP)" that PacBio(R) provides to the scientists who uses PacBio RS(R) for genome assembly. For the most update-to-date instruction on using the PacBio's official HGAP software, one should check the https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP link to see how to run HGAP from the official release.

HBAR-DTK provides a small number of python scripts for testing and developing hierarchical-based assembly algorithms. While it might be useful for people who like to test it out on using PacBio(R) data for assemblies at different scales, it is not meant to be used by end-users who desire for a "push-button" bioinformatics workflow for genome assemblies.

Here is the list of the functions of the major scripts:

HBAR_WF2.py:

    A pypeflow driven workflow script to process data and submmitting jobs
    to a SGE cluster for every steps of the hierachical genome assembly
    process. This workflow is designed for larger genome assembly project 
    where the data may not come at once. It uses different way from the 
    ``HBAR_WF.py`` for chunking the input sequencing data and gathering
    the alignment results. The old alignment data will be re-used if new 
    inputs are added. 

HBAR_WF.py:

    A pypeflow driven workflow script to process data and submmitting jobs
    to a SGE cluster for every steps of the hierachical genome assembly
    process

filterM4Query.py:

    A short script for filtering and sorting the alignment output from
    blasr aligner so best hits of each reads to seeds are identified from
    mutiple chuncks of the output files. This code is called by the
    HBAR_WF.py.  It is not meant to be called directly from command line.

generate_preassemble_reads.py:
    
    A script taking the output from blasr aligner and the read sequence
    data to generate preassembled reads that can be feed into an
    assembler. This code is called by the
    HBAR_WF.py.  It is not meant to be called directly from command line.

h5fofn_to_fasta.py:
    
    Used by ``HBAR_WF2.py`` to convert ``bas.h5``, ``bax.h5``, ``fasta``
    or ``fastq`` for its internal ``fasta`` files.

get_pread.py:

    a modified version of generate_preassemble_reads.py used by ``HBAR-WF2.py``

simple_asm.py:

    A preliminary implementation for the first step of "Consistent Long-read
    Evidence Essembly pRocess" (CLEAR)

Some convenient scripts:

tig-sense.py:
    
    A hack to use pbdagcon to get consensus of all contigs from Celera(R) Assembler
    untig stage

tig-sense_p.py:
    
    A multiprocessing version of the tig-sense.py. It can use more CPU cores for
    the consensus tasks 

CA_best_edge_to_GML.py:
    
    A simple script to convert Celera(R) Assembler's "best.edges" to a GML
    which can be used to feed into Gephi to check the topology of the best
    overlapping graph.

    Usage: python CA_best_edge_to_GML.py asm.gkp_store asm.tigStore best.edge output.gml 

circulization.py:

    The script provides an ad-hoc way to circularize a circular genome. The
    logic used here: (1) check whether the ends of a contig are overlapped.
    (2) If so, trimming both ends according the overlapped alignment. While
    this is useful for some case, IT IS NOT REALLY A RELIABLE WAY TO
    CIRCULARIZE A GENOME. The proper way is to look into the overlapping
    graph to ensure one actually gets a circular overlapping graph for a
    contig before trimming the ends. Overlapped ends can be due to repeats.
    Please understand the subtlety of the troubles that the repeats can
    cause during the process of assembling a genome.  One can use the
    ``CA_best_edge_to_GML.py`` script to generate a GML file to check a
    contig can be indeed unambiguously circularized visually.

    Uasge: python circulization.py initial_contigs.fastq 20000 /tmp circulaized_contigs.fastq

           where "20000" is the length to align at the ends of each contigs and
           "/tmp" is the location for the temporary output used to run ``blasr`` 

ContaminationTrimmer.py:
    
    A script contributed by Brett Bowman for trimming out vector sequences
    or contamination using blasr.  It is useful for pooled fosmid or BAC
    sequence assembly. It is a different coding style.

    check ``ContaminationTrimmer.py -h`` for usage

Most of these code was written on-fly when it was necessary. More handy scripts may be added into the repository in the near furture if they are useful to make the manual HGAP more scalable and easier. Some of the code might need singificant and proper refactoring work. While we use the code constantly and get good results in our own computing environment, we have limited experience and testing on more diversed computing environments. However, everyone is welcome to download the code, modify it for installing in your system if it helps on automating some of these bioinformatics tasks. No official support from PacBio for using these scripts will be provided at this moment. However, bug reports or improvements suggestions are welcome.

A more detailed installation note and usage can be found in the file HBAR_README.rst


<script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-13166584-23', 'github.com'); ga('send', 'pageview'); </script>