What is HBAR-DTK? ================== HBAR-DTK stands for "Hierarchical-Based AssembleR Development ToolKit". The code is derived from the prototype process for developing the "Hierachical Genome Assembly Process (HGAP)" that PacBio(R) provides to the scientists who uses PacBio RS(R) for genome assembly. For the most update-to-date instruction on using the PacBio's official HGAP software, one should check the https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/HGAP link to see how to run HGAP from the official release. HBAR-DTK provides a small number of python scripts for testing and developing hierarchical-based assembly algorithms. While it might be useful for people who like to test it out on using PacBio(R) data for assemblies at different scales, it is not meant to be used by end-users who desire for a "push-button" bioinformatics workflow for genome assemblies. Here is the list of the functions of the major scripts: HBAR_WF.py: A pypeflow driven workflow script to process data and submmitting jobs to a SGE cluster for every steps of thehierachical genome assembly process filterM4Query.py: A short script for filtering and sorting the alignment output from blasr aligner so best hits of each reads to seeds are identified from mutiple chuncks of the output files. This code is called by the HBAR_WF.py. It is not meant to be called directly from command line. generate_preassemble_reads.py: A script taking the output from blasr aligner and the read sequence data to generate preassembled reads that can be feed into an assembler. This code is called by the HBAR_WF.py. It is not meant to be called directly from command line. Some convenient scripts: tig-sense.py: A hack to use pbdagcon to get consensus of all contigs from Celera(R) Assembler untig stage tig-sense_p.py: A multiprocessing version of the tig-sense.py. It can use more CPU cores for the consensus tasks CA_best_edge_to_GML.py: A simple script to convert Celera(R) Assembler's "best.edges" to a GML which can be used to feed into Gephi to check the topology of the best overlapping graph. Usage: python CA_best_edge_to_GML.py asm.gkp_store asm.tigStore best.edge output.gml circulization.py: The script provides an ad-hoc way to circularize a circular genome. The logic used here: (1) check whether the ends of a contig are overlapped. (2) If so, trimming both ends according the overlapped alignment. While this is useful for some case, IT IS NOT REALLY A RELIABLE WAY TO CIRCULARIZE A GENOME. The proper way is to look into the overlapping graph to ensure one actually gets a circular overlapping graph for a contig before trimming the ends. Overlapped ends can be due to repeats. Please understand the subtlety of the troubles that the repeats can cause during the process of assembling a genome. One can use the ``CA_best_edge_to_GML.py`` script to generate a GML file to check a contig can be indeed unambiguously circularized visually. Uasge: python circulization.py initial_contigs.fastq 20000 /tmp circulaized_contigs.fastq where "20000" is the length to align at the ends of each contigs and "/tmp" is the location for the temporary output used to run ``blasr`` ContaminationTrimmer.py: A script contributed by Brett Bowman for trimming out vector sequences or contamination using blasr. It is useful for pooled fosmid or BAC sequence assembly. It is a different coding style. check ``ContaminationTrimmer.py -h`` for usage Most of these code was written on-fly when it was necessary. More handy scripts may be added into the repository in the near furture if they are useful to make the manual HGAP more scalable and easier. Some of the code might need singificant and proper refactoring work. While we use the code constantly and get good results in our own computing environment, we have limited experience and testing on more diversed computing environments. However, everyone is welcome to download the code, modify it for installing in your system if it helps on automating some of these bioinformatics tasks. No official support from PacBio for using these scripts will be provided at this moment. However, bug reports or improvements suggestions are welcome. A more detailed installation note and usage can be found in the file HBAR_README.rst