*** PLEASE GO TO THE DAZZLER BLOG (https://dazzlerblog.wordpress.com) FOR TYPESET ***
    DOCUMENTATION, EXAMPLES OF USE, AND DESIGN PHILOSOPHY.

/************************************************************************************\
*                                                                                    *
* Copyright (c) 2014, Dr. Eugene W. Myers (EWM). All rights reserved.                *
*                                                                                    *
* Redistribution and use in source and binary forms, with or without modification,   *
* are permitted provided that the following conditions are met:                      *
*                                                                                    *
*  · Redistributions of source code must retain the above copyright notice, this     *
*    list of conditions and the following disclaimer.                                *
*                                                                                    *
*  · Redistributions in binary form must reproduce the above copyright notice, this  *
*    list of conditions and the following disclaimer in the documentation and/or     *
*    other materials provided with the distribution.                                 *
*                                                                                    *
*  · The name of EWM may not be used to endorse or promote products derived from     *
*    this software without specific prior written permission.                        *
*                                                                                    *
* THIS SOFTWARE IS PROVIDED BY EWM ”AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES,    *
* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND       *
* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL EWM BE LIABLE   *
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES *
* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS  *
* OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY      *
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING     *
* NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN  *
* IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.                                      *
*                                                                                    *
* For any issues regarding this software and its use, contact EWM at:                *
*                                                                                    *
*   Eugene W. Myers Jr.                                                              *
*   Bautzner Str. 122e                                                               *
*   01099 Dresden                                                                    *
*   GERMANY                                                                          *
*   Email: gene.myers@gmail.com                                                      *
*                                                                                    *
\************************************************************************************/

                The Dextractor and Compression Command Library

                                  Authors:  Gene Myers, Martin Pippel
                                  First:    December 21, 2013
                                  Current:  February 26, 2014

The Dextractor commands allow one to pull exactly and only the information needed for
assembly and reconstruction from the source .bax.h5 HDF5 files produced by the PacBio
RS II sequencer. Generally speaking, this information is the sequence of all the reads
coded in the .bax.h5 file and a number of quality value (QV) streams needed by Quiver
to produce a highly accurate consensus sequence as the last step in the assembly
process. The Dextractor therefore produces a .fasta file of the sequence of all the
reads, and a .quiva file containing the QV stream information in a .fastq readable
format.

For each of these two file types the library contains commands to compress the given
file type, and to decompress it, which is a reversible process delivering the original
uncompressed file. In this way, users of a PacBio can keep the data needed for
assembly spooled up on disk in 1/14th the space occupied by the .bax.h5 files which
can be archived to a cheap backup medium such as tape, should the raw data ever need
to be consulted again (we expect never unless the spooled up data is compromised or
lost in some way). The compressor/decompressor pairs are endian-aware so moving
compressed files between machines is possible.

1. dextract [-vq] [-o[<path>]] [-l<int(500)>] [-s<int(750)>] <input:bax.h5> ...

The dextract'or takes the .bax.h5 files produced for a given SMRT cell as input and:

(a) if the -o option is set, then the information needed for Quiver is extracted and
    put in a file named <path>.quiva.
    If the -q option is not set, then the sequence of each read is placed in a file
    named <path>.fasta; otherwise a .fastq file of the sequence and the imputed
    "quality values" for each base in the sequence is placed in a file named
    <path>.fastq. We personally do not find these values useful and so never set -q,
    but we give you the option in case your downstream processes use such values. If
    <path> is missing, then the path of the first .bax.h5 file is used for the output
    file name, less any suffixes, which are replaced by .fasta and .quiva. E.g., the
    call "dextract -o EColi.1.bax.h5 EColi.2.bax.h5 EColi.3.bax.h5" will result in
    the files EColi.fasta and EColi.quiva.

(b) if the -o option is not set, then if the -q option is also not set, a .fasta file
    of the sequence of each read is written to the standard output. Otherwise a
    .fastq file is written to the standard output.

If the -v option is set then the program reports the processing of each .bax.h5 file,
otherwise it runs silently.

The parameter -l determines the shortest read length to be extracted (default 500) and
the -s parameter determines the minimum quality/score of reads to be extracted
(default 750 = 75%).

2. dexta   [-vk] ( -i | <path:fasta> ... )
   undexta [-vkU] [-w<int(80)>] ( -i | <path:dexta> ... )

Dexta compresses a set of .fasta files (produced by either PacBio's software or
dextract) and replaces them with new files with a .dexta extension. That is,
submitting G.fasta will result in a compressed image G.dexta, and G.fasta will no
longer exist. With the -k option the .fasta source is *not* removed. If -v is set,
then the program reports its progress on each file. Otherwise it runs completely
silently (good for batch jobs to an HPC cluster). The compression factor is always
slightly better than 4.0.

Undexta reverses the compression of dexta, replacing G.dexta with its uncompressed
image G.fasta. By default the sequences output by undexta are in lower case and 80
chars per line.
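
A compression factor near 4 is what one would expect from storing each base in 2 bits
instead of one 8-bit character. The sketch below illustrates only that idea, together
with the lower-case, 80-column default of undexta. It is NOT the .dexta file format,
which this document does not describe, and the names pack, unpack, and wrap are
hypothetical helpers invented for the illustration.

```python
# Illustrative only: this is not the actual .dexta format, just the
# 2-bits-per-base idea that makes a factor of about 4 plausible.
CODE = {"a": 0, "c": 1, "g": 2, "t": 3}
BASES = "acgt"


def pack(seq: str) -> bytes:
    """Pack a DNA string (only a/c/g/t, any case) into 2 bits per base."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.lower()):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int, upper: bool = False) -> str:
    """Recover `length` bases; lower case by default, upper case on request."""
    seq = "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(length)
    )
    return seq.upper() if upper else seq


def wrap(seq: str, width: int = 80) -> str:
    """Break a sequence into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


if __name__ == "__main__":
    read = "acgt" * 50                        # 200 bases, 200 bytes as text
    packed = pack(read)                       # 50 bytes once packed
    assert unpack(packed, len(read)) == read  # the round trip is lossless
    print(len(read) / len(packed))            # prints 4.0
    print(wrap(unpack(packed, len(read), upper=True), 80))
```

The packed form must carry the read length separately, since the last byte may be only
partly used. Read headers and any bases other than a/c/g/t are ignored here.
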
The -U option specifies that upper case should be used, and the characters per line,
or line width, can be set to any positive value with the -w option.

With the -i option set, the programs run as UNIX pipes that take .fasta (.dexta) input
from the standard input and write .dexta (.fasta) to the standard output. In this case
the -k option has no effect.

3. dexqv   [-vkl] <path:quiva> ...
   undexqv [-vkU] <path:dexqv> ...

Dexqv compresses a set of .quiva files (produced by dextract) into new files with a
.dexqv extension. That is, submitting G.quiva will result in a compressed image
G.dexqv, and G.quiva will no longer exist. The -k flag prevents the removal of
G.quiva. With -v set progress is reported, otherwise the command runs silently. If
slightly more compression is desired at the expense of being a bit "lossy" then set
the -l option. This option is experimental in that it remains to be seen if Quiver
gives the same results with the scaled values responsible for the loss.

Undexqv reverses the compression of dexqv, replacing G.dexqv with its uncompressed
image G.quiva. The -v and -k flags are analogous to the same flags for dexqv. The
compression factor is typically 3.4 or so (4.0 or so with -l set). By .fastq
convention each QV vector is output by undexqv as a line without intervening
new-lines, and by default the Deletion Tag vector is in lower case letters. The -U
option specifies that upper case letters should be used for that vector instead.

To compile the programs you must have the HDF5 library installed on your system, and
its library and include files must be on the appropriate search paths. The HDF5
library in turn depends on the presence of zlib, so make sure it is also installed on
your system.
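
As a quick sanity check before compiling, one can ask whether the system's default
library search finds HDF5 and zlib at all. The sketch below uses Python's standard
ctypes.util.find_library. The function name find_build_libs is hypothetical, and the
short names "hdf5" and "z" are assumptions about how the libraries are installed. A
"not found" result does not prove the build will fail, and a hit does not prove that
the include files are on the compiler's search path.

```python
from ctypes.util import find_library


def find_build_libs(names=("hdf5", "z")):
    """Map each library name to what the default search finds, or None."""
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    for name, found in find_build_libs().items():
        # find_library returns a file name or path on success, None otherwise
        print(f"{name}: {found if found else 'not found on the default search path'}")
```
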
The most recent version of the source for the HDF5 library can be obtained at:

    http://www.hdfgroup.org/HDF5/release/obtainsrc.html

/*****************************************************************************\

PacBio Disclaimer

THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE
PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY
KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES
OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A
PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF
THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR
APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY
OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO
NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.

\******************************************************************************/