/************************************************************************************\ * * * Copyright (c) 2014, Dr. Eugene W. Myers (EWM). All rights reserved. * * * * Redistribution and use in source and binary forms, with or without modification, * * are permitted provided that the following conditions are met: * * * * · Redistributions of source code must retain the above copyright notice, this * * list of conditions and the following disclaimer. * * * * · Redistributions in binary form must reproduce the above copyright notice, this * * list of conditions and the following disclaimer in the documentation and/or * * other materials provided with the distribution. * * * * · The name of EWM may not be used to endorse or promote products derived from * * this software without specific prior written permission. * * * * THIS SOFTWARE IS PROVIDED BY EWM ”AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, * * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND * * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL EWM BE LIABLE * * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS * * OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING * * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN * * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * * * For any issues regarding this software and its use, contact EWM at: * * * * Eugene W. Myers Jr. * * Bautzner Str. 122e * * 01099 Dresden * * GERMANY * * Email: gene.myers@gmail.com * * * \************************************************************************************/ /************************************************************************************\ UPGRADE & DEVELOPER NOTES ! ! ! If you have already performed a big comparison and don't want to recompute all your local alignments in .las files, but do want to use a more recent version of the software that entails a change to the data structures (currently the update on December 31, 2014), please note the routine LAupgrade.31.2014. This take an .las file, say X.las, as an argument, and writes to standard output the .las file in the new format. The program can be made with "make" but is not by default created when make is called without an argument. For those interested in the details, on December 30, the "alen" and "blen" fields were dropped to save space as they can always be gotten from the underlying DB. \************************************************************************************/ The Daligner Overlap Library Author: Gene Myers First: July 17, 2013 Current: December 31, 2014 The commands below permit one to find all significant local alignments between reads encoded in Dazzler database. The assumption is that the reads are from a PACBIO RS II long read sequencer. That is the reads are long and noisy, up to 15% on average. Recall that a database has a current partition that divides it into blocks of a size that can conveniently be handled by calling the "dalign" overlapper on all the pairs of blocks producing a collection of .las local alignment files that can then be sorted and merged into an ordered sequence of sorted files containing all alignments between reads in the data set. The alignment records are parsimonious in that they do not record an alignment but simply a set of trace points, typically every 100bp or so, that allow the efficient reconstruction of alignments on demand. 1. daligner [-vbAI] [-k<int(14)>] [-w<int(6)>] [-h<int(35)>] [-t<int>] [-M<int>] [-e<double(.70)] [-l<int(1000)] [-s<int(100)>] [-H<int>] [-m<track>]+ <subject:db|dam> <target:db|dam> ... Compare sequences in the trimmed <subject> block against those in the list of <target> blocks searching for local alignments involving at least -l base pairs (default 1000) or more, that have an average correlation rate of -e (default 70%). The local alignments found will be output in a sparse encoding where a trace point on the alignment is recorded every -s base pairs of the a-read (default 100bp). Reads are compared in both orientations and local alignments meeting the criteria are output to one of several created files described below. The -v option turns on a verbose reporting mode that gives statistics on each major step of the computation. The options -k, -h, and -w control the initial filtration search for possible matches between reads. Specifically, our search code looks for a pair of diagonal bands of width 2^w (default 2^6 = 64) that contain a collection of exact matching k-mers (default 14) between the two reads, such that the total number of bases covered by the k-mer hits is h (default 35). k cannot be larger than 32 in the current implementation. If the -b option is set, then the daligner assumes the data has a strong compositional bias (e.g. >65% AT rich), and at the cost of a bit more time, dynamically adjusts k-mer sizes depending on compositional bias, so that the mers used have an effective specificity of 4^k. If there are one or more interval tracks specified with the -m option, then the reads of the DB or DB's to which the mask applies are soft masked with the union of the intervals of all the interval tracks that apply, that is any k-mers that contain any bases in any of the masked intervals are ignored for the purposes of seeding a match. An interval track is a track, such as the "dust" track created by DBdust, that encodes a set of intervals over either the untrimmed or trimmed DB. Invariably, some k-mers are significantly over-represented (e.g. homopolymer runs). These k-mers create an excessive number of matching k-mer pairs and left unaddressed would cause daligner to overflow the available physical memory. One way to deal with this is to explicitly set the -t parameter which suppresses the use of any k-mer that occurs more than t times in either the subject or target block. However, a better way to handle the situation is to let the program automatically select a value of t that meets a given memory usage limit specified (in Gb) by the -M parameter. By default daligner will use the amount of physical memory as the choice for -M. If you want to use less, say only 8Gb on a 24Gb HPC cluster node because you want to run 3 daligner jobs on the node, then specify -M8. Specifying -M0 basically indicates that you do not want daligner to self adjust k-mer suppression to fit within a given amount of memory. For each subject, target pair of blocks, say X and Y, the program reports alignments where the a-read is in X and the b-read is in Y, and vice versa. However, if the -A option is set ("A" for "asymmetric") then just overlaps where the a-read is in X and the b-read is in Y are reported, and if X = Y, then it further reports only those overlaps where the a-read index is less than the b-read index. In either case, if the -I option is set ("I" for "identity") then when X = Y, overlaps between different portions of the same read will also be found and reported. Each found alignment is recorded as -- a[ab,ae] x bo[bb,be] -- where a and b are the indices (in the trimmed DB) of the reads that overlap, o indicates whether the b-read is from the same or opposite strand, and [ab,ae] and [bb,be] are the intervals of a and bo, respectively, that align. The program places these alignment records in files whose name is of the form X.Y.[C|N]#.las where C indicates that the b-reads are complemented and N indicates they are not (both comparisons are performed) and # is the thread that detected and wrote out the collection of alignments contained in the file. That is the file X.Y.O#.las contains the alignments produced by thread # for which the a-read is from X and the b-read is from Y and in orientation O. The command "daligner -A X Y" produces 2*NTHREAD thread files X.Y.?.las and "daligner X Y" produces 4*NTHREAD files X.Y.?.las and Y.X.?.las (unless X=Y in which case only NTHREAD files, X.X.?.las, are produced). By default daligner compares all overlaps between reads in the database that are greater than the minimum cutoff set when the DB or DBs were split, typically 1 or 2 Kbp. However, the HGAP assembly pipeline only wants to correct large reads, say 8Kbp or over, and so needs only the overlaps where the a-read is one of the large reads. By setting the -H parameter to say N, one alters daligner so that it only reports overlaps where the a-read is over N base-pairs long. While the default parameter settings are good for raw Pacbio data, daligner can be used for efficiently finding alignments in corrected reads or other less noisy reads. For example, for mapping applications against .dams we run "daligner -k20 -h60 -e.85" and on corrected reads, we typically run "daligner -k25 -w5 -h60 -e.95 -s500" and at these settings it is very fast. 2. LAsort [-v] <align:las> ... Sort each .las alignment file specified on the command line. For each file it reads in all the overlaps in the file and sorts them in lexicographical order of (a,b,o,ab) assuming each alignment is recorded as a[ab,ae] x b^o[bb,be]. It then writes them all to a file named <align>.S.las (assuming that the input file was <align>.las). With the -v option set then the program reports the number of records read and written. 3. LAmerge [-v] <merge:las> <parts:las> ... Merge the .las files <parts> into a singled sorted file <merge>, where it is assumed that the input <parts> files are sorted. Due to operating system limits, the number of <parts> files must be <= 252. With the -v option set the program reports the # of records read and written. Used correctly, LAmerge and LAsort together allow one to perform an "external" sort that produces a collection of sorted files containing in aggregate all the local alignments found by the daligner, such that their concatenation is sorted in order of (a,b,o,ab). In particular, this means that all the alignments for a given a-read will be found consecutively in one of the files. So computations that need to look at all the alignments for a given read can operate in simple sequential scans of these sorted files. 4. LAshow [-carUF] [-i<int(4)>] [-w<int(100)>] [-b<int(10)>] <src1:db|dam> [ <src2:db|dam> ] <align:las> [ <reads:FILE> | <reads:range> ... ] LAshow produces a printed listing of the local alignments contained in the specified .las file, where the a- and b-reads come from src1 or from src1 and scr2, respectively. If a file or list of read ranges is given then only the overlaps for which the a-read is in the set specified by the file or list are displayed. See DBshow for an explanation of how the file and list of read ranges are interpreted. If the -F option is set then the roles of the a- and b- reads are reversed in the display. If the -c option is given then a cartoon rendering is displayed, and if -a or -r option is set then an alignment of the local alignment is displayed. The -a option puts exactly -w columns per segment of the display, whereas the -r option puts exactly -w a-read symbols in each segment of the display. The -r display mode is useful when one wants to visually compare two alignments involving the same a-read. If a combination of the -c, -a, and -r flags is set, then the cartoon comes first, then the -a alignment, and lastly the -r alignment. The -i option sets the indent for the cartoon and/or alignment displays, if they are requested. The -b option sets the number of symbols on either side of the aligned segments in an alignment display, and -U specifies that uppercase should be used for DNA sequence instead of the default lowercase. 5. LAcat <source:las> > <target>.las Given argument <source>, find all files <source>.1.las, <source>.2.las, ... <source>.n.<las> where <source>.i.las exists for every i in [1,n]. Then concatenate these files in order into a single .las file and pipe the result to the standard output. 6. LAsplit <target:las> (<parts:int> | <path:db|dam>) < <source>.las If the second argument is an integer n, then divide the alignment file <source>, piped in through the standard input, as evenly as possible into n alignment files with the name <target>.i.las for i in [1,n], subject to the restriction that all alignment records for a given a-read are in the same file. If the second argument refers to a database <path>.db that has been partitioned, then divide the input alignment file into block .las files where all records whose a-read is in <path>.i.db are in <align>.i.las. 7. LAcheck [-vS] <src1:db|dam> [ <src2:db|dam> ] <align:las> ... LAcheck checks each .las file for structural integrity, where the a- and b-sequences come from src1 or from src1 and scr2, respectively. That is, it makes sure each file makes sense as a plausible .las file, e.g. values are not out of bound, the number of records is correct, the number of trace points for a record is correct, and so on. If the -S option is set then it further checks that the alignments are in sorted order. If the -v option is set then a line is output for each .las file saying either the file is OK or reporting the first error. If the -v option is not set then the program runs silently. The exit status is 0 if every file is deemed good, and 1 if at least one of the files looks corrupted. 8. HPCdaligner [-vbAI] [-k<int(14)>] [-w<int(6)>] [-h<int(35)>] [-t<int>] [-M<int>] [-e<double(.70)] [-l<int(1000)] [-s<int(100)>] [-H<int>] [-m<track>]+ [-dal<int(4)>] [-deg<int(25)>] <path:db|dam> [<first:int>[-<last:int>]] HPCdaligner writes a UNIX shell script to the standard output that consists of a sequence of commands that effectively run daligner on all pairs of blocks of a split database and then externally sorts and merges them using LAsort and LAmerge into a collection of alignment files with names <path>.#.las where # ranges from 1 to the number of blocks the data base is split into. These sorted files if concatenated by say LAcat would contain all the alignments in sorted order (of a-read, then b-read, ...). Moreover, all overlaps for a given a-read are guaranteed to not be split across files, so one can run artifact analyzers or error correction on each sorted file in parallel. The data base must have been previously split by DBsplit and all the parameters, except -v, -dal, and -deg, are passed through to the calls to daligner. The defaults for these parameters are as for daligner. The -v flag, for verbose-mode, is also passed to all calls to LAsort and LAmerge. -dal and -deg options are described later. For a database divided into N sub-blocks, the calls to daligner will produce in total 2TN^2 .las files assuming daligner runs with T threads. These will then be sorted and merged into N^2 sorted .las files, one for each block pair. These are then merged in ceil(log_deg N) phases where the number of files decreases geometrically in -deg until there is 1 file per row of the N x N block matrix. So at the end one has N sorted .las files that when concatenated would give a single large sorted overlap file. The -dal option (default 4) gives the desired number of block comparisons per call to daligner. Some must contain dal-1 comparisons, and the first dal-2 block comparisons even less, but the HPCdaligner "planner" does the best it can to give an average load of dal block comparisons per command. The -deg option (default 25) gives the maximum number of files that will be merged in a single LAmerge command. The planner makes the most even k-ary tree of merges, where the number of levels is ceil(log_deg N). If the integers <first> and <last> are missing then the script produced is for every block in the database. If <first> is present then HPCdaligner produces an incremental script that compares blocks <first> through <last> (<last> = <first> if not present) against each other and all previous blocks 1 through <first>-1, and then incrementally updates the .las files for blocks 1 through <first>-1, and creates the .las files for blocks <first> through <last>. Each UNIX command line output by the HPCdaligner can be a batch job (we use the && operator to combine several commands into one line to make this so). Dependencies between jobs can be maintained simply by first running all the daligner jobs, then all the initial sort jobs, and then all the jobs in each phase of the external merge sort. Each of these phases is separated by an informative comment line for your scripting convenience. 9. HPCmapper [-vb] [-k<int(20)>] [-w<int(6)>] [-h<int(50)>] [-t<int>] [-M<int>] [-e<double(.85)] [-l<int(1000)] [-s<int(100)>] [-H<int>] [-m<track>]+ [-dal<int(4)>] [-deg<int(25)>] <ref:db|dam> <reads:db|dam> [<first:int>[-<last:int>]] HPCmapper writes a UNIX shell script to the standard output that consists of a sequence of commands that effectively "maps" every read in the DB <reads> against a reference set of sequences in the DB <ref>, recording all the found local alignments in the sequence of files <ref>.<reads>.1.las, <ref>.<reads>.2.las, ... where <ref>.<reads>.k.las contains the alignments between all of <ref> and the k'th block of <reads>. The parameters are exactly the same as for HPCdaligner save that the -k, -h, and -e defaults are set appropriately for mapping, and the -A and -I options make no sense as <ref> and <reads> are expected to be distinct data sets. If the integers <first> and <last> are missing then the script produced is for every block in the database <reads>. If <first> is present then HPCmapper produces an script that compares blocks <first> through <last> (<last> = <first> if not present) against DAM <ref>. Example: // Recall G.db from the example in DAZZ_DB/README > cat G.db files = 1 1862 G Sim blocks = 2 size = 11 cutoff = 0 all = 0 0 0 1024 1024 1862 1862 > HPCdaligner -mdust -t5 G | csh -v // Run the HPCdaligner script # Dazzler jobs (2) dazzler -d -t5 -mdust G.1 G.1 dazzler -d -t5 -mdust G.2 G.1 G.2 # Initial sort jobs (4) LAsort G.1.G.1.*.las && LAmerge G.L1.1.1 G.1.G.1.*.S.las && rm G.1.G.1.*.S.las LAsort G.1.G.2.*.las && LAmerge G.L1.1.2 G.1.G.2.*.S.las && rm G.1.G.2.*.S.las LAsort G.2.G.1.*.las && LAmerge G.L1.2.1 G.2.G.1.*.S.las && rm G.2.G.1.*.S.las LAsort G.2.G.2.*.las && LAmerge G.L1.2.2 G.2.G.2.*.S.las && rm G.2.G.2.*.S.las # Level 1 jobs (2) LAmerge G.1 G.L1.1.1 G.L1.1.2 && rm G.L1.1.1.las G.L1.1.2.las LAmerge G.2 G.L1.2.1 G.L1.2.2 && rm G.L1.2.1.las G.L1.2.2.las > LAshow -c -a:G -w50 G.1 | more // Take a look at the result ! G.1: 34,510 records 1 9 c [ 0.. 1,876] x [ 9,017..10,825] ( 18 trace pts) 12645 A ---------+====> dif/(len1+len2) = 398/(1876+1808) = 21.61% B <====+--------- 9017 1 ..........gtg-cggt--caggggtgcctgc-t-t-atcgcaatgtta |||*||||**||||||||*||||*|*|*||**|*|*|||| 9008 gagaggccaagtggcggtggcaggggtg-ctgcgtcttatatccaggtta 27.5% 35 ta-ctgggtggttaaacttagccaggaaacctgttgaaataa-acggtgg ||*|||||||||||||*|**|*||*|*||||||*|**|||||*|*||||| 9057 tagctgggtggttaaa-tctg-ca-g-aacctg-t--aataacatggtgg 24.0% 83 -ctagtggcttgccgtttacccaacagaagcataatgaaa-tttgaaagt *||||||||*||||||||*||**||||*|||**|||||||*||||*|||| 9100 gctagtggc-tgccgttt-ccgcacag-agc--aatgaaaatttg-aagt 20.0% 131 ggtaggttcctgctgtct-acatacagaacgacggagcgaaaaggtaccg ||*|||||||||||||*|*||||*|*|*||||||||||*||||||||||* 9144 gg-aggttcctgctgt-tcacat-c-ggacgacggagc-aaaaggtacc- 16.0% ... > LAcat G >G.las // Combine G.1.las & G.2.las into a single .las file > LAshow G G | more // Take another look, now at G.las G: 62,654 records 1 9 c [ 0.. 1,876] x [ 9,017..10,825] : < 398 diffs ( 18 trace pts) 1 38 c [ 0.. 7,107] x [ 5,381..12,330] : < 1,614 diffs ( 71 trace pts) 1 49 n [ 5,493..14,521] x [ 0.. 9,065] : < 2,028 diffs ( 91 trace pts) 1 68 n [12,809..14,521] x [ 0.. 1,758] : < 373 diffs ( 17 trace pts) 1 147 c [ 0..13,352] x [ 854..14,069] : < 2,993 diffs (133 trace pts) 1 231 n [10,892..14,521] x [ 0.. 3,735] : < 816 diffs ( 37 trace pts) 1 292 c [ 3,835..14,521] x [ 0..10,702] : < 2,353 diffs (107 trace pts) 1 335 n [ 7,569..14,521] x [ 0.. 7,033] : < 1,544 diffs ( 70 trace pts) 1 377 c [ 9,602..14,521] x [ 0.. 5,009] : < 1,104 diffs ( 49 trace pts) 1 414 c [ 6,804..14,521] x [ 0.. 7,812] : < 1,745 diffs ( 77 trace pts) 1 415 c [ 0.. 3,613] x [ 7,685..11,224] : < 840 diffs ( 36 trace pts) 1 445 c [ 9,828..14,521] x [ 0.. 4,789] : < 1,036 diffs ( 47 trace pts) 1 464 n [ 0.. 1,942] x [12,416..14,281] : < 411 diffs ( 19 trace pts) ...