triplexator: A C++ repository from Gurado

*** Triplexator - Finding Nucleic Acid Triple Helices ***

---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
  1.   Overview
  2.   Installation
  3.   Usage
  4.   Output Format
  5.   Examples
  6.   Contact

---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------

Triplexator is a tool for detecting nucleic acid triple helices and triplex
features in nucleotide sequences using the canonical triplex-formation
rules.

---------------------------------------------------------------------------
2. Installation 
---------------------------------------------------------------------------
---------------------------------------------------------------------------
2.1 Installation - binaries
---------------------------------------------------------------------------

Triplexator binaries are available for some plattforms from 
  http://code.google.com/p/triplexator/downloads/list

  1) untar:
     tar xf triplexator.<plattform>.tar.gz
  2) change directory:
     cd triplexator
  3) run triplexator:
     ./bin/triplexator --help
     This should output a brief usage message.

---------------------------------------------------------------------------
2.2 Installation - from source
---------------------------------------------------------------------------

Triplexator sources can be obtained from googlecode using git and build 
using cmake:

    1) obtain triplexator:
    >git clone https://code.google.com/p/triplexator/ triplexator
    
    2) change directory:
      >cd triplexator
    
    3) create directory and change into it:
      >mkdir -p build/Release && cd build/Release
    
    4) run cmake and make:
      >cmake ../.. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=g++ -G "Unix Makefiles" && make
    
    5) change directory:
      >cd ../..
    
    6) try the binary:
      >./bin/triplexator --help
    
    7) run the smoketest:
      >./demos/smoketest_triplexator.sh ./bin/triplexator 

On success, an executable file triplexator was build and a brief usage 
description has been dumped.

---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------

To get a short usage description of Triplexator, you can execute 
triplexator -h or  triplexator --help.

Usage: triplexator [OPTION]... -ss <SINGLE-STRANDED FILE> -ds <DUPLEX FILE>

Triplexator expects the names of one or two DNA/RNA (multi-) Fasta files and 
runs in different operative modes depending on which data are input. 


  [ -ss <FILE> ],  [ --single-strand-file <FILE> ]

  File in FASTA format that is searched for triplex-forming capability 
  (e.g. RNA or DNA). If only this file is supplied, Triplexator will search 
  and output putative Triplex Forming Oligonucleotides (TFOs) only.

  [ -ds <FILE> ],  [ --duplex-file <FILE> ]

  File in FASTA format that is searched for triplex-forming capability (e.g. DNA)
  If only these files are supplied <span id="sc">Triplexator</span> will search 
  and output Triplex Target Sites (TTSs) only.


---------------------------------------------------------------------------
3.1. Main Options
---------------------------------------------------------------------------

  [ -l NUM ],  [ --lower-length-bound NUM ]
  
  Specifies the minimum length of a TFO, TTS or triplex (TTS-TFO pair)
  
  [ -u NUM ],  [ --upper-length-bound NUM ]
  
  Specifies the maximum length of a TFO, TTS or triplex (TTS-TFO pair), 
  -1 = unrestricted  (default -1)
  
  [ -m ],  [ --triplex-motifs MOTIF1,MOTIF2,... ]
  
  Specifies the motifs from the canonical triplex-formation rules to be 
  used when searching for TFOs in the third strand:
  R - the purine motif that permit guanines (G) and adenines (A).
  Y - the pyrimidine motif that permit cytosines (C) and thymines (T).
  M - the mixed motif, purine-pyrimdine, that permit guanines (G) and 
      thymines (T).
  P - parallel binding, i.e. motifs facilitating Hoogsten bonds;
      this covers the pyrimidine motif and the purine-pyrimidine motif
	  in parallel configuration.
  A - anti-parallel binding, i.e. motifs facilitating reverse Hoogsten 
      bonds; this covers the purine motif and the purine-pyrimidine motif
	  in anti-parallel configuration.
	  
  By default all motifs are used.

  [ -mpmg NUM ],  [ --mixed-parallel-max-guanine NUM ]
  
  Specifies the maximum guanine proportion (in %) in a mixed-motif triplexes 
  (GT) to consider this feature for parallel binding (Hoogsteen bonds) 
  (default 100).
  
  As GT-TFOs can bind in either orientation this parameter can be used 
  to specify at which guanine content a GT-TFO should not be able to 
  bind in parallel orientation
  (because anti-parallel binding will always dominate due to high G content).
  
  Works in conjunction with --mixed-antiparallel-min-guanine but 
  parameters are keep separate as there may be a smooth transition
  between the binding modes.
  
  [ -mamg NUM ], [ --mixed-antiparallel-min-guanine NUM ]
  
  Specifies the minimum guanine proportion (in %) in a mixed-motif triplexes 
  (GT) to consider this feature for anti-parallel binding (reverse Hoogsteen 
  bonds) (default 0).

  As GT-TFOs can bind in either orientation this parameter can be used 
  to specify at which guanine content a GT-TFO should not be able to 
  bind in anti-parallel orientation.
  (because parallel binding will always dominate due to the low G content).
  
  Works in conjunction with --mixed-parallel-max-guanine but 
  parameters are keep separate as there may be a smooth transition
  between the binding modes.
  
  [ -e NUM ],  [ --error-rate NUM ]

  Set the maximal error-rate in % tolerated (default 5). 
  Triplexator searches for matches with an error-rate percent 
  of at most NUM. A match of a feature R with E errors has 
  error-rate of 100*(E/|R|), whereby |R| is the feature length. 
  In other words, a feature is allowed to have not more than 
  |R|*ceil(NUM)/100 errors.

  [ -E NUM ],  [ --maximal-error NUM ]
  
  Set the maximal overall error tolerated, disable with -1  (default -1).
  The maximal overall error is a hard threshold that can be used in 
  conjunction with the error-rate (see above).
  For example, in a scenario using an error-rate of 10% and a 
  maximal error of 3, the error-rate will be the limiting factor up to
  features of length 30, after which the maximal error takes over.
  
  [ -c NUM ],  [ --consecutive-errors NUM ]
  
  Sets the tolerated number of consecutive errors with respect to the
  canonical triplex rules as such were found to greatly destabilize 
  triplexes in vitro. The maximum permitted number is 3.
  
  [ -g NUM ],  [ --min-guanine NUM ]
  
  Set the minimum guanine conten in triplex features. NUM must be a value
  between 0 and 100 (default is 10). 
  The minimum guanine rate controls the ratio of guanines required in the 
  any triplex target site. For triplex-forming oligonucleotide this  
  constraint will be applied to their respective target. 
 
  [ -G NUM ],  [ --max-guanine NUM ]
  
  Set the maximum guanine conten in triplex features. NUM must be a value
  between 0 and 100 (default is 100). 
  The maximum guanine rate controls the ratio of guanines required in the 
  any triplex target site. For triplex-forming oligonucleotide this  
  constraint will be applied to their respective target. 
 
  [ -b NUM ],  [ --minimum-block-run NUM ]
  
  Sets the number of consecutive matches required in a feature 
  discarding any feature that violates this constrait (default 1).
  The rational behind this parameter is that a seed of consecutive matching
  positions of a given length is required to initiate triplex formation.
  Given the observation that central errors are more disruptive than 
  errors at the flanks of a triplex, this parameter will be especially
  effective for short features.
  
  Example: 
  feature 			valid -b	discarded with -b
  AGGAGAGtGAGAAAGA    <= 8	         >= 9
  AGGAGAGGAGAAAtGA    <= 13	         >= 14	
  
  [ -a ],   [--all-matches ]
  
  Flag indicates that all qualifying sub-matches should be processed and
  reported in addition to the longest match.
  Careful! This can result in hugh output files when searching for 
  TFO-TTS pairs (i.e. providing single-stranded and doubles-stranded input).
  		

  [ -mf NUM],  [ --merge-features NUM ]

   merge overlapping features into a cluster and report the spanning region
   Only supported for TFO and TTS detection, respectively. 
   For TFO-TTS pairs (triplexes) features are merged in the TFO and TTS
   detection phase on default.
   Any merge is performed before duplicate detection (-dd).
  		
  [ -dd NUM],  [ --detect-duplicates NUM ]

  Indiates whether and how duplicates should be detected (default 0). 
  Choices are:
  
  0 = off         do not detect any duplicates
  1 = permissive  detect duplicates in feature space, 
                  e.g. AGGGAcGAGGA != AGGGAtGAGGA
  2 = strict      detect duplicates in target space, 
                  e.g. AGGGAcGAGGA == AGGGAtGAGGA == AGGGAYGAGGA
				
  Detection of duplicates requires all input sequence to be present in 
  memory at the same time, which will increase memory consumption 
  particularly when whole genomes are under investigation. 
  
  It is further advised to enable filtering of repeat and low complexity
  regions to minimize the workload during duplicate detection. 

  [ -ssd [on|off] ], [--same-sequence-duplicates [on|off] ]
  
  Whether to count a feature copy in the same sequence as duplicates 
  or not. (default off)
 
  [ -v ],  [ --verbose ]
  
  Verbose. Print extra information and running times.

  [ -vv ],  [ --vverbose ]

  Very verbose. Like -v, but also print filtering statistics like true and
  false positives (TP/FP).

  [ -V ],  [ --version ]
  
  Print version information.

  [ -h ],  [ --help ]

  Print a brief usage summary.

---------------------------------------------------------------------------
3.2. Output Format Options
---------------------------------------------------------------------------

  [ -z ],  [ --zip ]
  
  Compress output with gzip on-the-fly.
  Requires gzip and boost libraries during compilation.
  
  [ -dl ],  [ --duplicate-locations ]
  
  If enabled, the locations of duplicates are reported for individual
  triplex features. Requires the setting of --duplicates-cutoff to > 0.

  [ -ns [on|off] ], [ --normalized-score [on|off] ]
  
  Whether to compute the triplex potenial normalized over the sequences 
  (default off).

  [ -o FILE ],  [ --output FILE ]

  Change the output filename to FILE. By default, this is the input file
  name extended by the suffix ".TFO", ".TTS" or ".TRIPLEX" depending on
  the operative mode.

  [ -od FILEDIR ],  [ --output-directory FILEDIR ]

  Specifies the output directory where the result files will be written.
  By default the current directory is used.
  
  [ -of NUM ],  [ --output-format NUM ]

  Select the output format the matches should be stored in. See section 4.
 
  [ -po ],  [ --pretty-output ]

  Pretty output indicates matches with capital letters and deviations from 
  the triplex-formation rules by small letters.
  
  [ -er NUM ],  [ --error-reference NUM ]
  
  Sets the reference to which the error should correspond (default 0)
  0 = the Watson strand of the target (TTS)
  1 = the purine strand of the target (TTS)
  2 = the third strand (TFO)
  
---------------------------------------------------------------------------
3.3. Filtration Options
---------------------------------------------------------------------------

  [ -fm NUM],  [ --filtering-mode NUM ]
  
  Method to quickly discard non-hits (default 1). 
  
  0 = brute-force approach      use no filtering, go the extra mile
  1 = q-gram filtering          filter hits using qgrams
  
  G-gram filtering will use more memory but can improve runtime.
  The greedy approach, however, will catch up on runtime
  when the q-grams get very small, due to an high error-rate,
  small minimum triplex length or disabled repeat filtering.

  The qgram weight is calculated as followed:
  min(14.0,floor((qgramThreshold -1 -minLength)/-(ceil(errorRate*minLength)+1)))

  [ -t NUM ],  [ --qgram-threshold NUM ]
  
  Minimal number of q-grams required per potential hit (default 2).
  A higher threshold means more stringent filtering therefore requiring
  fewer validations but also leads to shorter qgrams, which increases the
  number of lookups.
                                             
  [ -fr ],  [ --filter-repeats NUM ]
  
  Activates the filtering of low complexity regions and repeats in the 
  sequence data. This option can greatly decrease the memory consumption 
  and runtime of Triplexator. However, many repeat regions comply to 
  triplex-formation rules. Hence this option is deactivated by default.
  
  [ -mrl NUM ] [ --minimum-repeat-length NUM ]
  
  Only considered with -r. Specifies the minimum length of a repeat 
  
  [ -mrp NUM ],  [ --maximum-repeat-period NUM ]
  
  Only considered with -r. Maximum period that defines a repeat or low 
  complexity region. 
  
  [ -dc NUM ],  [ --duplicates-cutoff NUM ]
  
  Feature is disregarded if it occurs more often than specified with
  this cutoff. Disable filtering by setting cutoff to -1. (default -1)

---------------------------------------------------------------------------
3.4. Performance Options
---------------------------------------------------------------------------

  Performance Options are only available if Triplexator has been compiled
  with OpenMP enabled.

  [ -rm NUM ],  [ --runtime-mode NUM ]
  
  The computational bottle-neck of triplexator is the matching of TFOs with 
  their putative targets when detecting the triplexes. Therefore, 
  Triplexator can leverage multi-processor architectures on bases of OpenMP.
  
  Depending on the dataset and the computational resources different 
  runtime modes can be chosen from. See below for additional information.
  
  0 = Serial (default)
  1 = Parallelize TTSs
  2 = Parallelize duplexes
  
  In case of memory capacity issues it can be helpful to divide the 
  single-strand sequence file into several smaller chunks and to execute
  Triplexator on each of them. Is can also be helpful to use this approach
  and distribute the smaller task over a cluster.
  
  [ -p NUM ],  [ --processors NUM ]
  
  Number of processors used when executed in parallel mode.
  Specify -1 to detect automatically. (default -1)
  
---------------------------------------------------------------------------
3.4.1 Serial
---------------------------------------------------------------------------

  The default option performs the triplex-matching serially. This option
  is a good tradeof if memory is a constraint or Triplexator is run on a 
  one processor achitecture.
  
---------------------------------------------------------------------------
3.4.2 Parallelize triplex target sites (TTSs)
---------------------------------------------------------------------------

  In case the duplex sequences are rather long, i.e. chromosomes, this 
  is the appropriate runtime-mode. Parallelize triplex matches evaluates  
  one duplex sequence at a time but parallelize the matching of all its  
  putative triplex target sites when searching for suitable partners in the 
  single-strand sequence set.
 
---------------------------------------------------------------------------
3.4.3 Parallelize duplexes
---------------------------------------------------------------------------

  This is the appropriate runmode-option in case many rather small duplex 
  sequences are searched for their triplex potential. Parallelize duplexes
  reads all duplex sequences into memory and performs the triplex search 
  in parallel trading runtime for memory consumption.  
 
---------------------------------------------------------------------------
4. Output Formats
---------------------------------------------------------------------------

  [ -of NUM ],  [ --output-format NUM ]

  Triplexator supports currently 3 different output formats:

  0 = Tab-separated Format + Summary Format
  1 = Triplexator Format + Summary Format
  2 = Summary Format only
  
  All output formats are sensitive to the operative mode that Triplexator
  runs in, i.e. the results for the search of TFOs, TTSs and triplexes.
  
  In addition a log file will be generated each time Triplexator is run.
 
---------------------------------------------------------------------------
4.1. Tab-separated Format
---------------------------------------------------------------------------

  The tabulator separated file contains a header line indicated with an "#"
  followed by the meaning of each column (header) as indicated below by a 
  separating pipe symbol "|". Each line corresponds to one idividual entry 
  of a triplex feature (TFO/TTS) or triplex match.

  Searching putative triplex-forming oligonucleotides:
    Sequence-ID|Start|End|Score|Motif|Error-rate|Errors|Guanine-rate|
    			...Duplicates|TFO
	
  Searching putative triplex target sites:
	Duplex-ID|Start|End|Score|Strand|Error-rate|Errors|Guanine-rate|
				...Duplicates|TTS
	
  Searching putative triplexes:
	Sequence-ID|TFO start|TFO end|Duplex-ID|TTS start|TTS end|Score|Error-rate|
	            ...Motif|Strand|Orientation|Guanine-rate
	
  with the following meaning:
  - Sequence-ID     : id of the single-stranded sequence providing the TFO
  - Duplex-ID       : id of the double-stranded sequence providing the TTS
  - Start [TFO/TTS] : start index of the feature
  - End [TFO/TTS]   : end index of the feature
  - Score           : score of the feature/triplex (number of matches)
  - Motif           : Binding motif of the canonical triplex rules used  
                      to find this TFO (R|Y|M)
  - Error-rate      : rate of deviations from the canonical rules for this 
                      feature/triplex
  - Errors          : Deviation from the triplex rulesets are encoded with
                      respect to the participant that contains the 
                      deviation, i.e "d3" means that in the duplex 
                      nucleotide 3 does not match the rules (starting at 0). 
                      "o5" means that in the third strand oligonucleotide 
                      the 5th position deviates and "b3" means that both 
                      participants contain an error, while "t3" means that 
                      the participants are conform to their individual rules
                      but don't match. 
                      Coordinates depend on the setting of the parameter 
                      --error-reference.
  - Duplicates      : Number of times this triplex feature or respective
                      target does occur in the corresponding sequence set
					  NOT supported when searching for triplexes.
  - TFO/TTS         : sequence of the TFO/TTS
  - Strand          : strand of the duplex providing the poly-purine tract
  - Orientation     : parallel or anti-parallel binding of the TFO to the 
                      TTS [P|A]
  - Guanine-rate    : proportion of guanines w/r/t the target able to 
                      participate in triplex formation                  
---------------------------------------------------------------------------
4.2. Triplexator Format
---------------------------------------------------------------------------

  This format is complies to fasta format when searching either for TFOs or
  for TTSs. When searching TFO-TTS pairs a visual representation of the
  alignment is given as well.

---------------------------------------------------------------------------
4.3. Summary Format
---------------------------------------------------------------------------

  The summary file will be generated each time Triplexator is started. 
  The summary file is a tab-separated file that aggregates triplex feature 
  and triplex hits for the corresponding sequence(s).
  
  The absolute number of features/triplexes is given for all motifs 
  individually (abs), while the relative measure (triplex potential) is 
  adjusted for the sequence and feature length of the sequence(s).
  
  Note, to save disk space only sequences that have at least one triplex
  feature are output.
  
  Searching triplex-forming oligonucleotides:
    Sequence-ID|TFOs (abs)|TFOs (rel)|GA (abs)|GA (rel)|TC (abs)|TC (rel)|
	            ...GT (abs)|GT (rel)
	
  Searching triplex target sites:
	Duplex-ID|TTSs (abs)|TTSs (rel)
	
  Searching triplexes:
	Duplex-ID|Sequence-ID|Total (abs)|Total (rel)|GA (abs)|GA (rel)|TC (abs)|
	            ...TC (rel)|GT (abs)|GT (rel)
  
  with the following meaning:
  - Sequence-ID     : id of the single-stranded sequence providing the TFO
  - Duplex-ID       : id of the double-stranded sequence providing the TTS
  - TFOs (abs)      : absolute number of maximal TFOs found in the sequence
                      over all triplex motifs   
  - TFOs (rel)      : length-adjusted triplex potential wrt. TFO features
                      over all triplex motifs 
  - TTSs (abs)      : absolute number of maximal TTSs found in the sequence
  - TTSs (rel)      : length-adjusted triplex potential wrt. TTS features
  - Total (abs)     : absolute number of maximal TFO-TTS pairs found in the 
                      two sequences
  - Total (rel)     : length-adjusted triplex potential wrt. TFO-TTS pairs
                      found in the two sequences
  - GA/TC/GT (abs)  : absolute number of maximal TFOs or TFO-TTS pairs 
                      (depending on the context)
  - GA/TC/GT (rel)  : length-adjusted triplex potential or TFOs or TFO-TTS 
                      pairs wrt. the specified motif   
                      (depending on the context)

---------------------------------------------------------------------------
5. Examples
---------------------------------------------------------------------------

---------------------------------------------------------------------------
5.1. Identify TFOs in single-strand sequences
---------------------------------------------------------------------------

  We want to find all putative triplex-forming oligonucleotide in a set of 
  <transcripts> subject to the following specifics:
  
	- at least 20 bps in length "-l 20"
	- having at most 15% errors in the motif "-e 15"
	- we may be only interested in TFOs that form triplexes of the 
	  purine-pyrimidine motif "-m M"
	- we want to remove low complexity regions of length >= 7 and period <=1 
	  (e.g. for polyA filtering) "-fr on -mrl 7 -mrp 1" 
	- output all sites "-of 0"
	- output to the file names <transcripts>.TFO "-o <transcripts>.TFO"
	- place the results in a specific forder "-od <folder>"

  >triplexator -l 20 -e 15 -m M -fr on -mrl 7 -mrp 1 -of 0 -od <folder> 
		-o <transcripts>.TFO -ss <transcripts>.fasta 
  
---------------------------------------------------------------------------
5.2. Identify high quality putative TTSs in a genome 
---------------------------------------------------------------------------

  We want to find all putative target sites in <genome>, which comply to 
  the following specifics:
  
	- at least 15 bps in length "-l 15"
	- containing at least 50% guanines "-g 50"
	- having at most 10% pyrimidine interruptions "-e 10"
	- filtered for low complexity regions of length >= 7 and period <=3 
	  "-fr on -mrl 7 -mrp 3"
	- at most 5 duplicates in the whole genome "-dc 5"
	- output all sites "-of 0"
	- output to the file names <genome>.TTS "-o <genome>.TTS" 
	
  >triplexator -l 15 -g 50 -e 10 -fr on -mrl 7 -mrp 3 -dc 5 -of 0 
        -o <genome>.TTS -ds <genome>.fasta 

---------------------------------------------------------------------------
5.3. Identify TFO-TTS pairs in single-strand and duplex sequences
---------------------------------------------------------------------------

  We want to find all putative triplexes that can form between a set of 
  <transcripts> and <promoters>, which comply to the following specifics:
  
	- at least 20 bps in length "-l 20"
	- having at most 5% mismatches and errors "-e 5"
	- filtered for low complexity regions of length >= 7 and period <=3 
	  "-fr on -mrl 7 -mrp 3"
	- output the alignments "-of 1"
	- we like to look at the alignments so make them pretty "-po "
	- output to the file names <transcripts_promoters>.TRIPLEX 
	  "-o <transcripts_promoters>.TRIPLEX" 
	- we don't have much time but lots of memory so run in parallel, 
	  promoters are fairly short, so parallelize on duplexes "-rm 3"
	- but don't use all my processors I still have to work, I'll give you 3 
	  "-p 3"

  >triplexator -l 20 -e 5 -fr on -mrl 7 -mrp 3 -dc 5 -of 1 
		-o <transcripts_promoters>.TRIPLEX -po -rm 3 -p 3
        -ss <transcripts>.fasta -ds <promoters>.fasta 

---------------------------------------------------------------------------
6. Contact
---------------------------------------------------------------------------

For questions or comments, contact:
  Fabian Buske <fbuske@uq.edu.au>
Gurado/triplexator