/telociraptor

Telomere Prediction and Genome Assembly Editing Tool

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

Telociraptor: Telomere Prediction and Genome Assembly Editing Tool

Telociraptor v0.8.0

For a better rendering and navigation of this document, please download and open ./docs/telociraptor.docs.html, or visit https://slimsuite.github.io/telociraptor/. Documentation can also be generated by running Telociraptor with the dochtml=T option. (R and pandoc must be installed - see below.)

Introduction

Telociraptor is a genome assembly editing tool that combined the standalone telomere prediction functions from Diploidocus with some simple genome assembly scaffolding functions that were originally packaged up as "GenomeTweak". Telociraptor predicts telomeres at the end of assembly scaffolds, where (by default) at least 50% of the terminal 50 bp matches the telomere regex provided. This approach is based on FindTelomeres by Jana Sperschneider. In addition, Telociraptor will generate a simple text assembly map (see below) with contig positions and gap lengths. By default, Telomere predictions are also made for each contig, and internal telomeres marked in the map. Finally, Telociraptor will try to fix scaffolding errors based on telomere predictions, either inverting or trimming the ends of scaffolds (see below). Alternatively, the assembly map can be manually edited/generated and used to create a new fasta file from an existing assembly. As an extra bonus, Telociraptor will generate the "telonull" telomere output and assembly gaps table used as input by ChromSyn.

~ Telomere finding [telomeres=T] ~

Telociraptor performs a regex-based search for Telomeres, based on FindTelomeres. By default, this looks for a canonical telomere motif of TTAGGG/CCCTAA, allowing for some variation. Telociraptor searches for a forward telomere regex sequence of C{2,4}T{1,2}A{1,3} at the 5' end, and a reverse sequence at the 3' end of T{1,3}A{1,2}G{2,4}. These can be set with telofwd=X and telorev=X. For each sequence, Telociraptor trims off any trailing Ns and then searches for telomere-like sequences at sequence ends. For each sequence, the presence/absence and length of trimming are reported for the 5' end (tel5 and trim5) and 3' end (tel3 and trim3), along with the total percentage telomeric sequence (TelPerc).

Telomeres are marked if at least 50% (teloperc=PERC) of the terminal 50 bp (telosize=INT) matches the appropriate regex. If either end contains a telomere, the total percentage of the sequence matching either regex is calculated as TelPerc. Note that this number neither restricts matches to the termini, not includes sequences within predicted telomeres that do not match the regex. By default, only sequences with telomeres are output to the *.telomeres.tdt output, but switching telonull=T will output all sequences. This can be useful if you also need a table of sequence lengths, and is the recommended input for ChromSyn.

~ GenomeTweak mode [tweak=T] ~

GenomeTweak was a simple tool designed to help manual curation of genome assembly scaffolding. A genome assembly file is provided with seqin=FILE. If run with mapout=FiLE then an assembly map will be generated from this assembly file. If a *.gaps.tdt is found (or gapfile=TDT), this will be used for gap positions. Otherwise, rje_seqlist will generate the gaps file and use that. If telofile=TDT is provided, telomere positions for the contigs will be loaded in. (These must match elements of the assembly map.) If that file is missing, the assembly will be broken into contigs and telomere prediction run on the contigs to generate.

If mapin=FILE is given, then GenomeTweak will generate a new assembly (seqout=FILE [$BASEFILE.tweak.fasta]) based on the map file and extracting sequences from the input assembly.

The map format is as follows:

||NewName Description>>SeqName:Start-End:Strand|~GapLen~| ... |SeqName:Start-End:Strand<<

Where there is a full-length sequence, Start-End is not required:

||NewName Description>>SeqName|~GapLen~| ... |SeqName<<

Gaps can have zero length.

If telomere predictions have been loaded from a table (must match SeqName exactly) then 5' and 3' telomeres will be annotated in the file with { and }:

||NewName Description>>{SeqName:Start-End:Strand|~GapLen~| ... |SeqName:Start-End:Strand}<<

GenomeTweak mode can also be used for some direct auto-correction of genome assemblies (autofix=T), producing *.tweak.* output (or seqout=FILE and fixout=FILE for the assembly and assembly map, respectively). This goes through up to four editing cycles, depending on input settings:

  1. Removal of any contigs given by badcontig=LIST and descaffold=LIST. The former will be entirely removed from the assembly (e.g. contigs that have been identified by DepthKopy as lacking sequence coverage), whilst the latter will be excised from scaffolds but left in the assembly as a contig. (These can be comma separated lists, or text files containing one contig per line.) If removed, the downstream gap will also be removed, unless it is the 3' end contig, in which case the upstream gap will be removed.

  2. Any regions in the MapIn file flanked by / characters (/../) will be inverted.

  3. Contigs and regions provided by invert=LIST will be inverted. These are searched directly in the sequence map and can either be direct matches, or have an internal .. that will map onto any internal sequence map elements. In this case, the entire chunk including the intervening region will be inverted.

  4. Finally, each sequence is considered in term and assessed with respect to internal telomere positions. First, end inversions are identified as inward-facing telomeres that are (a) within invlimit=INT bp of the end, and (b) more terminal than an outward-facing telomere. Following this, the ends of the scaffolds will be trimmed where there is an outward-facing telomere within trimlimit=INT bp of the end. If trimfrag=T then these will be split into contigs, otherwise the whole scaffold chunk will be a new sequence. Where a possible inversion or trim is identified but beyond the distance limits, it will appear in the log as a #NOTE.

Citation

Telociraptor has not yet been published. Please cite github in the meantime.


Running Telociraptor

Telociraptor is written in Python 2.x or Python 3.x and can be run directly from the commandline:

python $CODEPATH/telociraptor.py [OPTIONS]

If running as part of SLiMSuite, $CODEPATH will be the SLiMSuite tools/ directory. If running from the standalone Telociraptor git repo, $CODEPATH will be the path the to code/ directory. Please see details in the Telociraptor git repo for running on example data.

Dependencies

The main Telociraptor functions do not currently have any dependencies. To generate documentation with dochtml, R will need to be installed and a pandoc environment variable must be set, e.g.

export RSTUDIO_PANDOC=/Applications/RStudio.app/Contents/MacOS/pandoc

For Telociraptor documentation, run with dochtml=T and read the *.docs.html file generated.

Commandline options

A list of commandline options can be generated at run-time using the -h or help flags. Please see the general SLiMSuite documentation for details of how to use commandline options, including setting default values with INI files.

### ~ Main Telociraptor run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqin=FILE      : Input sequence assembly [None]
basefile=FILE   : Root of output file names [$SEQINBASE]
tweak=T/F       : Whether to execute GenomeTweak pipeline [False]
chromsyn=T/F    : Whether to execute ChromSyn preparation (gaps.tdt and telonull telomere prediction) [False]
dochtml=T/F     : Generate HTML Telociraptor documentation (*.docs.html) instead of main run [False]
### ~ Telomere options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
telomeres=T/F   : Whether to generate telomere predictions [True]
telonull=T/F    : Whether to output sequences without telomeres to telomere table [False]
telofile=TDT    : Delimited file of `seqname seqlen tel5 tel3` [$SEQINBASE.telomeres.tdt]
telofwd=X       : Regex for 5' telomere sequence search [C{2,4}T{1,2}A{1,3}]
telorev=X       : Regex for 5' telomere sequence search [T{1,3}A{1,2}G{2,4}]
telosize=INT    : Size of terminal regions (bp) to scan for telomeric repeats [50]
teloperc=PERC   : Percentage of telomeric region matching telomeric repeat to call as telomere [50]
### ~ Genome Tweak run options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
seqout=FILE     : Output sequence filename [$BASEFILE.tweak.fasta]
mapin=FILE      : Text file input for genome assembly map [None]
mapout=FILE     : Text file output for genome assembly map [$BASEFILE.ctgmap.txt]
gapfile=TDT     : Delimited file of `seqname start end seqlen gaplen` [$SEQINBASE.gaps.tdt]
autofix=T/F     : Whether to try to fix terminal inversions based on telomere predictions [True]
fixout=FILE     : Text file output for auto-fixed genome assembly map [$BASEFILE.tweak.txt]
badcontig=LIST  : List of contigs to completely remove from the assembly (prior to inversion) []
descaffold=LIST : List of contigs to remove from scaffolds but keep in assembly (prior to inversion) []
invert=LIST     : List of contigs/regions to invert (in order) []
invlimit=NUM    : Limit inversion distance to within X% (<=100) or Xbp (>100) of chromosome termini [25]
trimlimit=NUM   : Limit trimming distance to within X% (<=100) or Xbp (>100) of chromosome termini [5]
trimfrag=T/F    : Whether to fragment trimmed scaffold chunks into contigs [False]
### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###

Telociraptor workflow and options

Details of the Telociraptor workflow will be added with time. Please contact the author or raise an issue on GitHub if you have any questions.


© 2023 Richard Edwards | rich.edwards@uwa.edu.au