Cut sequences at positions with few spanning molecules.
Written by Shaun Jackman, Lauren Coombe, Justin Chu, and Janet Li.
Shaun D. Jackman, Lauren Coombe, Justin Chu, Rene L. Warren, Benjamin P. Vandervalk, Sarah Yeo, Zhuyi Xue, Hamid Mohamadi, Joerg Bohlmann, Steven J.M. Jones and Inanc Birol (2018). Tigmint: correcting assembly errors using linked reads from large molecules. BMC Bioinformatics, 19(1). doi:10.1186/s12859-018-2425-6
Tigmint identifies and corrects misassemblies using linked (e.g. MGI's stLFR, 10x Genomics Chromium) or long (e.g. Oxford Nanopore Technologies long reads) DNA sequencing reads. The reads are first aligned to the assembly, and the extents of the large DNA molecules are inferred from the alignments of the reads. The physical coverage of the large molecules is more consistent and less prone to coverage dropouts than that of the short read sequencing data. The sequences are cut at positions that have insufficient spanning molecules. Tigmint outputs a BED file of these cut points, and a FASTA file of the cut sequences.
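As a rough illustration of how molecule extents might be inferred from barcoded alignments, the sketch below merges reads that share a barcode and lie close together on the same sequence. It is not Tigmint's implementation: the function name `infer_molecules` and the tuple layout are hypothetical, the gap is measured from the end of one read to the start of the next as an assumption, and the defaults simply mirror the `dist` and `minsize` parameter values listed in this README.

```python
from collections import defaultdict


def infer_molecules(alignments, dist=50000, minsize=2000):
    """Merge barcoded alignments into molecule extents (illustrative sketch).

    alignments: iterable of (seq_name, start, end, barcode) tuples.
    Returns a list of (seq_name, start, end, barcode, n_reads) tuples.
    """
    # Group the alignments by sequence and barcode.
    grouped = defaultdict(list)
    for seq_name, start, end, barcode in alignments:
        grouped[(seq_name, barcode)].append((start, end))

    molecules = []
    for (seq_name, barcode), spans in grouped.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        n_reads = 1
        for start, end in spans[1:]:
            if start - cur_end > dist:
                # Gap too large (assumed rule): close this molecule, start a new one.
                molecules.append((seq_name, cur_start, cur_end, barcode, n_reads))
                cur_start, cur_end, n_reads = start, end, 1
            else:
                cur_end = max(cur_end, end)
                n_reads += 1
        molecules.append((seq_name, cur_start, cur_end, barcode, n_reads))

    # Discard molecules shorter than the minimum molecule size.
    return [m for m in molecules if m[2] - m[1] >= minsize]
```

With the defaults, two reads of barcode A that lie 2.9 kbp apart are merged into one molecule, while a third read of the same barcode 97 kbp further along starts a new molecule, which is then dropped for being shorter than `minsize`.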
Tigmint also allows the use of long reads from Oxford Nanopore Technologies. The long reads are segmented and assigned barcodes, and the following steps of the pipeline are the same as described above.
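The idea of segmenting a long read and giving its pieces a shared barcode can be sketched as follows. This is only an illustration under stated assumptions: the function name `segment_long_read`, the segment naming scheme, the use of the read index as the barcode, and the choice to drop a trailing partial segment are all hypothetical. The default segment length mirrors the `cut` parameter value listed in this README.

```python
def segment_long_read(sequence, read_index, cut=500):
    """Split one long read into cut-sized segments sharing one barcode.

    Returns a list of (segment_name, barcode, segment_sequence) tuples.
    A trailing piece shorter than `cut` is dropped (an assumption).
    """
    barcode = str(read_index)  # assumption: one barcode per long read
    segments = []
    for offset in range(0, len(sequence) - cut + 1, cut):
        name = f"{read_index}_{offset}"  # hypothetical naming scheme
        segments.append((name, barcode, sequence[offset:offset + cut]))
    return segments
```

Because every segment of a read carries the same barcode, downstream steps can treat the segments like linked reads that originate from one large molecule.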
Each window of a specified fixed size is checked for a minimum number of spanning molecules. Sequences are cut at those positions where a window with sufficient coverage is followed by one or more windows with insufficient coverage, which are in turn followed by a window with sufficient coverage.
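The window check described above can be sketched in a few lines. This is a simplified illustration, not the code of `tigmint-cut`: the function names are hypothetical, the windows are assumed to be non-overlapping, and the sketch reports the low-coverage region rather than choosing an exact cut position within it. The defaults mirror the `window` and `span` parameter values listed in this README.

```python
def count_spanning(molecules, start, end):
    """Count molecules (start, end) that cover the whole window [start, end]."""
    return sum(1 for m_start, m_end in molecules if m_start <= start and m_end >= end)


def find_cut_regions(molecules, seq_length, window=1000, span=20):
    """Return (start, end) regions of insufficient coverage that are flanked
    on both sides by windows with sufficient coverage (illustrative sketch)."""
    # Flag each non-overlapping window as sufficient or not (assumed tiling).
    sufficient = [
        count_spanning(molecules, w_start, w_start + window) >= span
        for w_start in range(0, seq_length - window + 1, window)
    ]

    regions = []
    run_start = None   # index of the first insufficient window in the current run
    seen_ok = False    # has a sufficient window been seen yet?
    for i, ok in enumerate(sufficient):
        if ok:
            if run_start is not None:
                # The run is closed by a sufficient window: report it.
                regions.append((run_start * window, i * window))
                run_start = None
            seen_ok = True
        elif seen_ok and run_start is None:
            run_start = i
    # A run that reaches the end of the sequence is never closed, so it is not reported.
    return regions
```

Low-coverage windows at the very start or end of a sequence are ignored in this sketch, because they are not flanked by well-covered windows on both sides.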
Install Linuxbrew on Linux or Windows Subsystem for Linux (WSL), or install Homebrew on macOS, and then run the command:
brew install tigmint
Alternatively, install Tigmint using Conda:
conda install -c bioconda tigmint
or using pip:
pip3 install tigmint
or run it using Docker:
docker run -it bcgsc/tigmint
Download and extract the source code, then run make in the src directory.
git clone https://github.com/bcgsc/tigmint && cd tigmint
cd src
make
or
curl -L https://github.com/bcgsc/tigmint/archive/master.tar.gz | tar xz && mv tigmint-master tigmint && cd tigmint
cd src
make
Install the Python package dependencies:
pip3 install intervaltree pybedtools pysam numpy
Tigmint uses Bedtools, BWA and Samtools. These dependencies may be installed using Homebrew on macOS or Linuxbrew on Linux.
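brew install bedtools bwa samtools
Minimap2 is used to align long reads:
brew tap brewsci/bio
brew install minimap2
ARCS and LINKS are used for scaffolding:
brew tap brewsci/bio
brew install arcs links-scaffolder
For calculating assembly metrics:
brew install abyss seqtk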
To run Tigmint on the draft assembly myassembly.fa with the reads myreads.fq.gz, which have been run through longranger basic:
tigmint-make tigmint draft=myassembly reads=myreads
- bwa mem -C is used to copy the BX tag from the FASTQ header to the SAM tags.
- samtools sort -tBX is used to sort first by barcode and then position.
To run Tigmint and then scaffold the corrected assembly with ARCS:
tigmint-make arcs draft=myassembly reads=myreads
To run Tigmint and ARCS, and calculate assembly metrics using the reference genome GRCh38.fa:
tigmint-make metrics draft=myassembly reads=myreads ref=GRCh38 G=3088269832
To run Tigmint with long reads in fasta or fastq format (myreads.fa.gz or myreads.fq.gz) on the draft assembly myassembly.fa for an organism with a genome size of gsize:
tigmint-make tigmint-long draft=myassembly reads=myreads span=auto G=gsize dist=auto
- minimap2 map-ont is used to align long reads from the Oxford Nanopore Technologies (ONT) platform, which is the default long-read input for Tigmint. To use PacBio long reads, specify the parameter longmap=pb.
- tigmint-make is a Makefile script, and so any make options may also be used with tigmint-make, such as -n (--dry-run).
- The file extension of the assembly must be .fa and the reads .fq.gz (or .fa.gz for long reads), and the extension is not included in the parameters draft and reads. These specific file name requirements result from implementing the pipeline in GNU Make.
- The minimum spanning molecules parameter (span) for tigmint-cut is heavily dependent on the sequence coverage of the linked or long reads provided. When running Tigmint with long reads, use span=auto and set G to your assembly organism's haploid genome size for this parameter to be calculated automatically, or explicitly set span to a specific number if you are interested in adjusting it. See Tips for more details.
- For tigmint-long, the maximum distance between reads threshold should be calculated automatically based on the read length distribution. This can be done by setting the parameter dist=auto.
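Because the draft and reads parameters are given without their file extensions, it can be convenient to derive the tigmint-make arguments from file names programmatically. The helper below is a hypothetical convenience wrapper, not part of Tigmint. It only builds the argument list from the naming rules stated above and does not run anything.

```python
def tigmint_make_args(target, draft_path, reads_path, **params):
    """Build a tigmint-make argument list from file names (hypothetical helper).

    The .fa and .fq.gz/.fa.gz extensions are validated and then stripped,
    since tigmint-make expects draft= and reads= without extensions.
    """
    if not draft_path.endswith(".fa"):
        raise ValueError("the draft assembly file name must end in .fa")

    if reads_path.endswith(".fq.gz"):
        reads = reads_path[: -len(".fq.gz")]
    elif reads_path.endswith(".fa.gz"):
        # .fa.gz reads are only described for long reads.
        if target != "tigmint-long":
            raise ValueError(".fa.gz reads are only accepted for tigmint-long")
        reads = reads_path[: -len(".fa.gz")]
    else:
        raise ValueError("the reads file name must end in .fq.gz or .fa.gz")

    draft = draft_path[: -len(".fa")]
    args = ["tigmint-make", target, f"draft={draft}", f"reads={reads}"]
    # Extra parameters such as span, G or dist are passed through as name=value.
    args += [f"{name}={value}" for name, value in params.items()]
    return args
```

For example, `tigmint_make_args("tigmint-long", "myassembly.fa", "myreads.fq.gz", span="auto", G="3e9", dist="auto")` yields the arguments of the long-read command shown earlier, with G set to 3e9. The resulting list could be passed to `subprocess.run` if desired.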
- tigmint: Run Tigmint, and produce a file named $draft.tigmint.fa
- tigmint-long: Run Tigmint using long reads, and produce a file named $draft.cut$cut.tigmint.fa
- arcs: Run Tigmint and ARCS, and produce a file named $draft.tigmint.arcs.fa
- metrics: Run Tigmint and ARCS, calculate assembly metrics using abyss-fac and abyss-samtobreak, and produce TSV files.
- draft: Name of the draft assembly, myassembly.fa
- reads: Name of the reads, myreads.fq.gz
- G: Haploid genome size of the draft assembly organism. Required to calculate span parameter automatically. Can be given as an integer or in scientific notation (e.g. '3e9' for human) [0]
- span=20: Number of spanning molecules threshold. Set span=auto to automatically select span parameter (currently only recommended for tigmint-long)
- cut=500: Cut length for long reads (tigmint-long only)
- longmap=ont: Long read platform; ont for Oxford Nanopore Technologies (ONT) long reads, pb for PacBio long reads (tigmint-long only)
- window=1000: Window size (bp) for checking spanning molecules
- minsize=2000: Minimum molecule size
- as=0.65: Minimum AS/read length ratio
- nm=5: Maximum number of mismatches
- dist=50000: Maximum distance (bp) between reads to be considered the same molecule. Set dist=auto to automatically calculate dist threshold based on read length distribution (tigmint-long only)
- mapq=0: Mapping quality threshold
- trim=0: Number of bases to trim off contigs following cuts
- t=8: Number of threads
- ac=3000: Minimum contig length (bp) for tallying attempted corrections. This is for logging purposes only, and will not affect the performance.
Parameters of ARCS:
- c=5
- e=30000
- r=0.05
Parameters of LINKS:
- a=0.1
- l=10
- ref: Reference genome, ref.fa, for calculating assembly contiguity metrics
- G: Size of the reference genome, for calculating NG50 and NGA50
- If your barcoded reads are in multiple FASTQ files, the initial alignments of the barcoded reads to the draft assembly can be done in parallel and merged prior to running Tigmint.
- When aligning linked reads with BWA-MEM, use the -C option to include the barcode in the BX tag of the alignments.
- Sort by BX tag using samtools sort -tBX.
- Merge multiple BAM files using samtools merge -tBX.
- When aligning long reads with Minimap2, use the -y option to include the barcode in the BX tag of the alignments.
- When using long reads, the minimum spanning molecule threshold (span) should be no greater than 1/4 of the sequence coverage. Setting the parameter span=auto allows the appropriate parameter value to be selected automatically (this setting requires the parameter G as well).
- When using long reads, the edit distance threshold (nm) is automatically set to the cut length (cut) to compensate for the higher error rate and length. This parameter should be kept relatively high to include as many alignments as possible.
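The rule of thumb for span with long reads can be turned into a small calculation. The sketch below estimates sequence coverage as total read bases divided by the haploid genome size and takes a quarter of it. The function name `suggest_span` and the floor of 2 are assumptions made for illustration, and this is not necessarily how span=auto computes its value.

```python
def suggest_span(total_read_bases, genome_size, min_span=2):
    """Suggest a span threshold of about 1/4 of the sequence coverage.

    genome_size may be an integer or a string in scientific notation,
    such as "3e9". min_span is a hypothetical lower bound.
    """
    size = int(float(genome_size))  # accepts 3088269832 as well as "3e9"
    if size <= 0:
        raise ValueError("genome size must be positive")
    coverage = total_read_bases / size
    # With very low coverage the floor can exceed 1/4 of the coverage.
    return max(min_span, int(coverage / 4))
```

For instance, 90 Gbp of long reads for a 3 Gbp genome gives roughly 30-fold coverage, and the sketch suggests a span of 7.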
To use stLFR linked reads with Tigmint, you will need to re-format the reads to have the barcode in a BX:Z: tag in the read header.
For example, this format
@V100002302L1C001R017000000#0_0_0/1 0 1
TGTCTTCCTGGACAGCTGACATCCCTTTTGTTTTTCTGTTTGCTCAGATGCTGTCTCTTATACACATCTTAGGAAGACAAGCACTGACGACATGATCACC
+
FFFFFFFGFGFFGFDFGFFFFFFFFFFFGFFF@FFFFFFFFFFFF@FFFFFFFFFGGFFEFEFFFF?FFFFGFFFGFFFFFFFGFFEFGFGGFGFFFGFF
should be changed to:
@V100002302L1C001R017000000 BX:Z:0_0_0
TGTCTTCCTGGACAGCTGACATCCCTTTTGTTTTTCTGTTTGCTCAGATGCTGTCTCTTATACACATCTTAGGAAGACAAGCACTGACGACATGATCACC
+
FFFFFFFGFGFFGFDFGFFFFFFFFFFFGFFF@FFFFFFFFFFFF@FFFFFFFFFGGFFEFEFFFF?FFFFGFFFGFFFFFFFGFFEFGFGGFGFFFGFF
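The header conversion shown above can be scripted. The following is a minimal sketch that assumes every stLFR header has the form `@NAME#BARCODE/1` or `@NAME#BARCODE/2`, optionally followed by extra fields, as in the example. The function names are hypothetical, and headers that do not match the pattern are passed through unchanged.

```python
import re

# @<read name>#<barcode>/<1 or 2>, optionally followed by whitespace and more fields
_STLFR_HEADER = re.compile(r"^@(\S+)#(\S+?)/[12](?:\s.*)?$")


def reformat_stlfr_header(header):
    """Move the stLFR barcode from the read name into a BX:Z: tag."""
    header = header.rstrip("\n")
    match = _STLFR_HEADER.match(header)
    if match is None:
        return header  # not in the assumed stLFR format; leave as-is
    read_name, barcode = match.groups()
    return f"@{read_name} BX:Z:{barcode}"


def reformat_stlfr_fastq(lines):
    """Yield FASTQ lines with every header (each 4th line) re-formatted.

    Only header lines are rewritten, because quality lines may also
    begin with '@'. Assumes four lines per record.
    """
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        yield reformat_stlfr_header(line) if index % 4 == 0 else line
```

Passing the lines of an opened FASTQ file (for example via `gzip.open(path, "rt")`) to `reformat_stlfr_fastq` yields the converted records line by line.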
After first checking for an existing issue at https://github.com/bcgsc/tigmint/issues, please report a new issue at https://github.com/bcgsc/tigmint/issues/new. Please report the names of your input files, the exact command line that you are using, and the entire output of Tigmint.