TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain
Download the latest release:
wget https://github.com/yangao07/TideHunter/releases/download/v1.0/TideHunter.v1.0.tar.gz
tar -zxvf TideHunter.v1.0.tar.gz
cd TideHunter; make
./bin/TideHunter ./test_data/test_50x4.fa > cons.fa
Or use the git clone command:
git clone https://github.com/yangao07/TideHunter.git --recursive
cd TideHunter; make
./bin/TideHunter ./test_data/test_50x4.fa > cons.fa
- Introduction
- Installation
- Getting started with toy example in test_data
- Usage
- Commands and options
- Input
- Output
- Contact
TideHunter is an efficient and sensitive tandem repeat detection and consensus calling tool designed for tandemly repeated long-read sequences (INC-seq, R2C2, NanoAmpli-Seq).
It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and is able to detect repeat patterns of any size.
TideHunter currently can only be built and run on Linux/Unix systems.
It is recommended to download the latest release of TideHunter from the release page.
wget https://github.com/yangao07/TideHunter/releases/download/v1.0/TideHunter.v1.0.tar.gz
tar -zxvf TideHunter.v1.0.tar.gz
cd TideHunter; make
Or, you can use the git clone command to download the source code. Do NOT forget the --recursive option.
This gives you the latest version of TideHunter, which might still be under development.
git clone https://github.com/yangao07/TideHunter.git --recursive
cd TideHunter; make
./bin/TideHunter ./test_data/test_1000x10.fa > cons.fa
./bin/TideHunter ./test_data/test_1000x10.fa > cons.fa
./bin/TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out
./bin/TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa
Usage: TideHunter [options] in.fa/fq > cons_out.fa
Options:
Seeding:
-k --kmer-length [INT] k-mer length (no larger than 16). [8]
-w --window-size [INT] window size. [1]
-s --step-size [INT] step size. [1]
-H --HPC-kmer use homopolymer-compressed k-mer. [False]
Tandem repeat criteria:
-c --min-copy [INT] minimum copy number of tandem-repeats. [2]
-e --max-diverg [FLT] maximum allowed divergence rate between two consecutive repeats. [0.25]
-p --min-period [INT] minimum period size of tandem repeat. (>=2) [30]
-P --max-period [INT] maximum period size of tandem repeat. (<=4294967295) [100K]
Adapter sequence:
-5 --five-prime [STR] 5' adapter sequence (sense strand). [NULL]
-3 --three-prime [STR] 3' adapter sequence (anti-sense strand). [NULL]
-a --ada-mat-rat [FLT] minimum match ratio of adapter sequence. [0.80]
Output:
-o --cons-out [STR] output file. [stdout]
-l --longest only output the consensus of the longest tandem repeat. [False]
-F --full-len only output the consensus that is full-length. [False]
-f --out-fmt [INT] output format. [1]
1: FASTA
2: Tabular
Computing resource:
-t --thread [INT] number of threads to use. [1]
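For scripted runs, the options above can be assembled programmatically. The following is a minimal Python sketch, not part of TideHunter: the helper names `build_tidehunter_cmd` and `run_tidehunter` are hypothetical, the binary path `./bin/TideHunter` is assumed from the build steps, and only flags listed in the usage text are emitted.

```python
import subprocess


def build_tidehunter_cmd(reads, binary="./bin/TideHunter", kmer=None,
                         min_copy=None, five_prime=None, three_prime=None,
                         out_fmt=None, threads=None, longest=False,
                         full_len=False):
    """Return an argv list for TideHunter (hypothetical helper).

    Only flags documented in the usage text are used; options left as
    None are omitted so TideHunter applies its own defaults.
    """
    # Constraints taken from the usage text: k <= 16, format is 1 or 2.
    if kmer is not None and not 1 <= kmer <= 16:
        raise ValueError("k-mer length must be no larger than 16")
    if out_fmt is not None and out_fmt not in (1, 2):
        raise ValueError("output format must be 1 (FASTA) or 2 (tabular)")

    cmd = [binary]
    valued = [("-k", kmer), ("-c", min_copy), ("-5", five_prime),
              ("-3", three_prime), ("-f", out_fmt), ("-t", threads)]
    for flag, value in valued:
        if value is not None:
            cmd += [flag, str(value)]
    if longest:
        cmd.append("-l")
    if full_len:
        cmd.append("-F")
    cmd.append(reads)  # the input file comes last, as in the usage line
    return cmd


def run_tidehunter(cmd, out_path):
    """Run the command and write stdout to out_path, like `> cons.fa`."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)
```

For example, `build_tidehunter_cmd("./test_data/test_1000x10.fa", out_fmt=2)` yields the same arguments as the tabular example command, and `run_tidehunter(cmd, "cons.out")` would execute it once the binary has been built.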
TideHunter works with FASTA, FASTQ, gzip-compressed FASTA (.fa.gz) and gzip-compressed FASTQ (.fq.gz) formats.
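Before launching a long run, it can help to confirm that an input file is in one of these formats. The sketch below is an independent Python check, not TideHunter's own input handling: `open_reads` and `sniff_format` are hypothetical helper names, gzip compression is detected from the standard gzip magic bytes, and the format is guessed from the first non-empty line only.

```python
import gzip


def open_reads(path):
    """Open a plain or gzip-compressed file in text mode.

    Compression is detected from the two gzip magic bytes rather than
    from the file extension.
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def sniff_format(path):
    """Return 'fasta' or 'fastq' based on the first non-empty line."""
    with open_reads(path) as fh:
        for line in fh:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            break  # first record line is neither FASTA nor FASTQ
    raise ValueError(f"{path} does not look like FASTA or FASTQ")
```

This only inspects the first record, so it will not catch problems further into the file.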
Additional adapter sequence files can be provided to TideHunter with the -5 and -3 options.
TideHunter uses adapter information to search for the full-length sequence from the generated consensus.
Once both adapters are found, TideHunter trims and reorients the consensus sequence.
By default, TideHunter outputs consensus sequences in FASTA format; it can also produce tabular output (-f 2).
For tabular format, 9 columns will be generated for each consensus sequence:
No. | Column name | Explanation |
---|---|---|
1 | readName | the original read name |
2 | consN | N is the ID number of the consensus sequence within the same read, starting from 0 |
3 | readLen | length of the original long-read |
4 | start | start coordinate of the tandem repeat, 1-based |
5 | end | end coordinate of the tandem repeat, 1-based |
6 | consLen | length of the consensus sequence |
7 | copyNum | copy number of the tandem repeat |
8 | fullLen | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |
9 | consensus | consensus sequence |
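
The nine columns above map naturally onto a small record type for downstream analysis. The following Python sketch is one way to read the tabular output; it assumes the columns are tab-separated and that copyNum may be fractional (it is parsed as a float, which also accepts integers). `ConsRecord` and `parse_tabular` are hypothetical names, not part of TideHunter.

```python
from typing import Iterable, Iterator, NamedTuple


class ConsRecord(NamedTuple):
    read_name: str   # column 1: readName
    cons_id: str     # column 2: consN, kept as text
    read_len: int    # column 3: readLen
    start: int       # column 4: start, 1-based
    end: int         # column 5: end, 1-based
    cons_len: int    # column 6: consLen
    copy_num: float  # column 7: copyNum (assumed numeric, maybe fractional)
    full_len: int    # column 8: fullLen (0, 1 or 2)
    consensus: str   # column 9: consensus sequence


def parse_tabular(lines: Iterable[str]) -> Iterator[ConsRecord]:
    """Yield one ConsRecord per non-empty line of tabular output.

    Assumes tab-separated fields; raises ValueError on any line that
    does not have exactly 9 columns.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(
                f"line {line_no}: expected 9 columns, got {len(fields)}")
        yield ConsRecord(fields[0], fields[1], int(fields[2]),
                         int(fields[3]), int(fields[4]), int(fields[5]),
                         float(fields[6]), int(fields[7]), fields[8])
```

With a file written by `-f 2`, something like `[r for r in parse_tabular(open("cons.out")) if r.full_len != 0]` would keep only the full-length consensus sequences.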
For FASTA output format, the read name contains detailed information about the detected tandem repeat, i.e., columns 1 to 8 above. The sequence is the consensus sequence.
The read name of each consensus sequence has the following format:
>readName_consN_readLen_start_end_consLen_copyNum_fullLen
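The header layout above can be split back into its eight fields. Because the original read name may itself contain underscores, the Python sketch below splits from the right so that only the last seven underscores act as separators. `parse_cons_header` is a hypothetical helper, and treating copyNum as a float is an assumption.

```python
def parse_cons_header(header):
    """Split a consensus FASTA header into its eight documented fields.

    Splitting from the right keeps underscores inside the original read
    name intact. Anything after the first whitespace is ignored.
    """
    tokens = header.strip().split()
    if not tokens:
        raise ValueError("empty header")
    name = tokens[0].lstrip(">")
    parts = name.rsplit("_", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 underscore-separated fields: {name!r}")
    read_name, cons_id, read_len, start, end, cons_len, copy_num, full_len = parts
    return {
        "readName": read_name,
        "consN": cons_id,
        "readLen": int(read_len),
        "start": int(start),        # 1-based
        "end": int(end),            # 1-based
        "consLen": int(cons_len),
        "copyNum": float(copy_num),  # assumed numeric, possibly fractional
        "fullLen": int(full_len),   # 0, 1 or 2
    }
```

For instance, a header such as `>my_read_1_cons0_5000_101_4100_400_10_1` (an invented example) would give `readName` equal to `my_read_1` and `consLen` equal to 400.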
Yan Gao yangao07@hit.edu.cn
Yadong Wang ydwang@hit.edu.cn
Yi Xing XINGYI@email.chop.edu