TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain

Getting started

Download the latest release:

wget https://github.com/yangao07/TideHunter/releases/download/v1.0/TideHunter.v1.0.tar.gz
tar -zxvf TideHunter.v1.0.tar.gz
cd TideHunter; make
./bin/TideHunter ./test_data/test_50x4.fa > cons.fa

Or use git clone command:

git clone https://github.com/yangao07/TideHunter.git --recursive
cd TideHunter; make
./bin/TideHunter ./test_data/test_50x4.fa > cons.fa

Introduction

TideHunter is an efficient and sensitive tandem repeat detection and consensus calling tool designed for tandemly repeated long-read sequences (INC-seq, R2C2, NanoAmpli-Seq).

It works with Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing data at error rates up to 20% and is able to detect repeat patterns of any size.

Installation

Operating system

TideHunter can currently be built and run only on Linux/Unix systems.

Building TideHunter

It is recommended to download the latest release of TideHunter from the release page.

wget https://github.com/yangao07/TideHunter/releases/download/v1.0/TideHunter.v1.0.tar.gz
tar -zxvf TideHunter.v1.0.tar.gz
cd TideHunter; make

Alternatively, use the git clone command to download the source code. Do NOT forget the --recursive option. This gives you the latest version of TideHunter, which might still be under development.

git clone https://github.com/yangao07/TideHunter.git --recursive
cd TideHunter; make

Getting started with the toy example in test_data

./bin/TideHunter ./test_data/test_1000x10.fa > cons.fa

Usage

Generate consensus in FASTA format

./bin/TideHunter ./test_data/test_1000x10.fa > cons.fa

Generate consensus in tabular format

./bin/TideHunter -f 2 ./test_data/test_1000x10.fa > cons.out

Generate a full-length consensus

./bin/TideHunter -5 ./test_data/5prime.fa -3 ./test_data/3prime.fa ./test_data/full_length.fa > cons_full.fa

Commands and options

Usage:   TideHunter [options] in.fa/fq > cons_out.fa

Options:
    Seeding:
         -k --kmer-length [INT]    k-mer length (no larger than 16). [8]
         -w --window-size [INT]    window size. [1]
         -s --step-size   [INT]    step size. [1]
         -H --HPC-kmer             use homopolymer-compressed k-mer. [False]
    Tandem repeat criteria:
         -c --min-copy    [INT]    minimum copy number of tandem-repeats. [2]
         -e --max-diverg  [FLT]    maximum allowed divergence rate between two consecutive repeats. [0.25]
         -p --min-period  [INT]    minimum period size of tandem repeat. (>=2) [30]
         -P --max-period  [INT]    maximum period size of tandem repeat. (<=4294967295) [100K]
    Adapter sequence:
         -5 --five-prime  [STR]    5' adapter sequence (sense strand). [NULL]
         -3 --three-prime [STR]    3' adapter sequence (anti-sense strand). [NULL]
         -a --ada-mat-rat [FLT]    minimum match ratio of adapter sequence. [0.80]
    Output:
         -o --cons-out    [STR]    output file. [stdout]
         -l --longest              only output the consensus of the longest tandem repeat. [False]
         -F --full-len             only output the consensus that is full-length. [False]
         -f --out-fmt     [INT]    output format. [1]
                                       1: FASTA
                                       2: Tabular
    Computing resource:
         -t --thread      [INT]    number of threads to use. [1]
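
When TideHunter is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only flags listed in the help text above; the function name `build_tidehunter_cmd` and its parameter names are hypothetical and are not part of TideHunter itself.

```python
from typing import List, Optional


def build_tidehunter_cmd(
    reads: str,
    binary: str = "./bin/TideHunter",
    out_fmt: int = 1,
    threads: int = 1,
    five_prime: Optional[str] = None,
    three_prime: Optional[str] = None,
    full_len_only: bool = False,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble a TideHunter argument list (hypothetical helper)."""
    # -f accepts 1 (FASTA) or 2 (tabular) according to the help text.
    if out_fmt not in (1, 2):
        raise ValueError("out_fmt must be 1 (FASTA) or 2 (tabular)")
    cmd = [binary, "-f", str(out_fmt), "-t", str(threads)]
    if five_prime is not None:
        cmd += ["-5", five_prime]      # 5' adapter
    if three_prime is not None:
        cmd += ["-3", three_prime]     # 3' adapter
    if full_len_only:
        cmd.append("-F")               # keep only full-length consensus
    if output is not None:
        cmd += ["-o", output]          # write to a file instead of stdout
    cmd.append(reads)                  # the input file goes last
    return cmd


cmd = build_tidehunter_cmd(
    "./test_data/test_1000x10.fa", out_fmt=2, threads=4, output="cons.out"
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the Python standard library.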

Input

TideHunter works with FASTA, FASTQ, gzip'd FASTA (.fa.gz), and gzip'd FASTQ (.fq.gz) formats.
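
Before launching a long run, a script may want to check that an input file is in one of these formats. The sketch below is one possible pre-check, not a description of how TideHunter itself detects the format; it assumes that gzip'd files start with the standard gzip magic bytes and that FASTA and FASTQ files start with `>` and `@` respectively. The names `open_reads` and `sniff_format` are hypothetical.

```python
import gzip
from typing import IO


def open_reads(path: str) -> IO[str]:
    """Open a plain or gzip'd text file, chosen by the gzip magic bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def sniff_format(path: str) -> str:
    """Return 'fasta' or 'fastq' based on the first character of the file."""
    with open_reads(path) as fh:
        first = fh.read(1)
    if first == ">":
        return "fasta"
    if first == "@":
        return "fastq"
    raise ValueError(f"{path}: not recognised as FASTA or FASTQ")
```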

Adapter sequence

Adapter sequence files can be provided to TideHunter with the -5 and -3 options.

TideHunter uses adapter information to search for the full-length sequence from the generated consensus.

Once both adapters are found, TideHunter trims and reorients the consensus sequence.

Output

TideHunter outputs consensus sequences in FASTA format by default. It can also write tabular output (-f 2).

Tabular format

In tabular format, 9 columns are generated for each consensus sequence:

| No. | Column name | Explanation |
|-----|-------------|-------------|
| 1 | readName | the original read name |
| 2 | consN | N is the ID number of the consensus sequence from the same read, starting from 0 |
| 3 | readLen | length of the original long read |
| 4 | start | start coordinate of the tandem repeat, 1-based |
| 5 | end | end coordinate of the tandem repeat, 1-based |
| 6 | consLen | length of the consensus sequence |
| 7 | copyNum | copy number of the tandem repeat |
| 8 | fullLen | 0: not a full-length sequence, 1: sense strand full-length, 2: anti-sense strand full-length |
| 9 | consensus | consensus sequence |
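
The nine columns above can be loaded into a typed record for downstream analysis. This is a minimal sketch under two assumptions that the text does not state: the columns are tab-separated, and copyNum may be fractional (so it is parsed as a float). The `ConsensusRecord` type, the `parse_tabular_line` function, and the sample line are all hypothetical.

```python
from typing import NamedTuple


class ConsensusRecord(NamedTuple):
    read_name: str
    cons_id: str      # the consN column, e.g. "cons0"
    read_len: int
    start: int        # 1-based
    end: int          # 1-based
    cons_len: int
    copy_num: float   # assumption: may be fractional
    full_len: int     # 0, 1 (sense) or 2 (anti-sense)
    consensus: str


def parse_tabular_line(line: str) -> ConsensusRecord:
    """Parse one line of tabular output (assumed tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    name, cons_id, read_len, start, end, cons_len, copy_num, full_len, seq = fields
    return ConsensusRecord(
        name, cons_id, int(read_len), int(start), int(end),
        int(cons_len), float(copy_num), int(full_len), seq,
    )


# Made-up example line, for illustration only.
sample = "read1\tcons0\t500\t11\t490\t8\t59.9\t0\tACGTACGT\n"
record = parse_tabular_line(sample)
print(record.read_name, record.start, record.end, record.copy_num)
```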

FASTA format

In FASTA output format, the read name contains detailed information about the detected tandem repeat, i.e., columns 1 to 8 above. The sequence is the consensus sequence.

The read name of each consensus sequence has the following format:

>readName_consN_readLen_start_end_consLen_copyNum_fullLen
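
Because the fields in this header are joined with underscores, and original read names may themselves contain underscores, it is safer to split from the right. The following sketch assumes that the last seven fields never contain an underscore and that copyNum may be fractional; the function name `parse_cons_header` and the sample header are hypothetical.

```python
from typing import Dict, Union


def parse_cons_header(header: str) -> Dict[str, Union[str, int, float]]:
    """Split a consensus FASTA header into its eight fields."""
    # Split at most 7 times from the right so that underscores
    # inside the original read name are preserved.
    parts = header.strip().lstrip(">").rsplit("_", 7)
    if len(parts) != 8:
        raise ValueError(f"unexpected header format: {header!r}")
    name, cons_id, read_len, start, end, cons_len, copy_num, full_len = parts
    return {
        "readName": name,
        "consN": cons_id,
        "readLen": int(read_len),
        "start": int(start),
        "end": int(end),
        "consLen": int(cons_len),
        "copyNum": float(copy_num),  # assumption: may be fractional
        "fullLen": int(full_len),
    }


# Made-up header whose read name contains underscores.
info = parse_cons_header(">my_read_1_cons0_500_11_490_8_59.9_1")
print(info["readName"], info["consN"], info["fullLen"])
```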

Contact

Yan Gao yangao07@hit.edu.cn

Yadong Wang ydwang@hit.edu.cn

Yi Xing XINGYI@email.chop.edu

GitHub issues