Ribbit is a toolkit to identify tandem repeats in genome sequences.

ribbit

Ribbit is a tool to identify tandem repeats of variable motif sizes. The algorithm converts DNA sequences to 2-bit format and uses basic bit operations to identify tandem repeat sequences.
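The idea of 2-bit encoding plus bit operations can be sketched as follows. This is only an illustration, not Ribbit's actual implementation: the mapping A=00, C=01, G=10, T=11, the 32-base limit, and the helper names `encode2bit` and `matchesAtShift` are all assumptions made for the example. It packs a short sequence into one 64-bit word, XORs the word with a copy of itself shifted by one motif length, and reads each zero bit pair as a base that equals the base one motif length earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack up to 32 bases into a 64-bit word, first base in the highest used bits.
// The mapping A=00, C=01, G=10, T=11 is an assumption for illustration.
inline std::uint64_t encode2bit(const std::string& seq) {
    std::uint64_t bits = 0;
    for (char base : seq) {
        std::uint64_t code = 0;
        switch (base) {
            case 'C': case 'c': code = 1; break;
            case 'G': case 'g': code = 2; break;
            case 'T': case 't': code = 3; break;
            default: code = 0; break;  // 'A' and, as a simplification, anything else
        }
        bits = (bits << 2) | code;
    }
    return bits;
}

// For each position j >= motifSize, report whether base j equals base j - motifSize.
// Returns an empty vector if the inputs fall outside what this sketch supports.
inline std::vector<bool> matchesAtShift(const std::string& seq, std::size_t motifSize) {
    std::vector<bool> result;
    const std::size_t n = seq.size();
    if (motifSize == 0 || motifSize >= n || n > 32) return result;

    const std::uint64_t bits = encode2bit(seq);
    // After the shift, the bit pair at the slot of base j holds base j - motifSize,
    // so a zero pair in the XOR means the two bases are identical.
    const std::uint64_t diff = bits ^ (bits >> (2 * motifSize));

    for (std::size_t j = motifSize; j < n; ++j) {
        const std::uint64_t pair = (diff >> (2 * (n - 1 - j))) & 3u;
        result.push_back(pair == 0);
    }
    return result;
}
```

For example, `matchesAtShift("ACACAT", 2)` compares positions 2 to 5 with the bases two positions earlier; the first three comparisons match and the last one (T against C) does not. Long runs of matches at a given shift suggest a tandem repeat with that motif size.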

Table of Contents

  1. Installation
  2. Usage
  3. Inputs and Outputs
  4. Citation
  5. Contact

Installation

To install Ribbit, clone the repository and install the dependencies using the following commands:

git clone https://github.com/SowpatiLab/ribbit
cd ribbit

Usage

Here’s a basic usage example:

 ./ribbit [options] -i sequence.fasta --output-file results.bed

To view detailed help information, run:
 ./ribbit -h
The output is as follows:

  -h [ --help ]                 Ribbit tool identifies short tandem repeats 
                                with allowed levels of impurity.
  -i [ --input-file ] arg       File path for the input fasta file.
  -o [ --output-file ] arg      File path for the output file.
  -m [ --min-motif-length ] arg The minimum length of the motif of the repeats 
                                to be identified. Default: 2
  -M [ --max-motif-length ] arg The maximum length of the motif of the repeats 
                                to be identified. Default: 100
  -p [ --purity ] arg           Threshold value for the continuous number of 
                                ones found in a seed. Default: 0.85
  -l [ --min-length ] arg       The minimum length of the repeat. Default: 12
  --min-units arg               The minimum number of units of the repeat. Can 
                                be an integer value for cutoff across all motif
                                sizes, or a tab-separated file with two columns: 
                                the first is the motif size and the second is 
                                the unit cutoff. Default: 2
  --perfect-units arg           The minimum number of complete units of the 
                                repeat. Can be an integer value for cutoff 
                                across all motif sizes, or a tab-separated file 
                                with two columns: the first is the motif size and 
                                the second is the unit cutoff. Default: 2
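The two-column cutoff file accepted by --min-units and --perfect-units can be pictured with a small reader. This is a hypothetical sketch, not code from Ribbit: the function name `readUnitCutoffs`, the choice to skip malformed lines, and the sample values in the usage note are assumptions. How Ribbit treats motif sizes that are missing from the file is not covered here.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: read "motif size <TAB> unit cutoff" lines into a map.
// Lines that do not start with two non-negative integers are skipped (an assumption).
inline std::map<std::size_t, std::size_t> readUnitCutoffs(std::istream& in) {
    std::map<std::size_t, std::size_t> cutoffs;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::size_t motifSize = 0;
        std::size_t minUnits = 0;
        if (fields >> motifSize >> minUnits) {
            cutoffs[motifSize] = minUnits;  // a later line overrides an earlier one
        }
    }
    return cutoffs;
}
```

A file with the lines `2<TAB>6` and `3<TAB>4` (illustrative values only) would yield a cutoff of 6 units for motifs of size 2 and 4 units for motifs of size 3.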

Inputs and Outputs

-i or --input-file

Expects: STRING (to be used as filename)

The input file must be a valid FASTA file.

-o or --output-file

Expects: STRING (to be used as filename)

Ribbit writes its output as a BED file.

BED file output columns

S.No  Column         Description
1     Chromosome     Chromosome or sequence name, as given by the first word in the FASTA header
2     Repeat Start   0-based start position of the repeat in the chromosome
3     Repeat Stop    End position of the repeat in the chromosome
4     Repeat Class   Class of the repeat, as grouped by cyclical variations of the motif
5     Repeat Length  Total length of the identified repeat in nt
6     Motif count    Number of complete motifs in the repeat
7     Purity         Purity of the repeat region (perfect repeat = 1)
8     Repeat Strand  Strand of the repeat, based on the cyclical variation of the motif
9     CIGAR          CIGAR string describing the type and position of imperfections

-m or --min-motif-length

The minimum length of the motif of the repeats to be identified.

-M or --max-motif-length

The maximum length of the motif of the repeats to be identified.

-p or --purity

Threshold value for the continuous number of ones found in a seed.

Bed file output example

Chromosome Start End Motif Motif Size Location Size Purity Strand CIGAR
Test_Seq 90196 90393 AC 2 197 0.949495 + 3=1X3=1X5=1D82=1X17=1X19=1X31=1I2=1X3=1X21=1I2=
Test_Seq 137451 137470 CCCGCT 6 19 1 + 19=
Test_Seq 136254 136401 GT 2 147 0.912752 + 6=1X9=1D20=1D15=1X12=1X5=1X25=1X9=1X7=1X5=1X9=1X10=1X2=1X2=
Test_Seq 139286 139306 AGTTGCTT 8 20 0.95 + 8=1X11=
Test_Seq 3538110 3538168 AATAGCAAGAGCCAGAGCTAGAGCAAAG 8 58 0.881356 + 4=1X1=2I30=1X9=1X5=1X1=1D2=
Test_Seq 4197438 4197487 CACAGCCAGCT 11 49 0.959184 + 26=1X12=1X9=
Test_Seq 4858037 4858050 CTCTTT 6 13 0.923077 + 6=1I6=
Test_Seq 5000704 5000745 TATTCGTATGCGTATTC 17 41 0.902439 + 4=1I22=1X4=2X7=
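The CIGAR column can be related to the Purity and size columns with a small parser. The relationship used here is inferred from the example rows above and is not taken from Ribbit's source: in every row shown, purity equals the number of `=` positions divided by the total length of all CIGAR operations, and the repeat span (End minus Start) equals that total minus the `D` operations. The type `CigarSummary` and the function `summarizeCigar` are hypothetical names for this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical summary of a CIGAR string that uses the '=', 'X', 'I' and 'D' operations.
struct CigarSummary {
    std::size_t matches = 0;     // '='
    std::size_t mismatches = 0;  // 'X'
    std::size_t insertions = 0;  // 'I'
    std::size_t deletions = 0;   // 'D'

    std::size_t total() const { return matches + mismatches + insertions + deletions; }

    // Assumption inferred from the example rows: 'D' does not occupy sequence positions.
    std::size_t span() const { return matches + mismatches + insertions; }

    // Assumption inferred from the example rows: purity = matches / total operations.
    double purity() const {
        return total() == 0 ? 0.0 : static_cast<double>(matches) / static_cast<double>(total());
    }
};

inline CigarSummary summarizeCigar(const std::string& cigar) {
    CigarSummary s;
    std::size_t run = 0;
    bool haveDigits = false;
    for (char c : cigar) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            run = run * 10 + static_cast<std::size_t>(c - '0');
            haveDigits = true;
            continue;
        }
        if (!haveDigits) throw std::invalid_argument("CIGAR operation without a length");
        switch (c) {
            case '=': s.matches += run; break;
            case 'X': s.mismatches += run; break;
            case 'I': s.insertions += run; break;
            case 'D': s.deletions += run; break;
            default: throw std::invalid_argument("unsupported CIGAR operation");
        }
        run = 0;
        haveDigits = false;
    }
    if (haveDigits) throw std::invalid_argument("trailing length without an operation");
    return s;
}
```

For the row with CIGAR `8=1X11=`, this gives 19 matches out of 20 operations (purity 0.95) and a span of 20, which agrees with the listed Start, End and Purity values.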

Citation

Please cite as follows:

Ribbit: Accurate identification and annotation of imperfect tandem repeat sequences in genomes

Akshay Kumar Avvaru, Anukrati Sharma, Divya Tej Sowpati
Journal:
doi:

Contact

For queries or suggestions, please contact:

Akshay Kumar Avvaru - avvaru@ccmb.res.in
Divya Tej Sowpati - tej@ccmb.res.in