Ribbit is a tool to identify tandem repeats of variable motif sizes. The algorithm
converts DNA sequences to 2-bit format and uses basic bit operations to identify tandem repeat sequences.
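To make the idea of 2-bit encoding plus bit operations concrete, here is a minimal Python sketch. It is not Ribbit's actual implementation: the encoding table (`A=00, C=01, G=10, T=11`) and the helper names `encode_2bit` and `shift_matches` are assumptions chosen for illustration. It packs a sequence into an integer and XORs it with a copy shifted by one motif length, so that every position where the sequence agrees with itself one motif further along shows up as a zero pair of bits.

```python
# Illustrative sketch only; the encoding table and function names are
# assumptions, not Ribbit's actual implementation.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_2bit(seq):
    """Pack a DNA string into one integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODES[base]  # raises KeyError on N or other symbols
    return value


def shift_matches(seq, motif_len):
    """Return booleans where element i is True when seq[i] == seq[i + motif_len]."""
    n = len(seq)
    bits = encode_2bit(seq)
    # Shifting right by motif_len bases aligns base i with base i + motif_len.
    # XOR leaves 00 in every 2-bit slot where the two aligned bases are identical.
    diff = bits ^ (bits >> (2 * motif_len))
    matches = []
    for i in range(n - motif_len):
        slot = n - 1 - (i + motif_len)  # slot of base i + motif_len, counted from the low end
        matches.append(((diff >> (2 * slot)) & 0b11) == 0)
    return matches


print(shift_matches("ACACACGT", 2))
# [True, True, True, True, False, False]
```

A long run of `True` values at shift `m` suggests a tandem repeat with motif length `m`; occasional `False` values inside the run correspond to impurities.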
To install Ribbit, clone the repository and install the dependencies using the following commands:
git clone https://github.com/SowpatiLab/ribbit
cd ribbit
Here’s a basic usage example:
./ribbit [options] -i sequence.fasta -o results.bed
./ribbit -h
-h [ --help ]                  Ribbit tool identifies short tandem repeats with allowed levels of impurity.
-i [ --input-file ] arg        File path for the input fasta file.
-o [ --output-file ] arg       File path for the output file.
-m [ --min-motif-length ] arg  The minimum length of the motif of the repeats to be identified. Default: 2
-M [ --max-motif-length ] arg  The maximum length of the motif of the repeats to be identified. Default: 100
-p [ --purity ] arg            Threshold value for the continuous number of ones found in a seed. Default: 0.85
-l [ --min-length ] arg        The minimum length of the repeat. Default: 12
--min-units arg                The minimum number of units of the repeat. Can be an integer value for cutoff across all motif sizes, or a tab-separated file with two columns: the first is the motif size and the second is the unit cutoff. Default: 2
--perfect-units arg            The minimum number of complete units of the repeat. Can be an integer value for cutoff across all motif sizes, or a tab-separated file with two columns: the first is the motif size and the second is the unit cutoff. Default: 2
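The help text says `--min-units` accepts either an integer or a two-column tab-separated file (motif size, unit cutoff). The Python sketch below writes such a file and assembles a matching command line. The cutoff values, the file name `unit_cutoffs.tsv`, and the helper names are hypothetical, and the sketch assumes the file carries no header line, which the help text does not specify.

```python
import os
import tempfile


def write_unit_cutoffs(path, cutoffs):
    """Write one 'motif_size<TAB>min_units' line per entry, sorted by motif size."""
    with open(path, "w") as handle:
        for motif_size in sorted(cutoffs):
            handle.write(f"{motif_size}\t{cutoffs[motif_size]}\n")


def build_command(fasta, bed, min_units, purity=0.85):
    """Assemble an argument list using only options listed in the help text."""
    return ["./ribbit", "-i", fasta, "-o", bed,
            "--min-units", str(min_units), "-p", str(purity)]


# Hypothetical cutoffs: require more units for shorter motifs.
cutoffs = {2: 6, 3: 4, 4: 3}
cutoff_path = os.path.join(tempfile.mkdtemp(), "unit_cutoffs.tsv")
write_unit_cutoffs(cutoff_path, cutoffs)

command = build_command("sequence.fasta", "results.bed", cutoff_path)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the `ribbit` binary and the input FASTA file are in place.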
-i or --input-file
Expects: STRING (to be used as filename)
The input file must be a valid FASTA file.
-o or --output-file
Expects: STRING (to be used as filename)
Ribbit writes its output as a .bed file with the following columns:
S.No | Column | Description |
---|---|---|
1 | Chromosome | Chromosome or Sequence Name as specified by the first word in the FASTA header |
2 | Repeat Start | 0-based start position of SSR in the Chromosome |
3 | Repeat Stop | End position of SSR in the Chromosome |
4 | Repeat Class | Class of repeat as grouped by their cyclical variations |
5 | Repeat Length | Total length of identified repeat in nt |
6 | Motif count | Number of complete motifs in the STR |
7 | Purity | Purity of STR region (perfect STR = 1) |
8 | Repeat Strand | Strand of SSR based on their cyclical variation |
9 | CIGAR | CIGAR string describing the imperfections in the repeat |
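For downstream processing, the output can be read line by line. The sketch below is one possible way to do that in Python, assuming the file is tab-separated and that the end coordinate is exclusive, as in standard BED. Only the first three columns are interpreted; the remaining columns are kept as raw strings. The `RepeatRecord` type and `parse_bed_line` helper are hypothetical names, and the sample line is modeled on this README's example output.

```python
from typing import List, NamedTuple


class RepeatRecord(NamedTuple):
    chrom: str
    start: int         # 0-based start position
    end: int           # end position (assumed exclusive, as in standard BED)
    extra: List[str]   # remaining columns, left unparsed

    @property
    def span(self):
        """Number of bases covered on the sequence."""
        return self.end - self.start


def parse_bed_line(line):
    """Split one tab-separated output line into a RepeatRecord."""
    fields = line.rstrip("\n").split("\t")
    return RepeatRecord(fields[0], int(fields[1]), int(fields[2]), fields[3:])


record = parse_bed_line("Test_Seq\t137451\t137470\tCCCGCT\t6\t19\t1\t+\t19=\n")
print(record.chrom, record.start, record.end, record.span)
# Test_Seq 137451 137470 19
```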
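-m or --min-motif-length
The minimum length of the motif of the repeats to be identified. Default: 2
-M or --max-motif-length
The maximum length of the motif of the repeats to be identified. Default: 100
-p or --purity
Threshold value for the continuous number of ones found in a seed. Default: 0.85

Example output:
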
Chromosome | Start | End | Motif | Motif Size | Location Size | Purity | Strand | CIGAR |
---|---|---|---|---|---|---|---|---|
Test_Seq | 90196 | 90393 | AC | 2 | 197 | 0.949495 | + | 3=1X3=1X5=1D82=1X17=1X19=1X31=1I2=1X3=1X21=1I2= |
Test_Seq | 137451 | 137470 | CCCGCT | 6 | 19 | 1 | + | 19= |
Test_Seq | 136254 | 136401 | GT | 2 | 147 | 0.912752 | + | 6=1X9=1D20=1D15=1X12=1X5=1X25=1X9=1X7=1X5=1X9=1X10=1X2=1X2= |
Test_Seq | 139286 | 139306 | AGTTGCTT | 8 | 20 | 0.95 | + | 8=1X11= |
Test_Seq | 3538110 | 3538168 | AATAGCAAGAGCCAGAGCTAGAGCAAAG | 8 | 58 | 0.881356 | + | 4=1X1=2I30=1X9=1X5=1X1=1D2= |
Test_Seq | 4197438 | 4197487 | CACAGCCAGCT | 11 | 49 | 0.959184 | + | 26=1X12=1X9= |
Test_Seq | 4858037 | 4858050 | CTCTTT | 6 | 13 | 0.923077 | + | 6=1I6= |
Test_Seq | 5000704 | 5000745 | TATTCGTATGCGTATTC | 17 | 41 | 0.902439 | + | 4=1I22=1X4=2X7= |
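The CIGAR strings in the rows above can be summarized with a few lines of Python. In this sketch, `=` is read as a match, `X` as a mismatch, `I` as an insertion, and `D` as a deletion. The two relationships it uses are inferred from the example rows rather than stated by the tool's documentation, so treat them as assumptions: the repeat length equals the sum of all operations except `D`, and the purity equals the number of matches divided by the sum of all operations.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([=XID])")


def cigar_counts(cigar):
    """Total bases per operation type in a CIGAR string."""
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] += int(length)
    return counts


def cigar_summary(cigar):
    """Return (span, purity) under the assumptions described in the lead-in."""
    counts = cigar_counts(cigar)
    total = sum(counts.values())
    span = total - counts["D"]       # deletions do not occupy sequence positions
    purity = counts["="] / total     # fraction of operations that are matches
    return span, purity


for cigar in ("8=1X11=", "6=1I6=", "4=1X1=2I30=1X9=1X5=1X1=1D2="):
    span, purity = cigar_summary(cigar)
    print(cigar, span, round(purity, 6))
# 8=1X11= 20 0.95
# 6=1I6= 13 0.923077
# 4=1X1=2I30=1X9=1X5=1X1=1D2= 58 0.881356
```

These values agree with the length and purity columns of the corresponding example rows.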
Please cite as follows:

Ribbit: Accurate identification and annotation of imperfect tandem repeat sequences in genomes
Akshay Kumar Avvaru, Anukrati Sharma, Divya Tej Sowpati
Journal:
doi:

For queries or suggestions, please contact:

Akshay Kumar Avvaru - avvaru@ccmb.res.in
Divya Tej Sowpati - tej@ccmb.res.in