mhalushka/miRge3.0

UMI output files


Hi — is there an output file that is equivalent to miR.Counts.csv but contains UMI counts instead of read counts for each miRNA?

@francois-a,

Yes and no. No output file links UMI counts directly to miRNA names, but if you know the miRNA sequence (from miRBase or MirGeneDB) you can link the two as described below:

Command used here, for an input file with a 4 bp UMI at each end of the reads:
miRge3.0 -s SRR5233961.fastq -lib /mnt/d/Halushka_lab/Arun/GTF_Repeats_miRge2to3/miRge3_Lib/revised_hsa -on human -db mirgenedb -umi 4,4 -a illumina -udd -o output_dir

Note: the -udd option is important. It removes PCR duplicates and writes out each UMI sequence, the corresponding miRNA sequence, and the number of times that combination occurs.

For example, with a UMI of 4 bases on each end of the reads, here is the entry for let-7a:

grep -w "TGAGGTAGTAGGTTGTATAGTT" output_dir/miRge.2021-12-12_17-20-41/mapped.csv
TGAGGTAGTAGGTTGTATAGTT,1,Hsa-Let-7-P2a1_5p,,,,,,,,,6741

and the matching rows in the UMI counts file (first 10 shown):
grep ",TGAGGTAGTAGGTTGTATAGTT," output_dir/miRge.2021-12-12_17-20-41/SRR5233961_umiCounts.csv | head
AGTGCTAC,TGAGGTAGTAGGTTGTATAGTT,15
GTTCCTAC,TGAGGTAGTAGGTTGTATAGTT,8
CCATCTAC,TGAGGTAGTAGGTTGTATAGTT,15
TTAGTGGG,TGAGGTAGTAGGTTGTATAGTT,2
TACCCTAC,TGAGGTAGTAGGTTGTATAGTT,835
AAACCTGA,TGAGGTAGTAGGTTGTATAGTT,7
NACCCTAC,TGAGGTAGTAGGTTGTATAGTT,19
NTAGCTCA,TGAGGTAGTAGGTTGTATAGTT,1
GAGCCTAC,TGAGGTAGTAGGTTGTATAGTT,45
GGACCCTA,TGAGGTAGTAGGTTGTATAGTT,5

Details: in the first row, AGTGCTAC,TGAGGTAGTAGGTTGTATAGTT,15, AGTG is the first four bases of the UMI and CTAC is the last four, and this UMI and sequence combination occurs 15 times.

grep -c ",TGAGGTAGTAGGTTGTATAGTT," output_dir/miRge.2021-12-12_17-20-41/SRR5233961_umiCounts.csv
6741

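You can use this file to determine the UMI counts: the number of rows for a sequence is its number of distinct UMIs, which here matches the 6741 reported for that sequence in mapped.csv. Let me know if you need more clarification.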
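To repeat the lookup above for every miRNA at once, something like the following Python sketch could work. It is not part of miRge3.0, and the function name umi_counts_per_mirna is made up for illustration. It assumes the layouts seen in the example rows: UMI, sequence, and count in the umiCounts file, and in mapped.csv the sequence in the first column and the miRNA name in the third. Only sequences that have a name in that third column are counted.

```python
import csv
from collections import defaultdict


def umi_counts_per_mirna(umi_counts_path, mapped_path):
    """Return {miRNA name: number of distinct UMIs} summed over its sequences.

    Assumed layouts (inferred from the example rows in this thread):
      umiCounts file: UMI,sequence,count
      mapped.csv:     sequence,<flag>,miRNA name,...
    """
    # One row per (UMI, sequence) pair, so counting rows per sequence
    # gives the same number as `grep -c` does for a single sequence.
    umis_per_seq = defaultdict(int)
    with open(umi_counts_path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) < 3 or not row[2].strip().isdigit():
                continue  # skip a header or malformed line, if present
            umis_per_seq[row[1]] += 1

    # Attach miRNA names by exact sequence match against mapped.csv.
    totals = defaultdict(int)
    with open(mapped_path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) < 3 or not row[2]:
                continue  # no miRNA name in the assumed name column
            sequence, name = row[0], row[2]
            if sequence in umis_per_seq:
                totals[name] += umis_per_seq[sequence]
    return dict(totals)
```

Calling umi_counts_per_mirna("output_dir/miRge.2021-12-12_17-20-41/SRR5233961_umiCounts.csv", "output_dir/miRge.2021-12-12_17-20-41/mapped.csv") returns a dictionary keyed by miRNA name. Check the column positions against your own files before relying on the result.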

Thank you,
Arun.