mhalushka/miRge3.0

SNP fasta file (human_mirna_SNP_pseudo_miRBase.fa)

Opened this issue · 11 comments

Hi, I am wondering how to generate "human_mirna_SNP_pseudo_miRBase.fa" in the "fasta.Libs" directory. It differs from "human_mature_miRBase.fa" in more than the SNP suffix: many miRNA IDs are different between the two files. For example, hsa-miR-5190-5p and hsa-miR-5190-3p are found in "human_mirna_SNP_pseudo_miRBase.fa", while "human_mature_miRBase.fa" contains only the unsuffixed ID (hsa-miR-5190). There are many other examples like this. Could you explain the details? Could you share the code used to generate this file?
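
For reference, the ID differences described above can be listed with a short script. This is only a minimal sketch: it assumes both files are plain FASTA and that SNP variants carry a trailing ".SNP" plus a letter (as in ".SNPC"), which may not match the actual headers. The function names and the placeholder sequences are hypothetical.

```python
import re

def fasta_ids(lines):
    """Collect record IDs (first token after '>') from FASTA lines."""
    return {ln[1:].split()[0] for ln in lines if ln.startswith(">") and ln[1:].strip()}

def strip_snp_suffix(seq_id):
    # Assumption: SNP records end in ".SNP" plus letters, e.g. ".SNPC" or ".SNPA".
    return re.sub(r"\.SNP[A-Z]*$", "", seq_id)

def compare_ids(mature_lines, snp_lines):
    """Return (IDs only in the mature file, IDs only in the SNP file)."""
    mature = fasta_ids(mature_lines)
    snp = {strip_snp_suffix(i) for i in fasta_ids(snp_lines)}
    return sorted(mature - snp), sorted(snp - mature)

# Placeholder records; the sequences are NOT real miRNA sequences.
mature_demo = [">hsa-miR-5190", "ACGTACGT"]
snp_demo = [">hsa-miR-5190-5p.SNPC", "ACGTACGT", ">hsa-miR-5190-3p.SNPC", "TGCATGCA"]

only_mature, only_snp = compare_ids(mature_demo, snp_demo)
print("only in mature file:", only_mature)
print("only in SNP file:", only_snp)
```

With real files, the lists passed in would come from something like `open(path).read().splitlines()`.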

Hi @taeyoungh ,

I think what you are looking for is SNP annotation in the miRBase library, is that correct? Before I answer your question, I would like to mention that we are revising and updating the libraries, including MirGeneDB.
Now, concerning the suffixes: the database annotations are not entirely reflected in SNP_pseudo, and some miRNAs have these discrepancies because miRBase updated its versions individually without any evident change log. These are a few of the things we are fixing now, along with software updates. I hope this helps.

Having said that, please let us know what exactly you are looking for. The SNP_pseudo file doesn't reflect SNPs for reporting miRNA changes (in terms of counts and RPM); it is used to report A-to-I editing. If you are interested in incorporating SNPs into miRBase for alignment and annotation, then you need to edit the index files. Let me know if this is what you are looking for, and I can assist you in setting up your own custom SNPs.

Thank you,
Arun.

Hi @arunhpatil ,
I am asking about this file because it seems that it was used to generate the bowtie index. For example, when I looked into the bowtie index using bowtie-inspect, I found "hsa-miR-5190-3p". This miRNA is also found in "human_mirna_SNP_pseudo_miRBase.fa" but not in "human_mature_miRBase.fa", which instead has a record for "hsa-miR-5190". I guess that you somehow added "-3p" and "-5p" to "hsa-miR-5190" when generating this file. Am I understanding this correctly? I want to understand how you generated the bowtie index files.

Also, I have a question about the genomic coordinates in the bowtie index. The output of bowtie-inspect provides a genomic coordinate for every miRNA. For example, the header of the bowtie index has a record "hsa-miR-1973 chr8 segs:1-21 cds:+:76202058-76202078", but I cannot find this miRNA in the gff3 file in the annotation folder. In this case, how did you assign genomic coordinates to this kind of miRNA?

Thanks for your help!

@taeyoungh,

Thank you for pointing out these IDs. I will have to get back to you on these questions. With regard to the coordinates, I derive them from the GFF file, which does not seem to be the case here. This is very helpful for troubleshooting logical errors. I very much appreciate you bringing this to our attention, and I will get back to you shortly on this one.

Thank you,
Arun.

Hi @taeyoungh ,

The additional 5p or 3p miRNAs were added based on our previous detection of miRNA reads in the genomic loci of the annotated pri-miRNAs from miRBase. These reads were part of the "Toward the human cellular microRNAome" study, and since then we have retained these passenger miRNAs as part of our miRge library.

A small percent (394, 0.7%) were identified in more than 50 samples. Additionally, 207 were the unassigned “passenger” 5p or 3p microRNAs from a known microRNA locus, and 15 were orthologous to a different species’ microRNA (primarily primate) (Supplemental Tables S8, S9).

Regarding the coordinates, I believe it is an error and I will correct them soon. Once again, thank you for bringing this to our attention.

I hope this is helpful.

Thank you,
Arun.

Hello, I mainly use the -ai module in your software. I want to use this module to identify miRNA editing sites in Sus_scrofa. I used the "miRge build" you provide to create a new library, but I don't know how to create the two custom files, "human_mirna_SNP_pseudo_miRBase.fasta" and "human_miRNAs_in_repetitive_element_MirGeneDB.csv". What are the rules for creating these two files? I am looking forward to your reply. Thank you.

Hi @565755044,

These files have to be created manually; the rules for this are described in the miRge paper. You can download the repetitive elements from the UCSC Genome Browser (select Tools -> Table Browser and, under group, select Repeats for the appropriate genome assembly). miRNAs overlapping these repeat elements should be recorded in the csv file as the miRNA name followed by the repeat element name.

For example:
Hsa-Mir-28-P1_5p*,gene_id "L2c"; transcript_id "L2c_dup8856";
This miR-28 overlaps with L2c genomic coordinates.
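
The overlap step can be sketched as follows. This is a rough illustration rather than the method used to build the shipped csv: it assumes the miRNA and repeat coordinates have already been parsed into tuples, treats intervals as closed, and ignores strand. The function names and all coordinates are hypothetical placeholders.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end

def repeat_rows(mirnas, repeats):
    """Format one csv row per miRNA/repeat overlap.

    mirnas:  iterable of (name, chrom, start, end)
    repeats: iterable of (chrom, start, end, gene_id, transcript_id)
    """
    rows = []
    for name, m_chrom, m_start, m_end in mirnas:
        for r_chrom, r_start, r_end, gene_id, transcript_id in repeats:
            if m_chrom == r_chrom and overlaps(m_start, m_end, r_start, r_end):
                rows.append(f'{name},gene_id "{gene_id}"; transcript_id "{transcript_id}";')
    return rows

# Placeholder coordinates, NOT the real genomic positions.
mirnas_demo = [("Hsa-Mir-28-P1_5p*", "chr3", 100, 121)]
repeats_demo = [
    ("chr3", 90, 300, "L2c", "L2c_dup8856"),  # overlaps the miRNA above
    ("chr3", 500, 600, "MIR", "MIR_dup1"),    # same chromosome, no overlap
]

rows = repeat_rows(mirnas_demo, repeats_demo)
for row in rows:
    print(row)
```

The nested loop is fine for a few thousand miRNAs against one chromosome's repeats; for whole-genome repeat tables a sorted sweep or an interval-overlap tool would be a better fit.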

Note the rules: for repeats, you should have the miR name, separated by a comma from the gene_id and transcript_id, which are separated by a semicolon. For SNPs, the canonical miRNA should be named (in the header) with _SNPC, and any alternative mature sequence with a SNP should be denoted by the suffixes _A, _B, _D, etc. in the FASTA file. Also, this FASTA file should be indexed (i.e., run bowtie-build on the new FASTA file).

For example:
>Hsa-Mir-28-P1_3p.SNPC (This is canonical miRNA sequence hence _C)
CACTAGATTGTGAGCTCCTGGA
>Hsa-Mir-28-P1_3p.SNPA
CACTAGATTGTGAGTTCCTGGA
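
The naming rule above can be sketched in a few lines. This is an illustration under assumptions, not the script used for the shipped libraries: it follows the ".SNPC" / ".SNPA" header form from the example, assigns variant letters in alphabetical order while skipping C, and uses hypothetical function names.

```python
from string import ascii_uppercase

def snp_records(name, canonical, variants):
    """Return (header, sequence) pairs: canonical as .SNPC, variants as .SNPA, .SNPB, .SNPD, ..."""
    letters = [c for c in ascii_uppercase if c != "C"]  # C is reserved for the canonical sequence
    if len(variants) > len(letters):
        raise ValueError("more variants than available suffix letters")
    records = [(f"{name}.SNPC", canonical)]
    for letter, seq in zip(letters, variants):
        records.append((f"{name}.SNP{letter}", seq))
    return records

def to_fasta(records):
    """Serialize (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)

# Sequences taken from the example above.
fasta_text = to_fasta(
    snp_records("Hsa-Mir-28-P1_3p", "CACTAGATTGTGAGCTCCTGGA", ["CACTAGATTGTGAGTTCCTGGA"])
)
print(fasta_text, end="")
```

The resulting text would be written to the SNP FASTA file, which then still needs to be indexed with bowtie-build as noted above.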

I hope this is helpful. @mhalushka can add anything I have missed.

Thank you,
Arun.

Hi @565755044,

Do you know of any repository for Sus scrofa with known SNPs? If not, can you try copying your miRBase-generated FASTA file as a SNP FASTA file (shown below)?

Replace the human file names with those of your Sus scrofa database.
cp human_mature_miRBase.fa human_mirna_SNP_pseudo_miRBase.fa

This will trick the software, but you will still get A-to-I editing hits. You can then see which edits are most abundant and check whether they are true positives. Also, don't worry about the repeats for A-to-I (as far as I remember, they are not connected).

I hope this helps.

Thank you.
Arun

Hi @565755044,

There may not be enough editing sites. Can you share your findings? Modifying the source code may take time; instead, if you can state what you are aiming for, I can help figure out an alternative method.

Thank you,
Arun.