Question about SNPs and Library

Question

Question about SNPs and Library

Opened this issue a year ago · 9 comments

Hello, I have several questions.

Q1: In your custom human library, miRNAs of SNPs are represented and merged together (after alignment and counting). How did you decide on which SNPs to represent? Looking at NCBI browser, it is easy for me to identify additional SNPs. And when I look at the SNP frequency, it isn't quite clear if your only merging SNPs that are common (e.g. MAF > 0.01 in some population) or using some other criteria. Could you elaborate on how you decided on what SNPs to represent in the mirge3 human library?

We are interested in detecting miRNAs with 3 nucleotides (NT) on the 5'-end. Probably the best way to do this would be to extend your custom library by 1 NT on the 5'-end (as you already extended the 5'-end by 2 NT). Thus,

Q2) Can I simply use bowtie-inspect to convert the custom library into a human-readable .fasta file, add the appropriate 5' NT to the custom library, and then convert it back with the default bowtie-build command?

Naturally, I can not depend on the "exact" and "isomiR" identifications provided in the output. I will need to write my own script to loop through the mapped.csv and capture and 3-NT isomiRs on the 5'-end (if they exist).

Q3) What assembly was used for mirge3. It's hg38 but I need to know the exact version (e.g. GCA_000001405.28 or GenBank version of GRCh38.p13). When I check how well the custom fasta miRBase entries match with different assemblies... I am not getting an exact match for many of the miRNAs.

Thank you!
-John

Answer 1 · 2023-05-18T13:42:34.000Z

John-
I will try to answer these.
Q1) In the original miRge (1.0) we used the microRNASNPdb located in the mature regions. That website (http://www.bioguo.org/miRNASNP/) no longer exists, but I believe that is the source of what we used. The most common, important SNP, that we noted was in miR-28-3p.
Q2) I would suspect that 3 NT modifications of the 5' end of bona fide miRNAs are rare to non-existent, unless there is a specific disease process or manipulation occurring. I would also be concerned about claims relative to adapter sequences there, especially if they are non-template nucleotides. Nonetheless, you can create your own custom library starting with the human .fasta file. Before you go to all of this trouble, have you done a grep search or similar on the small RNA-seq fastq file to identify any of these 3 NT modified miRNAs? It may be good to, a priori, know what you might find.

Answer 2 · 2023-05-18T15:58:40.000Z

Hi @John-Drake,

Thanks @mhalushka.

Q1) I agree with Marc's suggestion on testing the 3 nt additions at 5'.

Q2) Yes. You can do that, here is an example append explained similar to the process you discussed.

Now regarding the exact and isomiR, if you have reads with and without SNPs and you also have miR sequence indexed. Then you will be able to capture that as exact rather than as isomiR. I hope this is helpful. If you are conserned about the naming convention and want to retain the .SNP in the miRNA names, then you have to truncate the file "human_merges_miRBase.csv"

Usage:
truncate -s 0 human_merges_miRBase.csv
This file is located in the library: miRge_Lib/human/annotation.Libs

Q3) I will get back to you on the exact version since the build is archived in another machine which I used back then.

Thank you,
Arun.

Answer 3 · 2023-05-21T23:01:07.000Z

@arunhpatil

If it is more convenient for you, you could simply upload the assembly to a dropbox (If you renamed the file, I understand it may be difficult to track down the exact version used).

Thank you!
-John

Answer 4 · 2023-05-21T23:58:26.000Z

Hi @John-Drake,

I believe it is GCA_000001405.15 from GRCh38. It is what is annotated in miRBase gff3 (NCBI_Assembly:GCA_000001405.15) and confirmed by the lengths of each chromosome from the genome (file attached).
chrs_genome_human.txt

Thank you,
Arun.

Answer 5 · 2023-05-22T06:41:52.000Z

@arunhpatil

I am still getting mismatch alignments. But first, let me explain what I am doing.

Generating fasta file from MirBase Annotation
bowtie-1.3.0-linux-x86_64/bowtie-inspect miRge3_Lib/human/index.Libs/human_mirna_miRBase > miRBase_annotations.fasta
Extracting information of each miRNA (chromosome, start, stop, strand), which includes the 2 NT appended to the 5'-end and the 6 NT appended to the 3'-end.
Download GCA_000001405.15 from here and index the assembly file with samtools faidx command. Care was taken to ensure I downloaded the GenBank version (RefSeq and GenBank should be the same for this release, but I am being extra careful).
Comparing the nucleotides between the miRBase_annotations.fasta and what is found in the assembly. To compare negative strand miRNAs, sequence is reversed. To get the assembly information, I use the following command:

samtools faidx GCA_000001405.26_GRCh38.p11_genomic.fna CM000671.2:94175960-94175989

Naturally, I am doing this in a script that loops through all entries in the miRBase_annotations.fasta file.

For most miRNA (90%+), everything matches as one would expect (positive and negative strand). But often, I am finding inconsistencies on the 5' and 3'-end. For instance:

`ERROR for hsa-let-7f-5p on the - strand
chr:X start:53557240 stop:53557269
Old sequence: AGTGAGGTAGTAGATTGTATAGTTGTGGGG
New sequence: GATGAGGTAGTAGATTGTATAGTTTTAGGG

ERROR for hsa-miR-16-5p on the + strand
chr:3 start:160404752 stop:160404781
Old sequence: CTTAGCAGCACGTAAATATTGGCGTTAAGA
New sequence: TCTAGCAGCACGTAAATATTGGCGTAGTGA

ERROR for hsa-miR-9-3p on the - strand
chr:5 start:88666860 stop:88666889
Old sequence: TCATAAAGCTAGATAACCGAAAGTAAAAAT
New sequence: TCATAAAGCTAGATAACCGAAAGTAAAAAC

ERROR for hsa-miR-329-5p on the + strand
chr:14 start:101027112 stop:101027142
Old sequence: GAGAGGTTTTCTGGGTTTCTGTTTCTTTAAT
New sequence: GAGAGGTTTTCTGGGTTTCTGTTTCTTTATT
`

Note: hsa-miR-3714-5p is listed on chromosome KV766192.1, which according to NCBI is a later patch of GRCh38. I did try GCA_000001405.26_GRCh38.p11_genomic.fna. Same issue.

Thus, several things are coming to mind.

The issue seems to be on the 5' and 3'-ends. And it seems to effect about 10% of the reads in the miRBase annotation.
Perhaps, accuracy in extending these sequences wasn't important for creating these custom annotations (After all... you're not going to have a canonical isomiR). However, I like to track if any identified isomiRs are following the genomic sequence... which is why I am going through these steps.
My plan is to edit the miRBase_annotations.fasta, extend the 5'-end, and re-create the ebwt files for miRBase only. Is there any consequence in the miRBase pipeline if I end up changing these added NT to what I find in the assembly file? Naturally, there may be edge-cases where the counts vary slightly. I am more concerned if there is some other element in the pipeline that depends on the nucleotides on the ends (some other function/files using the information in the index files other than bowtie).

I appreciate any feedback
-John

Answer 6 · 2023-05-23T00:04:33.000Z

Hi @John-Drake,

Thanks for bringing this to our attention.

Please see my responses below:

The issue seems to be on the 5' and 3'-ends. And it seems to effect about 10% of the reads in the miRBase annotation.
Yes, I checked with the examples you have provided and I have overlooked at few of these examples, the miRNA is coming from two identical family of miRNAs and I have missed to incorporate all of them at a few instances and reported the one of the other coordinates. I have described with the same examples as shown below:

ERROR for hsa-let-7f-5p

Hairpin sequence 
>hsa-let-7f-1 chr9 segs:1-87 cds:+:94176347-94176433
TCAGAG-->TGAGGTAGTAGATTGTATAGTT<--GTGGGGTAGTGATTTTACCCTGTTCAGGAGATAACTATACAATCTATTGCCTTCCCTGA
>hsa-let-7f-2 chrX segs:1-83 cds:-:53557192-53557274
TGTGGGA-->TGAGGTAGTAGATTGTATAGTT<--TTAGGGTCATACCCCATCTTGGAGATAACTATACAGTCTACTGTCTTTCCCACG

ERROR for hsa-miR-16-5p (from miRBase hairpin database)

Hairpin sequence 
(base) arun@DESKTOP-H231NS7:/mnt/c/Users/arun/Downloads$ zgrep "UAGCAGCACGUAAAUAUUGGCG" -B1 hairpin.fa.gz | grep "hsa" -A 1  | grep "UAGCAGCACGUAAAUAUUGGCG" -B 1
>hsa-mir-16-1 MI0000070 Homo sapiens miR-16-1 stem-loop
GUCAGCAGUGCCU-->UAGCAGCACGUAAAUAUUGGCG<--UUAAGAUUCUAAAAUUAUCUCCAGU
--
>hsa-mir-16-2 MI0000115 Homo sapiens miR-16-2 stem-loop
GUUCCACUC-->UAGCAGCACGUAAAUAUUGGCG<--UAGUGAAAUAUAUAUUAAACACCAAUAUU

I will check how many such cases are missed and correct them.

Perhaps, accuracy in extending these sequences wasn't important for creating these custom annotations (After all... you're not going to have a canonical isomiR). However, I like to track if any identified isomiRs are following the genomic sequence... which is why I am going through these steps.

Ok.

My plan is to edit the miRBase_annotations.fasta, extend the 5'-end, and re-create the ebwt files for miRBase only. Is there any consequence in the miRBase pipeline if I end up changing these added NT to what I find in the assembly file? Naturally, there may be edge-cases where the counts vary slightly. I am more concerned if there is some other element in the pipeline that depends on the nucleotides on the ends (some other function/files using the information in the index files other than bowtie).

If you consider 3nt additions at 5' in the database. As I remember correctly, there should not be any problems with the pipeline

I hope this helps. Once again, thank you for briniging this to our attention.

Regards,
Arun.

Answer 7 · 2023-05-29T12:00:18.000Z

Thank you for the information.

Scrupt

I went ahead and wrote a script that check each miRNA against the assembly and miRBase. When significant errors were discovered (e.g. wrong chromosome, position, strand), I manually corrected those. The vast majority of errors I found were small. For instance, 232 miRNA had start positions off by 1 and 59 miRNA had their 5' and 3' ends extended with nucleotides that differ from the assembly. However, I did find around 2-dozen entries with incorrect position/chromosome/strand (which should be obvious in the code where I corrected those).

Few caveats.

I used the miRBase gff file provided in the library.
I did not focus on SNPs. While I found a couple of errors with SNPs and corrected them.... my plan is to disregard the included SNPs and re-create entries for common variants using gnomAD.
This isn't script isn't exhaustive and there may be situations I overlooked.

Hopefully you will find this helpful. And if you do believe my corrections are correct, please tell me as I am always concern that I may have made a mistake.

Thank you
-John

Answer 8 · 2023-05-29T23:45:20.000Z

Hi @John-Drake,

Thank you, this is great. I will take a look at the script and let you know.

Thank you,
Arun.

Answer 9 · 2023-06-05T00:09:46.000Z

Hi @John-Drake,

I went by your notebook, you are right about the genomic co-ordinates, they are not precise. Besides, you have mentioned few miRNAs in the miRge3 database and not in miRBase there is an explanation for that. I am in the middle of re-evaluation/revision of the database and will get back to you with my responses.

For now, the total sequences in miRBase are:
zgrep -c ">hsa-" mature.fa.gz
2656

However, if we add 2 and 6 additional nucleotides the number will be:
2883

As mentioned, I will need somemore time to get back to you on this. Sorry for the delay in my response and thanks again for your comments/questions.

In the meanwhile, find here attached the miRBase database with the addition of 3nt at 5' and 6nt at 3'. This is purely dependent on miRBase and does not consist of SNPs and other miRNAs.
human_mirna_miRBase_JD.fasta.gz

Thank you,
Arun.