Valid introns being reported as deletions
vkkodali opened this issue · 7 comments
I am noticing cases where a protein aligns to the genome such that a valid intron is being reported as a deletion. Is there a setting that can be used to control this behavior? Perhaps, the ability to set minimum intron size. While it is rare to find introns <~70 bp (and they're almost non-existent <~60 bp) in humans, that's not always the case in other organisms.
An example case is shown below:
## fetch protein and genome sequences using EntrezDirect
$ efetch -db protein -id XP_052862054.1 -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069144.1 -format fasta > genome.fa
$ miniprot --version
0.8-r220
$ miniprot genome.fa prot.fa 2>/dev/null
XP_052862054.1 502 0 502 + NC_069144.1 62075725 33016218 33018194 1506 1572 0 AS:i:2390 ms:i:2522 np:i:502 da:i:0 do:i:0cg:Z:9M202N174M68N67M58N39M76N109M22D104M cs:Z::9~gt202ag:174~gt68ag:67~gt58ag:39~gt76ag:109-gttagaaatggtatatctttttatgtcggtaaaatatatcacaaaacgtgttttatcattccacag:104
The issue here is the 22D104M
in the CIGAR string. That should have been a 66 nt intron instead of a 66 nt deletion, as shown in the following screenshot:
Another example is XP_052871965.1 aligned to NC_069144.1. Here, the first intron is represented as a deletion:
$ efetch -db protein -id XP_052871965.1 -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069144.1 -format fasta > genome.fa
$ miniprot-0.8_x64-linux/miniprot genome.fa prot.fa 2>/dev/null
XP_052871965.1 194 0 194 + NC_069144.1 62075725 33940642 33941373 582 660 0 AS:i:952 ms:i:985 np:i:194 da:i:0 do:i:0cg:Z:34M26D99M74U60M cs:Z::34-ttagtttgacgctagactgcgattataactaaagtgcttaataataatcaaacatttctttcaactccttgtcacagt:99*aS~gt71ag-gc:60
The issue here is the 22D104M in the CIGAR string. That should have been a 66 nt intron instead of a 66 nt deletion
Do you mean 22bp?
Is there a setting that can be used to control this behavior?
You may reduce intron penalty with option -J
. The minimum intron size on the perfect condition is equal to (J-O)/E+1
. The actual threshold at a particular junction depends on the splice signal and if there is a split codon. It is often larger than (J-O)/E+1
.
Do you mean 22bp?
Since these are protein-to-genome alignments, the CIGAR string values for the operator D
are in amino acids (and need to be multiplied by 3 to get the nucleotide sizes) where as the intron operators N, V, U
are in nucleotides.
You may reduce intron penalty with option
-J
.
Do you have a recommendation for how much I should reduce it to? For the example XP_052862054.1, I had to change the value of 24 to get the correct call. But for XP_052871965.1, the value had to be further reduced to 23. Is it okay to reduce it to, say, 18, or even lower?
Oh, I forgot the factor of 3. The minimum intron size on the perfect condition should be 3*(J-O)/E+1
.
Do you have a recommendation for how much I should reduce it to?
At least larger than -O
. An excessively small -J
may find many small introns correctly but may also lead to false micro introns elsewhere. You need to evaluate across the whole genome to choose the balance.
I ran miniprot
on a larger sample to evaluate the effects of reducing -J
. Using this option mitigates the issue of incorrectly calling legitimate introns as deletions. However, for proteins that align with an unaligned region in the terimini, -J 22
creates bogus introns. I provide a couple of examples below:
$ efetch -db protein -id ARD71195.1 -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069144.1 -format fasta > genome.fa
## removed cs:Z: tag for brevity
$ miniprot genome.fa prot.fa 2>/dev/null
ARD71195.1 624 127 614 - NC_069144.1 62075725 3236135 3250058 633 1470 0 AS:i:888 ms:i:1126 np:i:313 da:i:105 do:i:15 cg:Z:23M5201U38M78N41M106N78M1D11M133V36M6617U65M89V1M2D80M146V85M101V23M
$ miniprot -J 22 genome.fa prot.fa 2>/dev/null
ARD71195.1 624 59 614 - NC_069144.1 62075725 3236135 3329666 675 1698 0 AS:i:951 ms:i:1155 np:i:346 da:i:78 do:i:15 cg:Z:31M79383U19M6D17M2D23M5201U38M78N41M106N78M1D11M133V36M6617U65M89V1M2D80M146V85M101V23M
Here, using a lower value for -J
added a bogus intron 79383U
seen at the beginning of the CIGAR string. In another example, AVN97884.1 aligns to NC_069144.1 with a 258 nt bogus intron when I use -J 22
. Is there any way to avoid these?
With the latest version, it is recommended to apply -I
when the reference is a genome. It will alleviate some of the issues. The long intron case can be fixed with that but the 258 bp one won't. Aligning distant homology is hard. You have to make tradeoff between different types of errors but it is hard to fix all of them.
I was actually looking into the -I
option. I think for small eukaryotes like flies, worms, etc, the default value of 200k (from -G
) should be appropriate. For something like human, we typically use 1.2M. With the current formula of 3.6 * sqrt(refLen)
, that value ends up being ~200kb for human which I think is a bit on the lower side.
You have to make tradeoff between different types of errors but it is hard to fix all of them.
Yup, there's no way to get them all correct, all the time.
-I
is also a tradeoff. To get a few extra long introns, miniprot will report a lot more wrong long introns especially for distant homologs. There might be better heuristics but for now, it is recommended to use -I
.