lh3/miniprot

Valid introns being reported as deletions

vkkodali opened this issue · 7 comments

I am noticing cases where a protein aligns to the genome such that a valid intron is being reported as a deletion. Is there a setting that can be used to control this behavior? Perhaps, the ability to set minimum intron size. While it is rare to find introns <~70 bp (and they're almost non-existent <~60 bp) in humans, that's not always the case in other organisms.

An example case is shown below:

## fetch protein and genome sequences using EntrezDirect
$ efetch -db protein -id XP_052862054.1 -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069144.1 -format fasta > genome.fa
$ miniprot --version
0.8-r220
$ miniprot genome.fa prot.fa 2>/dev/null
XP_052862054.1  502     0       502     +       NC_069144.1     62075725        33016218        33018194        1506    1572    0       AS:i:2390       ms:i:2522       np:i:502        da:i:0  do:i:0cg:Z:9M202N174M68N67M58N39M76N109M22D104M        cs:Z::9~gt202ag:174~gt68ag:67~gt58ag:39~gt76ag:109-gttagaaatggtatatctttttatgtcggtaaaatatatcacaaaacgtgttttatcattccacag:104

The issue here is the 22D104M in the CIGAR string. That should have been a 66 nt intron instead of a 66 nt deletion, as shown in the following screenshot:
image

Another example is XP_052871965.1 aligned to NC_069144.1. Here, the first intron is represented as a deletion:

$ efetch -db protein -id XP_052871965.1 -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069144.1 -format fasta > genome.fa
$ miniprot-0.8_x64-linux/miniprot genome.fa prot.fa 2>/dev/null
XP_052871965.1  194     0       194     +       NC_069144.1     62075725        33940642        33941373        582     660     0       AS:i:952        ms:i:985        np:i:194        da:i:0  do:i:0cg:Z:34M26D99M74U60M     cs:Z::34-ttagtttgacgctagactgcgattataactaaagtgcttaataataatcaaacatttctttcaactccttgtcacagt:99*aS~gt71ag-gc:60
lh3 commented

The issue here is the 22D104M in the CIGAR string. That should have been a 66 nt intron instead of a 66 nt deletion

Do you mean 22bp?

Is there a setting that can be used to control this behavior?

You may reduce intron penalty with option -J. The minimum intron size on the perfect condition is equal to (J-O)/E+1. The actual threshold at a particular junction depends on the splice signal and if there is a split codon. It is often larger than (J-O)/E+1.

Do you mean 22bp?

Since these are protein-to-genome alignments, the CIGAR string values for the operator D are in amino acids (and need to be multiplied by 3 to get the nucleotide sizes) where as the intron operators N, V, U are in nucleotides.

You may reduce intron penalty with option -J.

Do you have a recommendation for how much I should reduce it to? For the example XP_052862054.1, I had to change the value of 24 to get the correct call. But for XP_052871965.1, the value had to be further reduced to 23. Is it okay to reduce it to, say, 18, or even lower?

lh3 commented

Oh, I forgot the factor of 3. The minimum intron size on the perfect condition should be 3*(J-O)/E+1.

Do you have a recommendation for how much I should reduce it to?

At least larger than -O. An excessively small -J may find many small introns correctly but may also lead to false micro introns elsewhere. You need to evaluate across the whole genome to choose the balance.

I ran miniprot on a larger sample to evaluate the effects of reducing -J. Using this option mitigates the issue of incorrectly calling legitimate introns as deletions. However, for proteins that align with an unaligned region in the terimini, -J 22 creates bogus introns. I provide a couple of examples below:

$ efetch -db protein -id ARD71195.1  -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069144.1 -format fasta > genome.fa
## removed cs:Z: tag for brevity
$ miniprot genome.fa prot.fa 2>/dev/null
ARD71195.1      624     127     614     -       NC_069144.1     62075725        3236135 3250058 633     1470    0       AS:i:888        ms:i:1126       np:i:313        da:i:105        do:i:15 cg:Z:23M5201U38M78N41M106N78M1D11M133V36M6617U65M89V1M2D80M146V85M101V23M   
$ miniprot -J 22 genome.fa prot.fa 2>/dev/null
ARD71195.1      624     59      614     -       NC_069144.1     62075725        3236135 3329666 675     1698    0       AS:i:951        ms:i:1155       np:i:346        da:i:78 do:i:15 cg:Z:31M79383U19M6D17M2D23M5201U38M78N41M106N78M1D11M133V36M6617U65M89V1M2D80M146V85M101V23M        

Here, using a lower value for -J added a bogus intron 79383U seen at the beginning of the CIGAR string. In another example, AVN97884.1 aligns to NC_069144.1 with a 258 nt bogus intron when I use -J 22. Is there any way to avoid these?

lh3 commented

With the latest version, it is recommended to apply -I when the reference is a genome. It will alleviate some of the issues. The long intron case can be fixed with that but the 258 bp one won't. Aligning distant homology is hard. You have to make tradeoff between different types of errors but it is hard to fix all of them.

I was actually looking into the -I option. I think for small eukaryotes like flies, worms, etc, the default value of 200k (from -G) should be appropriate. For something like human, we typically use 1.2M. With the current formula of 3.6 * sqrt(refLen), that value ends up being ~200kb for human which I think is a bit on the lower side.

You have to make tradeoff between different types of errors but it is hard to fix all of them.

Yup, there's no way to get them all correct, all the time.

lh3 commented

-I is also a tradeoff. To get a few extra long introns, miniprot will report a lot more wrong long introns especially for distant homologs. There might be better heuristics but for now, it is recommended to use -I.