lh3/miniprot

Negative length for introns in cs:Z: tag


I noticed this when testing for #33. In the example shown below, the cs:Z: tag uses the following to represent an intron: ~gt-1ag. Is this expected?

$ efetch -db protein -id BAM19251.1 -format fasta > prot.fa
$ efetch -db nucleotide -id NC_069145.1 -format fasta > genome.fa
$ /tmp/miniprot-0.8_x64-linux/miniprot -J 18 genome.fa prot.fa 2>/dev/null
BAM19251.1      210     0       210     -       NC_069145.1     87567467        83367290        83368071        546     630     0       AS:i:946        ms:i:1006       np:i:193        da:i:-1 do:i:0  cg:Z:12M2V41M77V66M78N89M  cs:Z:*acaM*gcaT*tcgL*acgM:2*ggcD*tcgW*agcR:3*gcS~gt-1ag-c:8*atcV:6*gagP*aacA:8*gagA:3*gagQ:1*attL*tctQ:1*atgV:1*gagQ:4*agR~gt74ag-a:6*aagS*acgA:31*atcV:8*tcgA:5*aacT:1*aacT:9~gt78ag:3*gaaQ*acgM:42*gacE:21*gatE:2*gtgC:16
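For anyone wanting to check their own output for the same pattern, here is a minimal sketch that pulls the intron entries out of a cs:Z: string and flags non-positive lengths. It assumes introns are always written as `~`, two donor bases, a (possibly negative) integer, and two acceptor bases, which is what the line above shows; the function name `cs_introns` is made up for illustration and is not part of miniprot.

```python
import re

# Intron in a cs string: '~', 2 donor bases, signed length, 2 acceptor bases.
# This pattern is inferred from the output above, not from a formal spec.
CS_INTRON = re.compile(r"~([a-z]{2})(-?\d+)([a-z]{2})")

def cs_introns(cs):
    """Return a list of (donor, length, acceptor) for each intron in a cs string."""
    return [(d, int(n), a) for d, n, a in CS_INTRON.findall(cs)]

# cs:Z: value copied from the alignment above (without the "cs:Z:" prefix).
cs = ("*acaM*gcaT*tcgL*acgM:2*ggcD*tcgW*agcR:3*gcS~gt-1ag-c:8*atcV:6*gagP"
      "*aacA:8*gagA:3*gagQ:1*attL*tctQ:1*atgV:1*gagQ:4*agR~gt74ag-a:6*aagS"
      "*acgA:31*atcV:8*tcgA:5*aacT:1*aacT:9~gt78ag:3*gaaQ*acgM:42*gacE:21"
      "*gatE:2*gtgC:16")

for donor, length, acceptor in cs_introns(cs):
    note = "  <-- non-positive length" if length <= 0 else ""
    print(f"{donor}..{acceptor} length={length}{note}")
```

On this alignment it reports three introns with lengths -1, 74 and 78, and only the first is flagged.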
lh3 commented

Oh, 2bp intron. Smells like a bug. I will have a look later.

lh3 commented

Actually it occurs to me that -J should be larger than -F; otherwise a frameshift may always be aligned as an intron. You may try reducing the frameshift penalty -F. However, with an excessively small -F, you will get more frameshifts in the alignment.
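The effect is visible in the cg:Z: tag of the alignment above, where `2V` is an "intron" only 2 nt long. Below is a rough sketch for flagging such cases in other alignments. It assumes that N, U and V are the intron operations in the protein CIGAR (my reading of the manpage), and `min_len` is an arbitrary threshold chosen for illustration, not a miniprot parameter.

```python
import re

# One CIGAR operation: a count followed by a single operation letter.
CG_OP = re.compile(r"(\d+)([A-Z])")

# Assumed intron operations in the protein CIGAR (phase 0, 1 and 2).
INTRON_OPS = {"N", "U", "V"}

def short_introns(cg, min_len=10):
    """Return (op_index, length, op) for each intron op shorter than min_len nt."""
    ops = [(int(n), op) for n, op in CG_OP.findall(cg)]
    return [(i, n, op) for i, (n, op) in enumerate(ops)
            if op in INTRON_OPS and n < min_len]

# cg:Z: value copied from the alignment above.
cg = "12M2V41M77V66M78N89M"
print(short_introns(cg))
```

For this alignment only the second operation, `2V`, is reported; `77V` and `78N` pass the threshold.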

> Actually it occurs to me that -J should be larger than -F

Thanks! Looks like at some point in the past, the default value for -F was set to 17 (see manual). I will try reducing it to 20 and test.

lh3 commented

Didn't realize the manpage was that old. Now updated to v0.9.