lh3/miniprot

Inconsistent use of operators in the cs tag?

Percud opened this issue · 1 comment

Hi,

According to the PAF format specification, the '*' operator in the cs tag should indicate a substitution, while the '-' operator should indicate a deletion.

However, consider this example, in which a protein is aligned to its corresponding gene:

efetch -db nucleotide -id NC_041770.1 -format fasta > genome.fasta
efetch -db protein -id NP_001181663.2 -format fasta > protein.fasta
miniprot --version
#0.12-r237
miniprot genome.fasta protein.fasta 2>/dev/null
#NP_001181663.2	173	0	173	-	NC_041770.1	95433459	7197672	7208553	519	519	0	AS:i:836	ms:i:871	np:i:173	fs:i:0	st:i:0	da:i:0	do:i:0	cg:Z:58M10365U114M	cs:Z::58*gG~gt10362ag-gc:114

The cs tag contains both a '*' and a '-' operator:

cs:Z::58*gG~gt10362ag-gc:114

I understand that a codon (ggc) is split across the intron here, but I find the '*' and '-' operators confusing because there is no actual substitution or deletion.
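To make the split-codon reading concrete, here is a minimal Python sketch that tokenizes the cs string above and rejoins the codon pieces on either side of the intron. The token pattern covers only the four operators that appear in this example and is an assumption drawn from this one record, not miniprot's full cs grammar. The names `tokenize_cs` and `split_codons` are made up for illustration.

```python
import re

# Pattern for the four operators seen in this issue's cs string only.
# Assumption: '*' is followed by lowercase codon bases and one uppercase
# amino acid letter; this is inferred from the example, not from miniprot docs.
CS_TOKEN = re.compile(
    r"(?P<match>:[0-9]+)"
    r"|(?P<star>\*[a-z]+[A-Z])"
    r"|(?P<intron>~[a-z]{2}[0-9]+[a-z]{2})"
    r"|(?P<dash>-[a-z]+)"
)

def tokenize_cs(cs):
    """Split a cs string into (kind, payload) pairs, payload without the operator."""
    tokens = []
    pos = 0
    while pos < len(cs):
        m = CS_TOKEN.match(cs, pos)
        if m is None:
            raise ValueError(f"unrecognized cs operator at offset {pos}: {cs[pos:]!r}")
        tokens.append((m.lastgroup, m.group()[1:]))
        pos = m.end()
    return tokens

def split_codons(tokens):
    """Find '*' + intron + '-' triples and return (codon, amino_acid, intron_length)."""
    found = []
    for (k1, v1), (k2, v2), (k3, v3) in zip(tokens, tokens[1:], tokens[2:]):
        if (k1, k2, k3) == ("star", "intron", "dash"):
            codon = v1[:-1] + v3        # bases before the intron + bases after it
            intron_len = int(v2[2:-2])  # digits between the donor and acceptor dinucleotides
            found.append((codon, v1[-1], intron_len))
    return found

tokens = tokenize_cs(":58*gG~gt10362ag-gc:114")
print(split_codons(tokens))  # [('ggc', 'G', 10362)]
```

Read this way, `*gG` carries the first base of the codon plus the residue it encodes, and `-gc` carries the remaining two bases after the intron.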

lh3 commented

Sorry for the late response. PAF is designed for nucleotide alignment. The miniprot PAF is not really PAF anyway. The use of "-" is intentional; otherwise I would need to add new operators for split codons. Note that "-" may also be used for frameshifts.
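The coordinates in the PAF line from the report are consistent with this reading. The sketch below checks them under two assumptions inferred from this single record rather than taken from the miniprot documentation: each `M` unit in the cg string counts one amino acid (three genomic bases), and the `U` length is counted in nucleotides and includes the three bases of the split codon (1 + 10362 + 2 = 10365). The function name `cg_summary` is hypothetical.

```python
import re

def cg_summary(cg):
    """Return (aligned_residues, genomic_span) for a cg string using only M and U.

    Assumptions (inferred from the example in this issue):
      M<n>: n amino acids, 3 genomic bases each
      U<n>: n genomic bases, covering one split codon plus the intron
    """
    residues = 0
    span = 0
    for length, op in re.findall(r"([0-9]+)([A-Z])", cg):
        n = int(length)
        if op == "M":
            residues += n
            span += 3 * n
        elif op == "U":
            residues += 1   # the codon split by the intron
            span += n
        else:
            raise ValueError(f"operator {op!r} is not handled in this sketch")
    return residues, span

residues, span = cg_summary("58M10365U114M")
print(residues, span)            # 173 10881
print(7208553 - 7197672)         # 10881, the target span from the PAF line
```

Both totals match the record: 173 aligned residues equals the query length, and 10881 equals the target end minus the target start. This suggests the bases after `*` and `-` are genomic bases that are present in the alignment, not a substitution or deletion in the nucleotide sense.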