Inconsistent use of operators in the cs tag?
Percud opened this issue · 1 comment
Hi,
According to the PAF format specification, the '*' operator should indicate a substitution, while the '-' operator should indicate a deletion.
However, consider this example, in which a protein is aligned to its corresponding gene:
efetch -db nucleotide -id NC_041770.1 -format fasta > genome.fasta
efetch -db protein -id NP_001181663.2 -format fasta > protein.fasta
miniprot --version
#0.12-r237
miniprot genome.fasta protein.fasta 2>/dev/null
#NP_001181663.2 173 0 173 - NC_041770.1 95433459 7197672 7208553 519 519 0 AS:i:836 ms:i:871 np:i:173 fs:i:0 st:i:0 da:i:0 do:i:0 cg:Z:58M10365U114M cs:Z::58*gG~gt10362ag-gc:114
The cs tag contains both a '*' and a '-' operator:
cs:Z::58*gG~gt10362ag-gc:114
I understand that there is a split codon (ggc) here, but I find the '*' and '-' operators confusing because there is no actual substitution or deletion.
Sorry for the late response. PAF is designed for nucleotide alignment. The miniprot PAF is not really PAF anyway. The use of "-" is intentional; otherwise I would need to add new operators for split codons. Note that "-" may also be used for frameshift.
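To make the split-codon reading concrete, here is a minimal Python sketch that breaks the cs string from this issue into its operators. The token grammar in `CS_TOKEN` is an assumption inferred from the example and from the usual cs conventions; it is not taken from miniprot's source, and `split_cs` is a hypothetical helper name.

```python
import re

# Assumed grammar for a miniprot-style cs string (not miniprot's own parser):
#   :N          run of N identical residues
#   *<nt><AA>   reference nucleotides followed by the query amino acid
#   +<AA...>    residues present only in the query
#   -<nt...>    nucleotides present only in the reference
#   ~xxNyy      intron of length N with splice signals xx and yy
CS_TOKEN = re.compile(
    r":[0-9]+"
    r"|\*[acgtn]+[A-Z*]"
    r"|\+[A-Z*]+"
    r"|-[acgtn]+"
    r"|~[acgtn]{2}[0-9]+[acgtn]{2}"
)


def split_cs(cs):
    """Return a list of (operator, payload) pairs for a cs string."""
    tokens = []
    pos = 0
    while pos < len(cs):
        m = CS_TOKEN.match(cs, pos)
        if m is None:
            raise ValueError(f"unrecognized cs operator at offset {pos}: {cs[pos:]!r}")
        text = m.group(0)
        tokens.append((text[0], text[1:]))
        pos = m.end()
    return tokens


# Value of the cs tag from the alignment above, without the "cs:Z:" prefix.
for op, payload in split_cs(":58*gG~gt10362ag-gc:114"):
    print(op, payload)
```

Read this way, the '*gG' token carries the first base of the codon (g) together with the amino acid it encodes (G), the '~' token is the intron, and '-gc' holds the remaining two bases of the same ggc codon on the other side of the intron. Neither token reports a real substitution or deletion.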