Alignments with inconsistent CIGAR/sequence length
insectopalo opened this issue · 1 comment
When running the C program to output a SAM file,
ssw_test -r region_of_interest.fa -c -s -h 3553-CT_goldenreads.fastq > alignment.sam
I've noticed that the resulting SAM file does not comply with the SAM format specification:
"Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ" [1].
Example from actual output:
HWI-ST1309F:275:C8E2LANXX:3:1101:10013:85607 16 chrRCRS:6500-14600 2688 4 74=4I1X4=1I2X1=1X2=1D2=3I4=1X2=1I2=18S * 0 0 TACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACCCCTATGCCTCAGGATACTCTTCAATAGCCATCGCT F7<</<B7<<<FF/FBFB/FFFB/FFFFFFFFFBFF7<F/FBFFF<BBFFFFFFFFBFFFFBFBBFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFBFFBBBBB AS:i:152 NM:i:124 ZS:i:142
The length of the sequence reported in that entry is 105:
len(TACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACCCCTATGCCTCAGGATACTCTTCAATAGCCATCGCT) = 105
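The CIGAR string is 74=4I1X4=1I2X1=1X2=1D2=3I4=1X2=1I2=18S,
which means 74+4+1+4+1+2+1+1+2+2+3+4+1+2+1+2+18=123 (the 1D deletion is excluded, since D does not count towards the length of SEQ).
The difference, 123 - 105 = 18, matches the trailing 18S exactly, so it seems that the soft-clipped residues are not being reported in the SEQ field.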
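For anyone who wants to check a whole file for this, here is a minimal Python sketch of the rule quoted above. It is not part of ssw_test; the function names `cigar_query_length` and `find_length_mismatches` are made up for illustration, and it assumes standard tab-separated SAM records with CIGAR in column 6 and SEQ in column 10.

```python
import re

# Operations whose lengths must sum to len(SEQ), per the SAM rule quoted above.
QUERY_OPS = set("MIS=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_query_length(cigar):
    """Sum the lengths of the M/I/S/=/X operations in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in QUERY_OPS)


def find_length_mismatches(lines):
    """Yield (qname, cigar_length, seq_length) for records that break the rule."""
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        qname, cigar, seq = fields[0], fields[5], fields[9]
        if cigar == "*" or seq == "*":  # nothing to compare
            continue
        cigar_len = cigar_query_length(cigar)
        if cigar_len != len(seq):
            yield qname, cigar_len, len(seq)


if __name__ == "__main__":
    # The CIGAR string from the record above.
    print(cigar_query_length("74=4I1X4=1I2X1=1X2=1D2=3I4=1X2=1I2=18S"))  # 123
```

To scan a file, something like `for hit in find_length_mismatches(open("alignment.sam")): print(hit)` should list every offending record.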
Cheers,
Gon
Dear Gon,
I apologize for the late reply.
Thank you for pointing this problem out. I've fixed this error. Please
check the latest version.
Yours,
Mengyao
[1] https://samtools.github.io/hts-specs/SAMv1.pdf