Cigar format using ssw
Opened this issue · 2 comments
I want to do a SW alignment and get both a score, and a cigar as output. I have tried the following two methods using the latest pip installed version of parasail-python
:
query_seq = "VNDDGVERRRCSYCC*C*KH**Q**FYKEPFLHKVEELRF*Q*AHKN*EHPHGP*LN**GSISNSTR*TP*RPEG*TPKECG*DPAGNGMDRPPHGPESTNLGGVKPNRKARDTQKQFKLV*CWWFGTS*HPYWLCEEYRATTP*PPSQVECQ*WG*SWGRQFNNCARSSMESVPHATN*PTQQWSSKGTTYW*HDRIL*GAHLPLCNIHWPRKNCWGSLTSSGLFNSQDNYTPYRRLVEGLLCANT*SEVKL*PTQRT*E*AMAICCSY**LL*VLLQDSQSSGLLVGTSLGHIWPFQPPSTRLVQGHA*LPTR*QVRNSQGCRACCYQWLGIKTF*RP"
ref_seq = "VNDDGVERRRCSYCC*C*KH**Q**FYKEPFLHKVEELRF*Q*AHKN*EHPHGP*LN**GSISNSTR*TP*RPEG*TPKECG*DPAGNGMDRPPHGPESTNLGGVKPNRKARDTQKQFKLV*CWWFGTS*HPYWLCEEYRATTP*PPSQVECQ*WG*SWGRQFNNCARSSMESVPHATN*PTQQWSSKGTTYW*HDRIL*GAHLPLCNIHWPRKNCWGSLTSSGLFNSQDNYTPYRRLVEGLLCANT*SEVKL*PTQRT*E*AMAICCSY**LL*VLLQDSQSSGLLVGTSLGHIWPFQPPSTRLVQGHA*LPTR*QVRNSQGCRACCYQWLGIKTF*RPPKQT*ATQRAQPTL*L*LDIYGGG*DDNTTYGAVWDLLESP*CVQLYSLIIA*YALARTRRSG*RLNTVGIGRCGACNWVHQGQSYKGHEECC*RTQGSHTIRPVWPRDICSFKEIFLWWRSNREDLK"
result = parasail.ssw(query_seq, ref_seq, 10, 1, parasail.blosum62)
print(result.score1)
print(result.cigar)
print(result.ref_begin1)
print(result.ref_end1)
print(result.read_begin1)
print(result.read_end1)
#print(result.cigar.decode)
result = parasail.sw_trace(query_seq, ref_seq, 10, 1, parasail.blosum62)
print(result.cigar)
print(result.cigar.seq)
print(result.cigar.decode)
which yields
1783
[5447]
0
339
0
339
<parasail.bindings_v2.Cigar object at 0x7fa0203b7780>
[5447]
b'340='
It seems that with the first method, using ssw
, I can get the score but can't get the cigar in the standard format (instead, printing result.cigar.decode
gives error message "AttributeError: 'numpy.ndarray' object has no attribute 'decode'"). Using the sw_trace
method I can get the decoded cigar, but not the score? Is it possible to access the decoded cigar (the number 5447 doesn't mean anything to me? How should I interpret it?) in ssw
?
I had to go back and remind myself why I wrote the ssw
python routine. The parasail C library has a parasail_ssw
routine that tries to mimic the SSW library using parasail routines. It returns a specific kind of result (parasail_result_ssw
). The SSW C library has this comment on their C struct:
@field cigar best alignment cigar; stored the same as that in BAM format, high 28 bits: length, low 4 bits: M/I/D (0/1/2);
cigar = 0 when the best alignment path is not available
@field cigarLen length of the cigar string; cigarLen = 0 when the best alignment path is not available
I did not provide any user-friendly decode of the raw binary cigar from the python wrapper. It might be slower than a C equivalent, but you can write one in Python like so:
def decode(b):
__BAM_CIGAR_STR = 'MIDNSHP=X8'
def _decode(x):
l = str(x>>4)
try:
c = __BAM_CIGAR_STR[x&0xf]
except:
c = 'M'
return l+c
return ''.join([_decode(i) for i in b])
Thanks! I'll try the python decode function you suggest