jeffdaily/parasail-python

Cigar format using ssw

Opened this issue · 2 comments

I want to do a SW alignment and get both a score, and a cigar as output. I have tried the following two methods using the latest pip installed version of parasail-python:

query_seq = "VNDDGVERRRCSYCC*C*KH**Q**FYKEPFLHKVEELRF*Q*AHKN*EHPHGP*LN**GSISNSTR*TP*RPEG*TPKECG*DPAGNGMDRPPHGPESTNLGGVKPNRKARDTQKQFKLV*CWWFGTS*HPYWLCEEYRATTP*PPSQVECQ*WG*SWGRQFNNCARSSMESVPHATN*PTQQWSSKGTTYW*HDRIL*GAHLPLCNIHWPRKNCWGSLTSSGLFNSQDNYTPYRRLVEGLLCANT*SEVKL*PTQRT*E*AMAICCSY**LL*VLLQDSQSSGLLVGTSLGHIWPFQPPSTRLVQGHA*LPTR*QVRNSQGCRACCYQWLGIKTF*RP"
ref_seq = "VNDDGVERRRCSYCC*C*KH**Q**FYKEPFLHKVEELRF*Q*AHKN*EHPHGP*LN**GSISNSTR*TP*RPEG*TPKECG*DPAGNGMDRPPHGPESTNLGGVKPNRKARDTQKQFKLV*CWWFGTS*HPYWLCEEYRATTP*PPSQVECQ*WG*SWGRQFNNCARSSMESVPHATN*PTQQWSSKGTTYW*HDRIL*GAHLPLCNIHWPRKNCWGSLTSSGLFNSQDNYTPYRRLVEGLLCANT*SEVKL*PTQRT*E*AMAICCSY**LL*VLLQDSQSSGLLVGTSLGHIWPFQPPSTRLVQGHA*LPTR*QVRNSQGCRACCYQWLGIKTF*RPPKQT*ATQRAQPTL*L*LDIYGGG*DDNTTYGAVWDLLESP*CVQLYSLIIA*YALARTRRSG*RLNTVGIGRCGACNWVHQGQSYKGHEECC*RTQGSHTIRPVWPRDICSFKEIFLWWRSNREDLK"
result = parasail.ssw(query_seq, ref_seq, 10, 1, parasail.blosum62)
print(result.score1)
print(result.cigar)
print(result.ref_begin1)
print(result.ref_end1)
print(result.read_begin1)
print(result.read_end1)
#print(result.cigar.decode)
result = parasail.sw_trace(query_seq, ref_seq, 10, 1, parasail.blosum62)
print(result.cigar)
print(result.cigar.seq)
print(result.cigar.decode)

which yields

1783
[5447]
0
339
0
339
<parasail.bindings_v2.Cigar object at 0x7fa0203b7780>
[5447]
b'340='

It seems that with the first method, using ssw, I can get the score but can't get the cigar in the standard format (instead, printing result.cigar.decode gives error message "AttributeError: 'numpy.ndarray' object has no attribute 'decode'"). Using the sw_trace method I can get the decoded cigar, but not the score? Is it possible to access the decoded cigar (the number 5447 doesn't mean anything to me? How should I interpret it?) in ssw?

I had to go back and remind myself why I wrote the ssw python routine. The parasail C library has a parasail_ssw routine that tries to mimic the SSW library using parasail routines. It returns a specific kind of result (parasail_result_ssw). The SSW C library has this comment on their C struct:

	@field	cigar	best alignment cigar; stored the same as that in BAM format, high 28 bits: length, low 4 bits: M/I/D (0/1/2);
					cigar = 0 when the best alignment path is not available
	@field	cigarLen	length of the cigar string; cigarLen = 0 when the best alignment path is not available

I did not provide any user-friendly decode of the raw binary cigar from the python wrapper. It might be slower than a C equivalent, but you can write one in Python like so:

def decode(b):
    __BAM_CIGAR_STR = 'MIDNSHP=X8'
    def _decode(x):
        l = str(x>>4)
        try:
            c = __BAM_CIGAR_STR[x&0xf]
        except:
            c = 'M'
        return l+c
    return ''.join([_decode(i) for i in b])

Thanks! I'll try the python decode function you suggest