pritykinlab/guidescan-cli

specificity values > 1 for certain kmers

vineetbansal opened this issue · 1 comments

When using the following kmer file:

id,sequence,pam,chromosome,position,sense
AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCT,NGG,unknown,0,+

with guidescan enumerate (version v2.1.6) run against hg38_noalt.index (mismatches 3, alt-pam NAG), we get the following SAM line:

AAGACTGTGCGCTAATCTCT_1	0	unknown	0	100	23M	*	0	0	AAGACTGTGCGCTAATCTCTNGG	*	k0:i:1	k1:i:0	k2:i:0	k3:i:3	of:H:c53bf70d00000000000000000000000092ef3a47ffffffff010000000000000092ef3a47ffffffff020000000000000092ef3a47ffffffff9ba7545c000000009c2c699e000000002afe3c2000000000030000000000000092ef3a47ffffffff	sp:f:2.391802

or the following CSV lines (succinct mode):

id,sequence,match_chrm,match_position,match_strand,match_distance,specificity
AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr1,234306480,+,0,2.391802
AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr9,12562870,+,3,2.391802
AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr19,3281519,+,3,2.391802
AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr3,49718166,+,3,2.391802

Something is clearly wrong here, since the reported specificity (2.391802) is greater than 1.
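To find other affected kmers, the succinct CSV output can be scanned for specificity values above 1. This is only a minimal sketch: it assumes the column names shown in the succinct output above, and the helper name `rows_with_specificity_above_one` is made up for illustration and is not part of guidescan.

```python
import csv
import io


def rows_with_specificity_above_one(csv_text):
    """Return the CSV rows whose 'specificity' column exceeds 1.

    Assumes the succinct-mode header shown in this issue.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row["specificity"]) > 1.0]


# The succinct-mode lines quoted in this issue.
example = "\n".join([
    "id,sequence,match_chrm,match_position,match_strand,match_distance,specificity",
    "AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr1,234306480,+,0,2.391802",
    "AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr9,12562870,+,3,2.391802",
    "AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr19,3281519,+,3,2.391802",
    "AAGACTGTGCGCTAATCTCT_1,AAGACTGTGCGCTAATCTCTNGG,chr3,49718166,+,3,2.391802",
])

flagged = rows_with_specificity_above_one(example)
for row in flagged:
    print(row["id"], row["match_chrm"], row["match_position"], row["specificity"])
```

All four rows of the example are flagged, because succinct mode repeats the same specificity on every match line for a kmer.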

The actual sequence found in the fna file using grep is AAGACTGTGCGCTAATCTCTTAG (i.e. with the alt-PAM NAG, not NGG). This indicates that the match reported in both the CSV and SAM output is incorrect: the NGG was added automatically. All cases of specificity > 1 detected so far seem to involve matches that have the NAG PAM.
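The PAM check described above can be sketched in a few lines. This is an illustration only, not guidescan's implementation: the function names are hypothetical, and `N` is treated as matching any base.

```python
def matches_pam(site, pam):
    """Return True if the last len(pam) bases of `site` match `pam`.

    'N' in the PAM pattern matches any base; other letters must match exactly.
    """
    tail = site[-len(pam):].upper()
    if len(tail) != len(pam):
        return False
    return all(p == "N" or p == b for p, b in zip(pam.upper(), tail))


def classify_pam(site, pams=("NGG", "NAG")):
    """Return the first PAM pattern that `site` ends with, or None."""
    for pam in pams:
        if matches_pam(site, pam):
            return pam
    return None


# The 23-mer found in the genome with grep, as quoted in this issue.
genomic_site = "AAGACTGTGCGCTAATCTCTTAG"
print(classify_pam(genomic_site))
```

For the genomic sequence above this reports NAG, whereas the reported match sequence carries NGG.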