"Not species specific" when sample is Mycobacterium tuberculosis
When checking species specificity, samples can be discarded because Kraken may classify reads as Mycobacterium, rather than Mycobacterium tuberculosis. However, the code (below) only checks for Mycobacterium tuberculosis.
def runKraken(self):
    ...
    for lines in fh1:
        fields = lines.rstrip("\r\n").split("\t")
        if fields[5].find("Mycobacterium tuberculosis") != -1:
            cov += float(fields[0])
    fh1.close()
    if cov < 90:
        self.__CallCommand('mv', ['mv', self.fOut, self.flog])
        #self.__CallCommand('rm', ['rm', self.kraken + "/kraken.txt"])
        self.__logFH.write("not species specific\n")
        i = datetime.now()
        self.__logFH2.write(i.strftime('%Y/%m/%d %H:%M:%S') + "\t" + "Input:" + "\t" + self.input + "\t" + "not species specific\n")
        sys.exit(2)
The final_report.txt for Kraken contains:
 0.01      407      407  U        0  unclassified
99.99  3225236        0  -        1  root
99.99  3225235        3  -   131567  cellular organisms
99.99  3225231      114  D        2  Bacteria
99.98  3224864        9  -  1783272  Terrabacteria group
99.92  3223162        2  P   201174  Actinobacteria
99.92  3223160        5  C     1760  Actinobacteria
99.92  3223117       21  O    85007  Corynebacteriales
99.92  3223094       87  F     1762  Mycobacteriaceae
99.92  3223004  2709350  G     1763  Mycobacterium
15.83   510550   287089  -    77643  Mycobacterium tuberculosis complex
 6.76   218100   200172  S     1773  Mycobacterium tuberculosis
 0.44    14259    14259  -  1334058  Mycobacterium tuberculosis TRS12
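To see why this sample fails the check, the loop from runKraken can be replayed on the Mycobacterium rows of this report. This is only a minimal sketch: the rows are copied by hand from the report, and `REPORT_ROWS` and `summed_cov` are names made up for illustration, not part of UVP.

```python
# Rows copied from the report in this issue:
# (percent, clade reads, direct reads, rank, taxid, name)
REPORT_ROWS = [
    (99.92, 3223004, 2709350, "G", 1763, "Mycobacterium"),
    (15.83, 510550, 287089, "-", 77643, "Mycobacterium tuberculosis complex"),
    (6.76, 218100, 200172, "S", 1773, "Mycobacterium tuberculosis"),
    (0.44, 14259, 14259, "-", 1334058, "Mycobacterium tuberculosis TRS12"),
]


def summed_cov(rows, needle="Mycobacterium tuberculosis"):
    """Mimic the loop in runKraken (hypothetical helper):
    add up the percentage of every row whose name contains `needle`."""
    cov = 0.0
    for percent, _clade, _direct, _rank, _taxid, name in rows:
        if name.find(needle) != -1:
            cov += percent
    return cov


cov = summed_cov(REPORT_ROWS)
print(round(cov, 2), cov < 90)
```

For these rows the sum is about 23, well below the threshold of 90, so the sample is logged as "not species specific" even though 99.92% of the reads fall under the genus Mycobacterium.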
For reference, here is an accession that I have tested:
- SRR6397355
You are right. I edited the line to check for 'Mycobacterium tuberculosis complex'.
Since Kraken can sometimes be too general in its classification (Mycobacterium instead of Mycobacterium tuberculosis), would changing that line to accept "Mycobacterium" work better? Viable samples would then not be discarded by UVP.
Actually, wouldn't find("Mycobacterium tuberculosis") already match "Mycobacterium tuberculosis complex", in addition to every subsequent line containing "Mycobacterium tuberculosis"?
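Plain substring matching does behave that way: `str.find` returns a non-negative index for any name that contains the search string. The sketch below shows this, along with one possible alternative, which is to read the percentage from a single row selected by taxonomy ID instead of summing several rows. Kraken report percentages cover the whole clade rooted at each taxon, so nested rows count the same reads more than once. The helper `clade_percent` is hypothetical and not part of UVP, and the sketch assumes a tab-separated report with the taxonomy ID in the fifth column.

```python
names = [
    "Mycobacterium",
    "Mycobacterium tuberculosis complex",
    "Mycobacterium tuberculosis",
    "Mycobacterium tuberculosis TRS12",
]
# Every name containing the search string matches, including the "complex" row.
matches = [n for n in names if n.find("Mycobacterium tuberculosis") != -1]


def clade_percent(report_lines, taxid):
    """Return the clade percentage of the row with the given taxonomy ID,
    or 0.0 if no such row exists (hypothetical helper)."""
    for line in report_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 6 and fields[4].strip() == str(taxid):
            return float(fields[0])
    return 0.0


# Rows taken from the report in this issue, written out as tab-separated lines.
report_lines = [
    "99.92\t3223004\t2709350\tG\t1763\tMycobacterium\n",
    "15.83\t510550\t287089\t-\t77643\tMycobacterium tuberculosis complex\n",
    "6.76\t218100\t200172\tS\t1773\tMycobacterium tuberculosis\n",
    "0.44\t14259\t14259\t-\t1334058\tMycobacterium tuberculosis TRS12\n",
]

genus_pct = clade_percent(report_lines, 1763)     # genus Mycobacterium
complex_pct = clade_percent(report_lines, 77643)  # M. tuberculosis complex
print(matches)
print(genus_pct, complex_pct)
```

With these rows, a genus-level check (taxid 1763) would keep the sample, while a complex-level check (taxid 77643) would still discard it. Whether the genus level is specific enough for UVP is a separate question.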
The issue (I think) may lie in the Kraken database used.
I've compared Kraken results from the Galaxy server (using the bacteria database) and from a local machine (using the standard database) that show the same kind of result that matnguyen got. The Galaxy results showed MTBC cov values above 90, while the local results topped out at roughly 25.
Should I be looking at using the Kraken bacteria database instead?