CPTR-ReSeqTB/UVP

"Not species specific" when sample is Mycobacterium Tuberculosis

Closed this issue · 5 comments

When checking species specificity, UVP can discard samples because Kraken may classify reads as Mycobacterium rather than Mycobacterium tuberculosis. However, the code (below) only checks for Mycobacterium tuberculosis.

def runKraken(self):
    ...
    for lines in fh1:
        fields = lines.rstrip("\r\n").split("\t")
        if fields[5].find("Mycobacterium tuberculosis") != -1:
            cov += float(fields[0])
    fh1.close()
    if cov < 90:
        self.__CallCommand('mv', ['mv', self.fOut, self.flog])
        #self.__CallCommand('rm', ['rm', self.kraken + "/kraken.txt"])
        self.__logFH.write("not species specific\n")
        i = datetime.now()
        self.__logFH2.write(i.strftime('%Y/%m/%d %H:%M:%S') + "\t" + "Input:" + "\t" + self.input + "\t" + "not species specific\n")
        sys.exit(2)

The final_report.txt for Kraken contains:

  0.01	407	407	U	0	unclassified
 99.99	3225236	0	-	1	root
 99.99	3225235	3	-	131567	  cellular organisms
 99.99	3225231	114	D	2	    Bacteria
 99.98	3224864	9	-	1783272	      Terrabacteria group
 99.92	3223162	2	P	201174	        Actinobacteria
 99.92	3223160	5	C	1760	          Actinobacteria
 99.92	3223117	21	O	85007	            Corynebacteriales
 99.92	3223094	87	F	1762	              Mycobacteriaceae
 99.92	3223004	2709350	G	1763	                Mycobacterium
 15.83	510550	287089	-	77643	                  Mycobacterium tuberculosis complex
  6.76	218100	200172	S	1773	                    Mycobacterium tuberculosis
  0.44	14259	14259	-	1334058	                      Mycobacterium tuberculosis TRS12
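
For illustration, the check can be reproduced on the report rows shown above. This is only a minimal sketch, not UVP code: REPORT and summed_cov are hypothetical names, and it assumes the rows in the excerpt are the only ones in the report whose name contains "Mycobacterium tuberculosis".

```python
# Rows copied from the Kraken report excerpt (tab-separated columns:
# percent, clade reads, direct reads, rank code, taxid, indented name).
REPORT = "\n".join([
    "99.92\t3223094\t87\tF\t1762\t              Mycobacteriaceae",
    "99.92\t3223004\t2709350\tG\t1763\t                Mycobacterium",
    "15.83\t510550\t287089\t-\t77643\t                  Mycobacterium tuberculosis complex",
    " 6.76\t218100\t200172\tS\t1773\t                    Mycobacterium tuberculosis",
    " 0.44\t14259\t14259\t-\t1334058\t                      Mycobacterium tuberculosis TRS12",
])


def summed_cov(report_text, needle):
    """Sum column 1 over every row whose name contains `needle`,
    mirroring the loop in the runKraken excerpt (hypothetical helper)."""
    cov = 0.0
    for line in report_text.splitlines():
        fields = line.rstrip("\r\n").split("\t")
        if fields[5].find(needle) != -1:
            cov += float(fields[0])
    return cov


cov = summed_cov(REPORT, "Mycobacterium tuberculosis")
print(round(cov, 2), cov < 90)
```

On these rows the three matching lines (15.83, 6.76 and 0.44) add up to about 23, so the `cov < 90` branch is taken and the sample is reported as not species specific.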

For reference, here is an accession that I have tested:

  • SRR6397355

You are right. I edited the line to check for 'Mycobacterium tuberculosis complex'.

Since Kraken can sometimes be too general in its classification (Mycobacterium instead of Mycobacterium tuberculosis), would it work better to change that line to accept "Mycobacterium"? That way, viable samples would not be discarded by UVP.
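
One thing to keep in mind with a genus-level match: the first column of a Kraken report is the percentage of reads in the clade rooted at that taxon, so child rows are already counted in their parent row. The sketch below uses rows copied from the report in this issue to compare summing over a substring match with reading a single clade row by taxid. ROWS and clade_percent are hypothetical names for illustration, not part of UVP, and whether a genus-level threshold is acceptable is a separate question.

```python
# Rows copied from the Kraken report in this issue (columns: percent of reads
# in the clade, clade reads, direct reads, rank code, NCBI taxid, name).
ROWS = [
    "99.92\t3223004\t2709350\tG\t1763\t                Mycobacterium",
    "15.83\t510550\t287089\t-\t77643\t                  Mycobacterium tuberculosis complex",
    " 6.76\t218100\t200172\tS\t1773\t                    Mycobacterium tuberculosis",
    " 0.44\t14259\t14259\t-\t1334058\t                      Mycobacterium tuberculosis TRS12",
]


def clade_percent(rows, taxid):
    """Return the clade percentage of the row with this taxid, or 0.0 if absent
    (hypothetical helper)."""
    for row in rows:
        fields = row.rstrip("\r\n").split("\t")
        if fields[4].strip() == str(taxid):
            return float(fields[0])
    return 0.0


# Summing every row whose name contains "Mycobacterium" counts nested clades
# more than once, so the total can exceed 100.
summed = sum(
    float(row.split("\t")[0])
    for row in ROWS
    if "Mycobacterium" in row.split("\t")[5]
)

genus = clade_percent(ROWS, 1763)   # Mycobacterium (genus row)
mtbc = clade_percent(ROWS, 77643)   # Mycobacterium tuberculosis complex
print(round(summed, 2), genus, mtbc)
```

On these rows the substring sum comes to about 122.95, while the genus row alone reads 99.92 and the complex row alone reads 15.83.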

Actually, wouldn't find("Mycobacterium tuberculosis") already match "Mycobacterium tuberculosis complex", in addition to every subsequent line containing "Mycobacterium tuberculosis"?
The issue (I think) may lie in the Kraken database used.
I've compared Kraken results from the Galaxy server (using the bacteria database) with results from a local machine (using the standard database), and they show the same kind of result that matnguyen got. The Galaxy results showed MTBC cov values above 90, while the local versions topped out at roughly 25.
Should I be looking at using the Kraken bacteria database instead?
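
The substring question above can be checked directly with the names from the report. A quick sketch (the names list is copied from column 6 of the report in this issue; the variable names are just for illustration):

```python
# Names as they appear in column 6 of the report, including leading indentation.
names = [
    "                Mycobacterium",
    "                  Mycobacterium tuberculosis complex",
    "                    Mycobacterium tuberculosis",
    "                      Mycobacterium tuberculosis TRS12",
]

needle = "Mycobacterium tuberculosis"

# str.find returns the index of the first match, or -1 when there is none,
# so any name that merely contains the needle passes the `!= -1` test.
matches = [name.strip() for name in names if name.find(needle) != -1]
print(matches)
```

The complex, species and strain rows all pass the test, and only the bare genus row is skipped.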