jodyphelan/NTM-Profiler

Output columns

harrismia opened this issue · 7 comments

Hi Jody,

I tried running it against one of our samples, SRR315364. Just to make sure I understand the output, could you explain Coverage and Coverage_SD? I assume SD means sequencing depth?

Thanks!

Michael

Hi Michael,

Sorry, I should put up some output descriptions!
The pipeline basically looks for kmers which I have associated with certain species. The coverage reports the average count across the kmers for a species, and coverage_SD is the standard deviation of those counts (not sequencing depth). At the moment, a species is reported as present if >2/20 kmers are found. If this approach turns out to be too simplistic, I might use other stats such as the SD in future, but at the moment it is just printed for info.
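As a rough illustration of the description above, the per-species summary could be sketched like this. This is not the NTM-Profiler code: the function name, the dictionary keys, averaging over all kmers (including those with zero count), and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def summarise_species(kmer_counts, min_kmers=2):
    """Summarise counts for the kmers associated with one species.

    kmer_counts: one read count per species-specific kmer (hypothetical input).
    min_kmers:   the species is called present if MORE than this many kmers
                 are found (the ">2/20" rule described in the thread).
    """
    found = [c for c in kmer_counts if c > 0]
    return {
        "kmers_found": len(found),
        "kmers_total": len(kmer_counts),
        # Assumption: the mean and SD are taken over every kmer for the species.
        "coverage": mean(kmer_counts),
        "coverage_sd": pstdev(kmer_counts),
        "present": len(found) > min_kmers,
    }


# Hypothetical counts for 20 kmers, 3 of which were seen in the reads.
print(summarise_species([10, 12, 8] + [0] * 17))
```

With only two of the twenty kmers found, the same function would report the species as absent.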

Hi Jody,

No worries. Thanks for being so responsive! We would be happy to help with this project. It would be great to add resistance prediction capabilities. At the moment we are focusing on M. abscessus, but we are also interested in other clinically relevant NTM species. We are planning to put together a spreadsheet of M. abscessus drug resistance positions from the literature. Is there a format that would work well for developing drug resistance functionality in NTM-Profiler?

It would be great to have you on board with the development!
I think a format similar to tbdb would be a good start with the gene/mutation/drug combination.

I have seen some cases where wild-type genes can cause resistance and disruption can lead to increased sensitivity (e.g. in M. abscessus subsp. abscessus / erm(41) / macrolides). To detect resistance, the software would have to check for an intact gene. This kind of function was not needed for TB, so I am in the process of adding it in. Others, like point mutations in rrl, can be stored in a similar way to how we do it in TB.
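To make the two kinds of rule concrete, here is a minimal sketch of a gene/mutation/drug table that handles both a point mutation and an "intact gene" check. The column names, the `functional_gene` keyword, the placeholder mutation name, and the function itself are hypothetical; the format eventually used by tbdb or ntmdb may look quite different.

```python
import csv
import io

# Hypothetical rows: one ordinary point-mutation rule and one rule that
# fires when the gene is intact (wild-type), as described for erm(41).
EXAMPLE_DB = """gene,mutation,drug
rrl,placeholder_point_mutation,macrolides
erm(41),functional_gene,macrolides
"""


def predict_resistance(db_text, observed_mutations, disrupted_genes):
    """Return a sorted list of drugs with predicted resistance.

    observed_mutations: set of (gene, mutation) tuples called in the sample.
    disrupted_genes:    set of genes found to be truncated or deleted.
    """
    drugs = set()
    for row in csv.DictReader(io.StringIO(db_text)):
        gene, mutation, drug = row["gene"], row["mutation"], row["drug"]
        if mutation == "functional_gene":
            # Resistance is predicted only while the gene is still intact.
            if gene not in disrupted_genes:
                drugs.add(drug)
        elif (gene, mutation) in observed_mutations:
            drugs.add(drug)
    return sorted(drugs)


# Intact erm(41), no point mutations: resistance predicted.
print(predict_resistance(EXAMPLE_DB, set(), set()))
# Disrupted erm(41), no point mutations: nothing predicted.
print(predict_resistance(EXAMPLE_DB, set(), {"erm(41)"}))
```

Keeping the intact-gene rule as just another row means the same table can hold both rule types, at the cost of a reserved keyword in the mutation column.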

Great! Sorry for the delay in responding; I was out for a few days.

Interesting, about the wild-type genes causing resistance.

For the format, do you think it is better to have a different database file for each species, or would it be better to add a column for the species and have only one database file? I am thinking that the classifier could first identify the species and then, if there are any known resistance variants for that species, report the genotype status of those positions.

No worries!

I think it might be best to start off with having one database file per organism. This can then be grouped with the reference/annotation files.

I've had a go at starting the process. In a similar way to Mtb I've created a separate repo to store the DBs: https://github.com/jodyphelan/ntmdb/

I've added a few mutations for M. abscessus: https://github.com/jodyphelan/ntmdb/blob/main/db/mabs/mabs.csv
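The classify-then-look-up flow discussed above, with one database file per organism, might be wired together roughly as follows. The species-to-name mapping and the function are hypothetical; the `db/<name>/<name>.csv` layout is only inferred from the mabs.csv path linked above.

```python
from pathlib import Path

# Hypothetical mapping from the classifier's species call to a database name.
SPECIES_TO_DB = {"Mycobacterium abscessus": "mabs"}


def resistance_db_path(species, db_root="db"):
    """Return the per-organism database path, or None if no database exists."""
    name = SPECIES_TO_DB.get(species)
    if name is None:
        # No resistance database for this species yet: report species only.
        return None
    return Path(db_root) / name / f"{name}.csv"


print(resistance_db_path("Mycobacterium abscessus"))
print(resistance_db_path("Some other species"))
```

Returning None for species without a database lets the species classification be reported on its own, with resistance calling skipped until a database is added.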

This is great! Thanks for getting started on it!