Interpreting abundance profile
Closed this issue · 5 comments
Hi,
I'm a little confused at how to interpret the -abundance
file generated by MetaCache. In particular, I'm confused at how to interpret the results with -lower
set to subspecies
. In a simple mock I have, if I run MetaCache with -lower
at the default I get Salmonella enterica subsp. enterica reported at 8.9%. However, if I run MetaCache with -lower
set to subspecies
, I get Salmonella enterica subsp. enterica reported at 1.9%.
I'm unclear at why this discrepancy occurs? Which result is generally consider more accurate?
Thanks,
Donovan
Hi,
Related to the above question, I understand that -abundance-per
can be used to redistribute the assignment of reads based on a statistical argument. This seems useful and like it will often be the best estimate at any given rank. As such, is it possible to generate such results at multiple ranks (e.g. species, genus, family, ...). I'm often interested in results at multiple ranks.
Thanks,
Donovan
Hi Donovan,
regarding your first question:
Per default MetaCache considers the top 2 targets (sequences from different genomes) with the most hits in the database for each read. If both hit counts are similar, MetaCache uses the lowest common ancestor (LCA) of the two taxa for classification, which could be the common subspecies in your case. If you set -lowest
to subspecies MetaCache considers the top 2 subspecies with most hits and determines the LCA if the hit counts are close. In your case targets with less hits than the top 2 may come from a different subspecies, but still get a relevant number of hits. These would be discarded on the sequence level (default) but are taken into account on the subspecies level.
You can also use -maxcand
to increase the number of considered candidate targets.
Regarding -abundance-per
on multiple ranks:
It's currently not possible to generate results for multiple ranks in a single run. I would suggest to use the interactive query mode for multiple runs with different -abundance-per
settings. This way the database is loaded only once, which avoids a time consuming part of the query.
Great - thanks.
Hi,
How does one use "interactive query mode"? I can't seem to find documentation regarding this mode of operation.
Thanks.
Nevermind - found it. :)