Scientific notation in abundance file result in rounding errors
donovan-h-parks opened this issue · 5 comments
Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances
flag reports pair counts in scientific notation when numbers get large, e.g.:
# query summary: number of queries mapped per taxon
# rank name taxid number of reads abundance
domain Archaea 439684927 461 0.0130496%
domain Bacteria 609216830 1.05068e+06 29.7417%
This can result in small errors due to rounding. For example, in this case there is really 1050675 Bacterial read pairs, but it gets rounded up to 1050680. While having 5 extra read pairs is minor in terms of the resulting abundance estimates it makes it challenging to track the fait of all reads. In our code, we have a check that the number of input reads is equal to the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of data, but, more generally, not being able to account for all reads is a bit scary.
I imagine the intent is for this profile to produced integers, so am hoping this can be fixed in the next release. Thanks.
This is clearly a bug. I'll look into it.
Thanks André. Much appreciated.
The latest release should fix this issue.
@donovan-h-parks: could you maybe check, if this fixes your outputs?
Hi @muellan,
I'm still encountering this issue in v2.4.2, e.g.:
# query summary: number of queries mapped per taxon
# rank name taxid number of reads abundance
domain Bacteria 81602897 9.96505e+06 13.2788%
domain Archaea 1337977286 13005 0.0173297%
phylum Cyanobacteriota 15887558 3226 0.00429877%
phylum Fibrobacterota 31572563 34 4.53063e-05%
phylum Spirochaetota 39784000 1217 0.0016217%
...
I run MetaCache as follows:
> metacache query my_db P00001898_1.fq.gz P0000189_2.fq.gz -pairfiles -no-map -taxids -lineage -separate-cols -separator -threads 32 -abundances metacache_raw_profile.tsv -lowest subspecies -out metacache_read_classification.log
MetaCache version details:
> metacache info
------------------------------------------------
MetaCache version 2.4.2 (20240311)
database version 20200820
------------------------------------------------
...