muellan/metacache

Scientific notation in abundance file result in rounding errors

Opened this issue · 5 comments

Hi. We've run into a small issue that we are hoping can be fixed in the next release. The abundance profile produced with the -abundances flag reports pair counts in scientific notation when numbers get large, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Archaea 	439684927 	461 	0.0130496%
domain 	Bacteria 	609216830 	1.05068e+06 	29.7417%

This can result in small errors due to rounding. For example, in this case there is really 1050675 Bacterial read pairs, but it gets rounded up to 1050680. While having 5 extra read pairs is minor in terms of the resulting abundance estimates it makes it challenging to track the fait of all reads. In our code, we have a check that the number of input reads is equal to the number of reads in the MetaCache abundance profile (including unclassified). This is just a unit test to ensure our parsing is correct and that no reads are lost during any manipulation of data, but, more generally, not being able to account for all reads is a bit scary.

I imagine the intent is for this profile to produced integers, so am hoping this can be fixed in the next release. Thanks.

This is clearly a bug. I'll look into it.

Thanks André. Much appreciated.

The latest release should fix this issue.

@donovan-h-parks: could you maybe check, if this fixes your outputs?

Hi @muellan,

I'm still encountering this issue in v2.4.2, e.g.:

# query summary: number of queries mapped per taxon
# rank 	name 	taxid 	number of reads 	abundance
domain 	Bacteria 	81602897 	9.96505e+06 	13.2788%
domain 	Archaea 	1337977286 	13005 	0.0173297%
phylum 	Cyanobacteriota 	15887558 	3226 	0.00429877%
phylum 	Fibrobacterota 	31572563 	34 	4.53063e-05%
phylum 	Spirochaetota 	39784000 	1217 	0.0016217%
...

I run MetaCache as follows:

> metacache query my_db P00001898_1.fq.gz P0000189_2.fq.gz -pairfiles -no-map -taxids -lineage -separate-cols -separator      -threads 32 -abundances metacache_raw_profile.tsv -lowest subspecies -out metacache_read_classification.log

MetaCache version details:

> metacache info
------------------------------------------------
MetaCache version    2.4.2 (20240311)
database version     20200820
------------------------------------------------
...