the count matrices don't contain counts
jkobject opened this issue · 3 comments
Report
Hello,
I see that the benchmarks mention taking the "counts" layer from the datasets they use. However, when I look at this layer, the values are floats instead of ints, meaning that they are not counts.
The tool I want to benchmark only takes count matrices.
How should I get the count data?
To verify that they don't contain counts, run:
import scanpy as sc

adata = sc.read(
    "data/lung_atlas.h5ad",
    backup_url="https://figshare.com/ndownloader/files/24539942",
)
# A non-integer total means the layer holds non-integer values.
adata.layers['counts'].sum()
The same holds for the pancreas dataset.
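A more direct check than the total is whether every stored value is a whole number. The following is a minimal sketch of that test, not part of the benchmark code: the helper name `is_integer_valued` is hypothetical, and it assumes the layer is either a NumPy array or a SciPy CSR/CSC matrix that exposes its stored values as `.data`. The toy arrays stand in for a real layer.

```python
import numpy as np


def is_integer_valued(X):
    """Return True if every stored value in X is a whole number."""
    if isinstance(X, np.ndarray):
        values = X
    else:
        # Assumption: anything else is a SciPy CSR/CSC matrix; only its
        # explicitly stored entries need checking, implicit zeros are whole.
        values = X.data
    return bool(np.all(np.mod(values, 1) == 0))


# Toy stand-ins for a counts layer (not the real data).
raw = np.array([[0, 3, 1], [2, 0, 5]], dtype=np.float32)
normalized = np.array([[0, 1.03, 0.5], [2, 0, 5]], dtype=np.float32)

print(is_integer_valued(raw))         # float32 dtype, but whole numbers
print(is_integer_valued(normalized))  # contains decimals
```

On the object loaded above, the call would be `is_integer_valued(adata.layers['counts'])`.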
Just having a float dtype does not imply that they are not count data. Most datasets are stored in a float32 format.
For that particular dataset, I would encourage you to read the original scib paper methods section.
If you're using a tool like scVI, it would technically work on data with decimals, such as 1.03. The question is whether the non-count data are meant to represent count data. For example, pseudoaligners can provide probabilistic count values.
Hello Adam,
Thanks for the reply. I understand that even raw counts are often stored as float32, but here I see that some of the datasets used in this combined dataset have values that are not raw counts (meaning data with decimals).
I have not worked with probabilistic raw counts before. Are you saying that this is the reason why most of the 10x samples have decimal values?
I read the methods section. It says that some datasets were unavailable as raw counts, so RPKM or TPM values were used instead. So the "counts" layer also contains normalized data?
I am not sure how to continue if the data is depth-normalized. I am working with my own model, which assumes that the counts are true counts.
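One way to see which of the combined datasets carry decimal values is to group cells by their dataset label and measure the share of non-integer entries per group. This is a rough sketch under assumptions: the helper name `non_integer_fraction_by_group` is hypothetical, the input is a dense array for simplicity, and the grouping column (for example something like `adata.obs['batch']`) is a guess that should be checked against the actual object.

```python
import numpy as np


def non_integer_fraction_by_group(X, groups):
    """Map each group label to the fraction of its nonzero values that are not whole numbers."""
    X = np.asarray(X)
    groups = np.asarray(groups)
    result = {}
    for label in np.unique(groups):
        block = X[groups == label]   # rows (cells) belonging to this group
        nonzero = block[block != 0]  # zeros say nothing about normalization
        if nonzero.size == 0:
            result[str(label)] = 0.0
        else:
            result[str(label)] = float(np.mean(np.mod(nonzero, 1) != 0))
    return result


# Toy example: rows are cells, columns are genes (not the real data).
X = np.array(
    [[0, 3, 1], [2, 0, 5], [0.5, 1.25, 0], [4, 0.75, 2]],
    dtype=np.float32,
)
batches = ["a", "a", "b", "b"]
fractions = non_integer_fraction_by_group(X, batches)
print(fractions)
```

Groups with a fraction of zero could be kept for a model that requires true counts, while groups with a high fraction are candidates for the RPKM- or TPM-derived datasets. A sparse layer would need to be converted to a dense array first, ideally one group at a time to limit memory use.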