theislab/scib-reproducibility

Data provided do not have the raw counts in the 'counts' layer


Hi,

Thank you for the nice tool and resource! I downloaded the lung and human immune datasets from figshare, but found that the adata objects do not contain raw counts. For example, this is the counts layer of the data I downloaded from https://figshare.com/ndownloader/files/25717328:

adata.layers['counts']
array([[ 0.  ,  0.  ,  0.  , ...,  1.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.  ],
       ...,
       [ 0.  ,  0.  ,  0.  , ..., 54.67,  0.  , 93.26],
       [ 0.  ,  0.  ,  0.  , ..., 14.62,  0.  , 84.9 ],
       [ 0.  ,  0.  ,  0.  , ...,  5.98,  0.  ,  0.  ]], dtype=float32)

The values are not integers. By the way, I got this warning when importing the data:

 OldFormatWarning: Element '/layers/counts' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}

The versions of scanpy and anndata are:

scanpy==1.9.1 anndata==0.8.0

Thank you so much!

Best,
Min

I noticed this about the counts too...

I am not sure exactly what was uploaded to FigShare. We would need to ask @LuckyMD about that but he is unavailable for the next few months.

 OldFormatWarning: Element '/layers/counts' was written without encoding metadata.
  return {k: read_elem(v) for k, v in elem.items()}

This warning appears because the files were written with an older version of anndata, and you are using v0.8.0, which expects a different file format. Reading should be backward compatible though, so there is no need to worry about this.
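If the warning is noisy, it can be silenced while reading. Below is a minimal sketch that uses only the standard library `warnings` module and filters on the message text shown above. The helper name `read_quietly` is hypothetical and is not part of scanpy or anndata.

```python
import warnings


def read_quietly(reader, *args, **kwargs):
    """Call `reader(*args, **kwargs)` while ignoring the old-format warning.

    `read_quietly` is a hypothetical helper for illustration only.
    The filter matches on the warning text, so other warnings still surface.
    """
    with warnings.catch_warnings():
        # The pattern is matched against the start of the warning message.
        warnings.filterwarnings(
            "ignore",
            message=r".*was written without encoding metadata",
        )
        return reader(*args, **kwargs)
```

Usage would look like `adata = read_quietly(sc.read, "data/lung_atlas.h5ad")`, assuming scanpy is imported as `sc`. The filter is only active inside the helper, so warnings raised elsewhere are unaffected.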

Hi @wconnell and @genecell,

Sorry for the late reply here. The reason not all of these are integers is that we use TPMs as "raw counts" for full-length data without UMIs. I believe this is mentioned in the methods section of the paper as well. In the immune dataset, the Villani data were measured using Smart-seq2. We don't have raw read counts for this dataset, but instead use TPMs which are already gene length corrected after alignment. I hope that clarifies things.

Lung data should have integer counts though afaik, as there are no full length data in that task... did you find this issue also for the lung data?
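To narrow down which batches carry non-integer values, a per-batch check along these lines might help. This is a minimal sketch on toy data. The function names and the toy batch labels are made up for illustration, and it assumes the matrix is either a NumPy array or a SciPy sparse matrix.

```python
import numpy as np


def is_integer_valued(matrix):
    """Return True if every entry of `matrix` is a whole number."""
    if hasattr(matrix, "tocsr"):
        # SciPy sparse matrix: implicit zeros are integers already,
        # so only the explicitly stored values need checking.
        values = matrix.tocsr().data
    else:
        values = np.asarray(matrix)
    return bool(np.all(np.mod(values, 1) == 0))


def non_integer_batches(matrix, batch_labels):
    """Return the batch labels whose rows contain non-integer values."""
    batch_labels = np.asarray(batch_labels)
    flagged = []
    for batch in np.unique(batch_labels):
        rows = np.flatnonzero(batch_labels == batch)
        if not is_integer_valued(matrix[rows]):
            flagged.append(str(batch))
    return flagged


# Toy example: two UMI-like cells and two TPM-like cells (hypothetical labels).
toy_counts = np.array(
    [
        [0.0, 1.0, 0.0],
        [2.0, 0.0, 5.0],
        [0.0, 54.67, 93.26],
        [14.62, 0.0, 84.9],
    ],
    dtype=np.float32,
)
toy_batches = ["10X", "10X", "Villani", "Villani"]
print(non_integer_batches(toy_counts, toy_batches))  # ['Villani']
```

On the real object this would be something like `non_integer_batches(adata.layers["counts"], adata.obs["batch"])`, where the `"batch"` column name is an assumption about how the batch annotation is stored.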

Thank you for clarifying @LuckyMD; I found the detail in the Sup Info that Villani was excluded from scran norm b/c only TPM was provided.

I'm not sure about the lung data.

Hi @lazappi @LuckyMD @wconnell, thank you for your responses! Yes, the lung data did not have integer counts either. I imported the data via:

adata = sc.read(
    "data/lung_atlas.h5ad",
    backup_url="https://figshare.com/ndownloader/files/24539942",
)

Integer counts are important because some methods rely on raw count data.
Thank you very much!

Best regards,
Min