How to download and use local soma directory from cellxgene_census.open_soma()?

Question

Alex2975 opened this issue 7 months ago · 7 comments

Dear Authors,

If I want to speed up retrieving cells, can I download only the soma folder, for example with aws s3 sync? Will this work:
aws s3 sync --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/ /tmp/census_soma

If the above works, how should I call open_soma()? Will the following work, where the /tmp/census_soma folder contains the objects from s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/:
with cellxgene_census.open_soma(uri="/tmp/census_soma") as census:

If the above works, should I change the tiledb_config to speed up the IO of getting cells, for example by making the following buffer bigger?
with cellxgene_census.open_soma(tiledb_config={"py.init_buffer_bytes": 128 * 1024**2}) as census:

Thank you so much.

Answer 1 · 2024-06-24T20:52:46.000Z

Yes that should work. You can see our documentation related to opening up a local copy of Census here:

https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_aws_open_data.html#how-to-access-aws-census-data
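
The workflow from the question can be sketched in Python as follows. This is a minimal sketch rather than the documented procedure: the bucket path and the /tmp/census_soma destination are taken from the question, the helper names (sync_command, download_census, open_local_census) are hypothetical, and it assumes the aws CLI and the cellxgene_census package are installed.

```python
import subprocess

# Bucket path and local destination as given in the question.
CENSUS_S3_URI = "s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/"
LOCAL_SOMA_DIR = "/tmp/census_soma"


def sync_command(s3_uri=CENSUS_S3_URI, dest=LOCAL_SOMA_DIR):
    """Build the `aws s3 sync` argument list (hypothetical helper)."""
    # `aws s3 sync` takes a source and a destination path.
    return ["aws", "s3", "sync", "--no-sign-request", s3_uri, dest]


def download_census(s3_uri=CENSUS_S3_URI, dest=LOCAL_SOMA_DIR):
    """Run the sync. Requires the AWS CLI on PATH and enough free disk space."""
    subprocess.run(sync_command(s3_uri, dest), check=True)


def open_local_census(path=LOCAL_SOMA_DIR):
    """Open the local copy through the `uri` argument used in the question."""
    # Imported lazily so the helpers above work without the package installed.
    import cellxgene_census

    return cellxgene_census.open_soma(uri=path)
```

Call download_census() once, then use `with open_local_census() as census:` in place of the remote open.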

in order to speed up the IO of getting cells, should I change the tiledb_config, such as making the following buffer bigger?
with cellxgene_census.open_soma(tiledb_config={"py.init_buffer_bytes": 128 * 1024**2})

I recommend you use the defaults, which keep memory utilization at no more than 1 GB. With a local copy of Census that will work out just fine. Your config sets the buffer to 128 MiB (about 0.1 GB), which is actually pretty low.

I do think increasing the buffer size may offer some advantages for a local copy. If you want to do so, I recommend 8 GB:

{
    "py.init_buffer_bytes": 8 * 1024**3,
    "soma.init_buffer_bytes": 8 * 1024**3,
}
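
To show how that dictionary would be passed along, here is a minimal sketch. The two config keys and the tiledb_config and uri arguments come from this thread; buffer_config and open_with_buffers are hypothetical helper names, and whether 8 GB actually helps on a given machine is untested here.

```python
GIB = 1024**3


def buffer_config(n_bytes):
    """Return a tiledb_config dict that sets both buffer keys from the answer."""
    return {
        "py.init_buffer_bytes": n_bytes,
        "soma.init_buffer_bytes": n_bytes,
    }


def open_with_buffers(path, n_bytes=8 * GIB):
    """Open a local Census copy with larger read buffers (hypothetical helper)."""
    # Imported lazily so buffer_config() works without the package installed.
    import cellxgene_census

    return cellxgene_census.open_soma(uri=path, tiledb_config=buffer_config(n_bytes))


# The value from the question, expressed in GiB for comparison with 8 GiB.
question_value_gib = (128 * 1024**2) / GIB
```

Keeping the size in one place makes it easy to compare the default against a larger buffer on the same query.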

Answer 2 · 2024-06-24T23:32:03.000Z

Thank you very much for the instructions, @pablo-gar.

Answer 3 · 2024-06-25T02:06:14.000Z

@pablo-gar, regarding the normalization for Smart-Seq (feat: the normalized layer should contain gene-length normalized counts from SmartSeq data #813), is it done and available in the latest release (2023-12-15)? Thank you very much.

Answer 4 · 2024-06-25T13:35:21.000Z

@pablo-gar, would you please also comment on where the duplicated cells come from? One possibility I can think of is that the authors submitted the same cells in multiple h5ad files. Could that be the case? Are there other scenarios that could result in duplicated cells? Thank you very much.

Answer 5 · 2024-06-25T21:21:56.000Z

@pablo-gar, regarding the normalization for Smart-Seq (feat: the normalized layer should contain gene-length normalized counts from SmartSeq data #813), is it done and available in the latest release (2023-12-15)? Thank you very much.

No. You can access the normalized layer with that fix in the "latest non-LTS version" of Census data (census_version = "latest"). We will publish the new LTS next week, so you can also wait for that one.

Then in get_anndata() you can use X_name or X_layers to select the layer.
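
As a rough illustration of that call, here is a minimal sketch. The census_version="latest" value and the X_name argument come from the answer above; the layer name "normalized", the "Homo sapiens" organism, and the example filter are assumptions to check against your Census version, and normalized_layer_kwargs and fetch_normalized are hypothetical helper names.

```python
def normalized_layer_kwargs(organism="Homo sapiens", obs_value_filter=None):
    """Build keyword arguments for get_anndata() (hypothetical helper)."""
    # "normalized" is assumed to be the name of the normalized X layer.
    kwargs = {"organism": organism, "X_name": "normalized"}
    if obs_value_filter is not None:
        kwargs["obs_value_filter"] = obs_value_filter
    return kwargs


def fetch_normalized(obs_value_filter):
    """Query the latest non-LTS Census and return an AnnData object."""
    # Imported lazily so the helper above works without the package installed.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version="latest") as census:
        return cellxgene_census.get_anndata(
            census, **normalized_layer_kwargs(obs_value_filter=obs_value_filter)
        )
```

For example, fetch_normalized("assay == 'Smart-seq2'") would request only Smart-seq2 cells, assuming that is how the assay is labeled in obs.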

Answer 6 · 2024-06-25T21:29:04.000Z

@pablo-gar, would you please also comment on where the duplicated cells come from? One possibility I can think of is that the authors submitted the same cells in multiple h5ad files. Could that be the case? Are there other scenarios that could result in duplicated cells? Thank you very much.

The scenarios where that happens are:

Multiple datasets of the same collection contain some level of duplication. For example, Tabula Sapiens has an "All cells" dataset and then datasets per compartment.
Meta-analysis of existing data elsewhere in CELLxGENE. For example, this Azimuth dataset.
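
If duplicates matter for a query, one possible approach is to restrict it to cells flagged as primary data. This is a minimal sketch, assuming the Census obs table exposes a boolean is_primary_data column; that column is not mentioned in this thread, so verify it against the schema of your Census version. primary_only is a hypothetical helper that only builds the filter string.

```python
def primary_only(obs_value_filter=None):
    """Combine an optional obs filter with a primary-data restriction."""
    # Assumes `is_primary_data` is a boolean column in the Census obs table.
    clause = "is_primary_data == True"
    if not obs_value_filter:
        return clause
    return f"({obs_value_filter}) and {clause}"
```

The resulting string could then be passed as the obs_value_filter argument of get_anndata().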

Answer 7 · 2024-06-26T23:12:13.000Z

Great, thank you so much for the insights, @pablo-gar.