single-cell-data/TileDB-SOMA

How to list values for a query filter?

Closed this issue · 2 comments

For example, is it possible to list all the possible diseases or cell types that can be used in the obs_value_filter? Ideally in Python?

@HugoCornu we can put up a sample notebook -- for the moment, though, here's a snippet. The gist: read the obs column into a pandas DataFrame and use .groupby().

import tiledbsoma


def count_obs(exp: tiledbsoma.Experiment, attr_name: str) -> None:
    # Read only the requested obs column, then count rows per distinct value.
    print(
        exp.obs.read(column_names=[attr_name])
        .concat()
        .to_pandas()
        .groupby(attr_name)
        .size()
        .sort_values()
    )

>>> exp = tiledbsoma.Experiment.open(your_uri)
>>> exp.obs.schema
soma_joinid: int64 not null
obs_id: large_string not null
...
cell_type: large_string not null
...
>>> count_obs(exp, 'cell_type')
cell_type
enteroendocrine cell of small intestine            18
paneth cell of epithelium of small intestine       34
transit amplifying cell of small intestine         53
smooth muscle fiber of ileum                       54
mast cell                                          92
endothelial cell of lymphatic vessel               96
pericyte cell                                     121
glial cell                                        175
ileal goblet cell                                 230
progenitor cell                                   382
endothelial cell                                  565
fibroblast                                        571
enterocyte of epithelium proper of ileum          809
innate lymphoid cell                             1382
mononuclear phagocyte                            1635
B cell                                           3183
plasma cell                                      3898
native cell                                      4203
alpha-beta T cell                               14957
dtype: int64
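
Once the values are known, they can be plugged back into a filter string. The following is only a minimal sketch: build_value_filter is a hypothetical helper (not part of tiledbsoma), and it assumes the filter parser accepts Python-style quoted strings as produced by repr(). The commented-out line at the end assumes exp is an open experiment and that read() accepts a value_filter argument.

```python
from typing import Iterable


def build_value_filter(attr_name: str, values: Iterable[str]) -> str:
    """Hypothetical helper: build an `attr in [...]` filter expression.

    Assumption: repr()-style quoting is acceptable to the filter parser.
    """
    quoted = ", ".join(repr(v) for v in values)
    return f"{attr_name} in [{quoted}]"


flt = build_value_filter("cell_type", ["B cell", "plasma cell"])
print(flt)  # cell_type in ['B cell', 'plasma cell']

# With an open experiment (assumption: read() takes value_filter):
# df = exp.obs.read(value_filter=flt).concat().to_pandas()
```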

Thanks for the answer and the code!
I'll try it on cellxgene (~35 million cells).
I was hoping for a solution that does not download too many rows.
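
One way to keep memory bounded on a dataset of that size might be to tally the values batch by batch instead of calling .concat(). The sketch below is not from the thread: count_values and count_obs_chunked are hypothetical names, and the second function assumes that iterating over exp.obs.read(...) yields pyarrow Tables. Note that this still scans the whole column; it limits what is held in memory at once, not how much is read.

```python
from collections import Counter
from typing import Iterable


def count_values(chunks: Iterable[Iterable[str]]) -> Counter:
    """Tally values across chunks, consuming one chunk at a time."""
    counts: Counter = Counter()
    for chunk in chunks:
        counts.update(chunk)
    return counts


def count_obs_chunked(exp, attr_name: str) -> Counter:
    """Hypothetical helper: tally one obs column batch by batch.

    Assumes `exp` is an open tiledbsoma.Experiment and that iterating
    over exp.obs.read(...) yields pyarrow Tables.
    """
    tables = exp.obs.read(column_names=[attr_name])
    return count_values(t.column(attr_name).to_pylist() for t in tables)


# Stand-in data so the tally logic can be checked without a SOMA experiment.
demo = count_values([["B cell", "mast cell"], ["B cell"]])
print(sorted(demo.items()))  # [('B cell', 2), ('mast cell', 1)]
```

The keys of the returned Counter are the distinct values, so list(count_obs_chunked(exp, 'cell_type')) would give the candidates for a filter.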