Teichlab/celltypist

How can I subset the model to a select few cell types/clusters before training the model?

yojetsharma opened this issue · 6 comments

I am using Human_Developing_Brain.pkl as the model to annotate my query dataset. However, I am only interested in a select few cell types/clusters. Is there a function to subset the model to those clusters?
Thank you!

@yojetsharma, please refer to this question #128

I did try that:

>>> ref
CellTypist model with 129 cell types and 1000 features
    date: 2022-10-29 21:02:53.713593
    details: cell types from the first-trimester developing human brain
    source: https://doi.org/10.1126/science.adf1226
    version: v1
    cell types: Brain erythrocytes, Brain fibroblasts, ..., Ventral midbrain radial glia
    features: VWA1, HES5, ..., BGN
>>> celltypist.samples.downsample_adata(ref, n_cells=1000, by=(cell_types['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC']), mode='each', return_index=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'cell_types' is not defined
>>> celltypist.samples.downsample_adata(ref, n_cells=1000, by=['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'], mode='each', return_index=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/envs/scarches/lib/python3.9/site-packages/celltypist/samples.py", line 89, in downsample_adata
    celltypes = np.unique(adata.obs[by])
AttributeError: 'Model' object has no attribute 'obs'

@yojetsharma, if you are just selecting a subset of cell types, subset the AnnData directly:

adata = adata[adata.obs.cell_types.isin(['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'])].copy()

Right, thanks for this. Does the following error mean the package is installed incorrectly on my end?

>>> ref_adata=ref[ref.cell_types.isin(['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'])].copy()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'isin'

I tried Boolean indexing on the reference model (downloaded from the CellTypist models), since model.cell_types is a plain NumPy array:

selected_cell_types = [
    'Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 
    'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 
    'Telencephalon glioblast', 'Telencephalon neuron', 
    'Telencephalon radial glia', 'Telencephalon neuroblast', 
    'Telencephalon neuronal IPC'
]

# Create a Boolean mask
mask = np.isin(adata.cell_types, selected_cell_types)

# Subset the AnnData object using the mask
ref= adata[mask].copy()

But the above still didn't work, most likely because the model is not an AnnData object. Does this mean I will need to download the reference data from the source, train a model on it myself, and then use that model in CellTypist?
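A small sketch of why the mask cannot select cells: `model.cell_types` holds one entry per cell type, so `np.isin` on it yields a mask over labels, not over cells. The array below is a shortened stand-in for `model.cell_types` (an assumption for illustration, not the real 129-entry list). The same check is still useful for confirming that the names you want are spelled the way the model spells them.

```python
import numpy as np

# Stand-in for `model.cell_types`; the real model lists 129 cell types.
model_cell_types = np.array([
    'Brain erythrocytes', 'Forebrain neuron',
    'Forebrain OPC', 'Ventral midbrain radial glia',
])
selected = ['Forebrain neuron', 'Forebrain OPC', 'Telencephalon neuron']

# The mask has one entry per cell type, so it cannot index cells.
mask = np.isin(model_cell_types, selected)
known = model_cell_types[mask].tolist()

# Names requested but absent from the (stand-in) label array.
unknown = sorted(set(selected) - set(model_cell_types.tolist()))

print(mask.shape, known, unknown)
```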

@yojetsharma, subsetting the model itself is not possible. You need to subset your AnnData and re-train the model using celltypist.train
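The suggestion above could look roughly like the following sketch. It assumes you already have the reference data as an AnnData object with cell type labels in an `obs` column (called `cell_types` here, as in the earlier reply); the helper names `cell_type_mask` and `retrain_on_subset` are hypothetical and not part of CellTypist. Only `celltypist.train` with its `labels` argument is taken from the library, and by default it expects log-normalised expression.

```python
import numpy as np
import pandas as pd


def cell_type_mask(obs: pd.DataFrame, label_key: str, selected: list) -> np.ndarray:
    """Return a per-cell Boolean mask for the selected cell types (hypothetical helper)."""
    present = set(obs[label_key].unique().tolist())
    missing = sorted(set(selected) - present)
    if missing:
        # Fail early on misspelled or absent labels.
        raise ValueError(f"Labels not found in obs['{label_key}']: {missing}")
    return obs[label_key].isin(selected).to_numpy()


def retrain_on_subset(adata, label_key: str, selected: list):
    """Subset the reference AnnData and train a new model on it (hypothetical helper)."""
    import celltypist  # imported lazily so the mask helper works without it

    mask = cell_type_mask(adata.obs, label_key, selected)
    subset = adata[mask].copy()
    # Train on the subset, using the obs column as cell type labels.
    return celltypist.train(subset, labels=label_key)
```

With a reference AnnData called, say, `ref_adata`, the call would be `model = retrain_on_subset(ref_adata, 'cell_types', selected_cell_types)`, and the resulting model can then be used to annotate the query dataset in place of the downloaded one.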