How can I subset the model to a select few cell types/clusters before training the model?
yojetsharma opened this issue · 6 comments
I am using Human_Developing_Brain.pkl as the model to annotate my query dataset. However, I am only interested in a select few cell types/clusters. Is there a function to subset the model to those clusters?
Thank you!
@yojetsharma, please refer to this question #128
I did try that:
>>> ref
CellTypist model with 129 cell types and 1000 features
date: 2022-10-29 21:02:53.713593
details: cell types from the first-trimester developing human brain
source: https://doi.org/10.1126/science.adf1226
version: v1
cell types: Brain erythrocytes, Brain fibroblasts, ..., Ventral midbrain radial glia
features: VWA1, HES5, ..., BGN
>>> celltypist.samples.downsample_adata(ref, n_cells=1000, by=(cell_types['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC']), mode='each', return_index=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'cell_types' is not defined
>>> celltypist.samples.downsample_adata(ref, n_cells=1000, by=['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'], mode='each', return_index=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/miniconda3/envs/scarches/lib/python3.9/site-packages/celltypist/samples.py", line 89, in downsample_adata
celltypes = np.unique(adata.obs[by])
AttributeError: 'Model' object has no attribute 'obs'
@yojetsharma, if you are just selecting a subset of cell types, just use adata = adata[adata.obs.cell_types.isin(['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'])].copy()
Right, thanks for this. But does the following mean the package is installed incorrectly on my end?
>>> ref_adata=ref[ref.cell_types.isin(['Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast', 'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC', 'Forebrain neuroblast', 'Telencephalon glioblast', 'Telencephalon neuron', 'Telencephalon radial glia', 'Telencephalon neuroblast', 'Telencephalon neuronal IPC'])].copy()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'isin'
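For readers hitting the same error: it does not point to a broken installation. `.isin` is a pandas method, so it works on `adata.obs.cell_types` (a pandas Series) but not on a model's `cell_types`, which is a NumPy array. The sketch below uses small stand-in arrays rather than the real model or dataset, and the label values are only illustrative. `np.isin` is the NumPy equivalent, though it only tells you which labels the model knows; it does not subset the model.

```python
import numpy as np
import pandas as pd

selected = ["Forebrain neuron", "Forebrain OPC"]

# Stand-in for adata.obs.cell_types: a pandas Series, which has .isin().
obs_labels = pd.Series(["Forebrain neuron", "Brain fibroblasts", "Forebrain OPC"])
series_mask = obs_labels.isin(selected).to_numpy()

# Stand-in for model.cell_types: a NumPy array, which has no .isin() method.
model_labels = np.array(["Brain fibroblasts", "Forebrain neuron", "Forebrain OPC"])
has_isin = hasattr(model_labels, "isin")

# np.isin is the array-level equivalent of Series.isin.
array_mask = np.isin(model_labels, selected)

print(series_mask.tolist())
print(has_isin)
print(array_mask.tolist())
```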
I tried Boolean indexing on the reference model (downloaded from the CellTypist models), since model.cell_types is a plain NumPy array:
selected_cell_types = [
'Forebrain neuroblast', 'Forebrain neuronal IPC', 'Forebrain glioblast',
'Forebrain neuron', 'Forebrain radial glia', 'Forebrain OPC',
'Telencephalon glioblast', 'Telencephalon neuron',
'Telencephalon radial glia', 'Telencephalon neuroblast',
'Telencephalon neuronal IPC'
]
# Create a Boolean mask
mask = np.isin(adata.cell_types, selected_cell_types)
# Subset the AnnData object using the mask
ref = adata[mask].copy()
But the above still didn't work, most likely because the model is not an AnnData object. Does this mean I will need to download the data for this model from the source, build the model myself, and then use it in CellTypist?
@yojetsharma, it is not possible to subset a trained model. You need to subset your AnnData object (the reference dataset, not the model) and re-train the model using celltypist.train
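A minimal sketch of that workflow, under several assumptions: the reference expression data has already been loaded as an AnnData object, its labels live in an `obs` column called `cell_types` (the name used earlier in this thread; the real column may differ), and the expression matrix is in the log-normalized form that `celltypist.train` checks for by default. The helper names `subset_by_labels` and `retrain_on_subset` and the output path are hypothetical, not part of CellTypist.

```python
from typing import Iterable


def subset_by_labels(adata, label_key: str, keep: Iterable[str]):
    """Return a copy of `adata` restricted to cells whose label is in `keep`."""
    keep = list(dict.fromkeys(keep))  # drop duplicate names, preserve order
    labels = adata.obs[label_key]
    known = set(labels)
    missing = [name for name in keep if name not in known]
    if missing:
        # Fail early on misspelled or absent cell type names.
        raise ValueError(f"Labels not found in adata.obs[{label_key!r}]: {missing}")
    return adata[labels.isin(keep).to_numpy()].copy()


def retrain_on_subset(adata, label_key: str, keep: Iterable[str], out_path: str):
    """Train a new CellTypist model on the selected cell types only."""
    import celltypist  # imported lazily so the helper above works without it

    subset = subset_by_labels(adata, label_key, keep)
    # feature_selection=True asks CellTypist to pick informative genes itself.
    model = celltypist.train(subset, labels=label_key, feature_selection=True)
    model.write(out_path)  # save the model for later annotation runs
    return model
```

Usage would look like `retrain_on_subset(ref_adata, "cell_types", ["Forebrain neuron", "Forebrain OPC"], "forebrain_subset.pkl")`, after which the saved file can be passed as `model=` to `celltypist.annotate`. A model trained this way can only assign the labels it was trained on, so query cells of other types will still receive one of the selected labels.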