Custom PertData ``new_data_process`` error
Yonggie opened this issue · 1 comments
According to the custom data tutorial:
(2) Create your own Perturb-Seq data
Prepare a scanpy adata object with
adata.obs dataframe has condition and cell_type columns, where condition is the perturbation name for each cell. Control cells have condition format of ctrl, single perturbation has condition format of A+ctrl or ctrl+A, combination perturbation has condition format of A+B.
adata.var dataframe has gene_name column, where each gene name is the gene symbol.
adata.X stores the post-perturbed gene expression.
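The condition format quoted above can be checked mechanically. The following is a minimal sketch that only encodes the rules as written in the tutorial text; `classify_condition` is a hypothetical helper, not part of GEARS, and it judges the label format only, not whether the names are real genes.

```python
def classify_condition(label, control="ctrl"):
    """Classify a condition label by the format the tutorial describes.

    Hypothetical helper for illustration; not part of the GEARS API.
    """
    parts = label.split("+")
    if parts == [control]:
        return "control"
    # A valid perturbation label has exactly two non-empty parts.
    if len(parts) == 2 and all(parts):
        n_ctrl = parts.count(control)
        if n_ctrl == 1:
            return "single"       # A+ctrl or ctrl+A
        if n_ctrl == 0:
            return "combination"  # A+B
    return "invalid"


examples = ["ctrl", "A+ctrl", "ctrl+A", "A+B", "nan", "*", "BHLHE40+pDS258"]
for label in examples:
    print(label, "->", classify_condition(label))
```

With a real object, the same function could be applied to `adata.obs['condition'].unique()`. Note that a label such as `BHLHE40+pDS258` passes the format check as a combination even if it was not meant as one.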
custom data
dataset download: https://zenodo.org/records/7041849/files/AdamsonWeissman2016_GSM2406675_10X001.h5ad?download=1
data
- adata.obs.columns.values:
['perturbation', 'read count', 'UMI count', 'tissue_type', 'cell_line', 'cancer', 'disease', 'perturbation_type', 'celltype', 'organism', 'ncounts', 'ngenes', 'percent_mito', 'percent_ribo', 'nperts']
- adata.var.columns.values:
['ensembl_id', 'ncounts', 'ncells']
processing code
import scanpy
adata=scanpy.read_h5ad('./AdamsonWeissman2016_GSM2406675_10X001.h5ad')
# modifications:
# 1. adata.obs['perturbation']: gene_compound => gene+compound
adata.obs['perturbation'] = adata.obs['perturbation'].str.replace('_', '+')
adata.obs.rename(columns={'perturbation': 'condition'}, inplace=True)
# 2. adata.obs['celltype'] => cell_type
adata.obs.rename(columns={'celltype': 'cell_type'}, inplace=True)
# 3. adata.var ensembl_id => gene_name
adata.var.rename(columns={'ensembl_id': 'gene_name'}, inplace=True)
# condition should be in type str
adata.obs['condition']=adata.obs['condition'].astype(str)
# pert_data is assumed to be an existing gears.PertData instance, e.g. PertData('./data')
pert_data.new_data_process(dataset_name = 'AdW1', adata = adata)
error:
ValueError: reference = lymphoblasts_ctrl_1 needs to be one of groupby = ['lymphoblasts_62(mod)+pBA581_1+1', 'lymphoblasts_*_1', 'lymphoblasts_BHLHE40+pDS258_1+1', 'lymphoblasts_CREB1+pDS269_1+1', 'lymphoblasts_DDIT3+pDS263_1+1', 'lymphoblasts_EP300+pDS268_1+1', 'lymphoblasts_SNAI1+pDS266_1+1', 'lymphoblasts_SPI1+pDS255_1+1', 'lymphoblasts_ZNF326+pDS262_1+1', 'lymphoblasts_nan_1']
Besides condition, cell_type, gene_name, and X, what other preprocessing is required?
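To see what the error is pointing at, here is a rough sketch that pulls the condition labels back out of the group names in the message. It assumes each group name has the layout `<cell_type>_<condition>_<dose>`; that layout is inferred from the error text only, not confirmed from the GEARS source, and `condition_of` is a hypothetical helper.

```python
# Group names copied from the ValueError above.
groups = [
    "lymphoblasts_62(mod)+pBA581_1+1",
    "lymphoblasts_*_1",
    "lymphoblasts_BHLHE40+pDS258_1+1",
    "lymphoblasts_CREB1+pDS269_1+1",
    "lymphoblasts_DDIT3+pDS263_1+1",
    "lymphoblasts_EP300+pDS268_1+1",
    "lymphoblasts_SNAI1+pDS266_1+1",
    "lymphoblasts_SPI1+pDS255_1+1",
    "lymphoblasts_ZNF326+pDS262_1+1",
    "lymphoblasts_nan_1",
]
cell_type = "lymphoblasts"


def condition_of(group_key, cell_type):
    """Strip the assumed '<cell_type>_' prefix and '_<dose>' suffix."""
    rest = group_key[len(cell_type) + 1:]
    condition, _dose = rest.rsplit("_", 1)
    return condition


conditions = [condition_of(g, cell_type) for g in groups]

# The reference group in the error is 'lymphoblasts_ctrl_1', so a 'ctrl'
# condition is expected among the labels.
print("ctrl" in conditions)                      # False
print([c for c in conditions if "+" not in c])   # ['*', 'nan']
```

Under that assumption, no cell has the condition `ctrl`, which the tutorial requires for control cells, and two labels (`*` and `nan`) do not follow the `A+ctrl` / `A+B` format at all. The `nan` label is what `astype(str)` produces for missing values.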
I am also having a similar issue with this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE216595