snap-stanford/GEARS

Custom PertData ``new_data_process`` error

Yonggie opened this issue · 1 comments

According to the custom data tutorial:

(2) Create your own Perturb-Seq data
Prepare a scanpy adata object with:
  • adata.obs dataframe has condition and cell_type columns, where condition is the perturbation name for each cell. Control cells have condition format of ctrl, single perturbation has condition format of A+ctrl or ctrl+A, combination perturbation has condition format of A+B.
  • adata.var dataframe has gene_name column, where each gene name is the gene symbol.
  • adata.X stores the post-perturbed gene expression.
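The condition format quoted above can be checked before calling GEARS. The following is a minimal sketch, not part of GEARS: `is_valid_condition` and `check_conditions` are hypothetical helper names, and the check is purely syntactic, so it cannot tell whether both halves of an `A+B` label are real gene names.

```python
def is_valid_condition(label):
    """True if label is 'ctrl', 'A+ctrl', 'ctrl+A' or 'A+B' (syntax only)."""
    if label == "ctrl":
        return True
    parts = label.split("+")
    # Exactly two non-empty parts, and not the degenerate 'ctrl+ctrl'.
    return len(parts) == 2 and all(parts) and parts != ["ctrl", "ctrl"]


def check_conditions(conditions):
    """Return (has_ctrl, bad_labels) for an iterable of condition labels."""
    labels = {str(c) for c in conditions}
    bad_labels = sorted(label for label in labels if not is_valid_condition(label))
    return "ctrl" in labels, bad_labels


# Usage with an AnnData object (assumes adata.obs['condition'] exists):
# has_ctrl, bad_labels = check_conditions(adata.obs['condition'].unique())
```

If `has_ctrl` is false, there is no control group for the perturbed cells to be compared against.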

custom data

dataset download: https://zenodo.org/records/7041849/files/AdamsonWeissman2016_GSM2406675_10X001.h5ad?download=1

data

  • adata.obs.columns.values: ['perturbation', 'read count', 'UMI count', 'tissue_type', 'cell_line', 'cancer', 'disease', 'perturbation_type', 'celltype', 'organism', 'ncounts', 'ngenes', 'percent_mito', 'percent_ribo', 'nperts']
  • adata.var.columns.values: ['ensembl_id', 'ncounts', 'ncells']

processing code

import scanpy
from gears import PertData

adata = scanpy.read_h5ad('./AdamsonWeissman2016_GSM2406675_10X001.h5ad')

# modifications:
# 1. adata.obs['perturbation']: gene_compound => gene+compound
adata.obs['perturbation'] = adata.obs['perturbation'].str.replace('_', '+')
adata.obs.rename(columns={'perturbation': 'condition'}, inplace=True)
# 2. adata.obs['celltype'] => cell_type
adata.obs.rename(columns={'celltype': 'cell_type'}, inplace=True)
# 3. adata.var: ensembl_id => gene_name
adata.var.rename(columns={'ensembl_id': 'gene_name'}, inplace=True)

# condition should be of type str (note: missing values become the string 'nan')
adata.obs['condition'] = adata.obs['condition'].astype(str)

# the save folder './data' is an assumption; it was not given in the original snippet
pert_data = PertData('./data')
pert_data.new_data_process(dataset_name = 'AdW1', adata = adata)
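One thing that may be worth checking in the processing code: the tutorial asks for gene symbols in gene_name, while the column renamed here is ensembl_id. Whether this has anything to do with the error reported below is not established. A minimal sketch of a check, where `looks_like_ensembl_ids` is a hypothetical helper and the 0.5 threshold is an arbitrary choice:

```python
import re

# Ensembl gene IDs look like ENSG00000141510 or ENSMUSG00000059552,
# optionally followed by a version suffix such as ".12".
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def looks_like_ensembl_ids(names, threshold=0.5):
    """True if at least `threshold` of the names match the Ensembl gene ID pattern."""
    names = [str(n) for n in names]
    if not names:
        return False
    hits = sum(bool(ENSEMBL_GENE_ID.match(n)) for n in names)
    return hits / len(names) >= threshold


# Usage (assumes adata.var['gene_name'] exists):
# if looks_like_ensembl_ids(adata.var['gene_name']):
#     print("gene_name holds Ensembl IDs, not gene symbols")
```

If the check fires and the gene symbols are stored in the index of adata.var (an assumption about this particular file), setting gene_name from adata.var_names would match the tutorial more closely.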

error:

ValueError: reference = lymphoblasts_ctrl_1 needs to be one of groupby = ['lymphoblasts_62(mod)+pBA581_1+1', 'lymphoblasts_*_1', 'lymphoblasts_BHLHE40+pDS258_1+1', 'lymphoblasts_CREB1+pDS269_1+1', 'lymphoblasts_DDIT3+pDS263_1+1', 'lymphoblasts_EP300+pDS268_1+1', 'lymphoblasts_SNAI1+pDS266_1+1', 'lymphoblasts_SPI1+pDS255_1+1', 'lymphoblasts_ZNF326+pDS262_1+1', 'lymphoblasts_nan_1']
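Judging only from the error message, the reference group lymphoblasts_ctrl_1 is missing because no cell has a condition exactly equal to ctrl. The groups also include * and nan, the latter being what astype(str) produces from missing values. Below is a minimal sketch of one possible relabeling, applied to the raw perturbation column instead of the plain underscore replacement. It rests on assumptions that the source does not confirm: that the token before the underscore is the target gene, that unlabeled cells (* or missing) can be dropped, and that the caller knows which raw label marks control cells. `to_gears_condition` and its `control_labels` parameter are hypothetical names, not part of GEARS.

```python
import math


def to_gears_condition(label, control_labels=frozenset()):
    """Map one raw perturbation label to a GEARS-style condition.

    Returns 'ctrl' for control labels, 'GENE+ctrl' for single perturbations,
    and None for cells that carry no usable label.
    """
    if label is None or (isinstance(label, float) and math.isnan(label)):
        return None
    label = str(label).strip()
    if label in ("", "nan", "*"):
        return None
    if label in control_labels:
        return "ctrl"
    # Assumption: the raw label is '<gene>_<guide id>'; keep the gene only.
    gene = label.split("_")[0]
    return f"{gene}+ctrl"


# Usage on the raw column (the control label is a placeholder to fill in
# from the dataset's own documentation):
# raw = adata.obs['perturbation'].astype(object)
# mapped = raw.map(lambda x: to_gears_condition(x, control_labels={'<control label>'}))
# keep = mapped.notna().values
# adata = adata[keep].copy()
# adata.obs['condition'] = mapped[keep].values
```

With a ctrl group present and single perturbations written as GENE+ctrl, the labels would at least match the format the tutorial describes.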

Apart from condition, cell_type, gene_name, and X, what other preprocessing is required?

I am also having a similar issue with this dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE216595
