prabhakarlab/Banksy

About data preprocessing

Closed this issue · 2 comments

When we loaded dlpfc151673 in BANKSY, we found that the expression matrix of this object includes 7319 genes and 3639 spots, but your article only mentions selecting the top 2000 variable genes. We are very interested in your work; could you describe in detail how your data preprocessing is done? Thank you in advance.

Hi @huoyuying,

Thank you for your interest in our work.

Subject-level analysis

The datasets included with the R package serve to demonstrate subject-level analysis, where BANKSY is run on multiple slices from the same subject and the spots from those slices are clustered jointly. This was not included in the preprint, but the vignette for this analysis is presented here.

Here, we used 3000 variable features for each slice, then took the union of these features, obtaining 7319 features in total. The description of the data and its processing can be found here.

To find variable features for each slice, we used Seurat's implementation of the variance-stabilizing transformation (vst). Here is the code in detail:

library(Seurat)

# Read data stored in .rds - gene-cell matrix ('sample' is the slice identifier)
counts <- readRDS(paste0('../data/expression-', sample, '.rds'))

# Normalize to rounded median library size with relative counts (RC) normalization
lib_size <- ceiling(median(colSums(counts)) / 1000) * 1000
seu <- CreateSeuratObject(counts = counts)
seu <- NormalizeData(seu, normalization.method = 'RC', scale.factor = lib_size)

# Obtain variable features with vst ('nfeat' is 3000 per slice here)
seu <- FindVariableFeatures(seu, selection.method = 'vst', nfeatures = nfeat)
vars <- VariableFeatures(seu)
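
For completeness, the union across slices (which is what yields the 7319 features in the packaged object) can be taken as in the sketch below. This is a minimal illustration only: the slice identifiers are placeholders, and it assumes the per-slice expression files follow the same naming pattern as above.

# Hypothetical slice IDs for one subject - substitute the actual sample names
samples <- c('sliceA', 'sliceB', 'sliceC', 'sliceD')

# Top 3000 variable features per slice (same processing as above), then their union
var_list <- lapply(samples, function(sample) {
  counts <- readRDS(paste0('../data/expression-', sample, '.rds'))
  lib_size <- ceiling(median(colSums(counts)) / 1000) * 1000
  seu <- CreateSeuratObject(counts = counts)
  seu <- NormalizeData(seu, normalization.method = 'RC', scale.factor = lib_size)
  seu <- FindVariableFeatures(seu, selection.method = 'vst', nfeatures = 3000)
  VariableFeatures(seu)
})
union_feat <- Reduce(union, var_list)  # 7319 features for the packaged DLPFC data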

Preprint analysis

In the preprint, we presented an analysis in which BANKSY was run on each slice individually with 2000 variable features, as that was most comparable to what most other methods did.

Here is the code in detail, including data processing and clustering:

library(Banksy)

# Read data stored in .rds - gene-cell matrix and spot locations
counts <- readRDS(paste0('../data/expression-', sample, '.rds'))
locs <- readRDS(paste0('../data/locations-', sample, '.rds'))
bank <- BanksyObject(own.expr = counts, cell.locs = locs)

# Process data and subset to top 2000 variable features
lib_size <- ceiling(median(colSums(counts)) / 1000) * 1000
bank <- NormalizeBanksy(bank, normFactor = lib_size)
bank <- ComputeBanksy(bank, spatialMode = 'kNN_r', k_geom = 18)
bank <- ScaleBanksy(bank)
var_feat <- readRDS(paste0('../var_features/', sample, '.rds'))
bank <- SubsetBanksy(bank, features = var_feat)

# Cluster on 30 PCs with Leiden over a grid of k.neighbors and resolution values
bank <- RunPCA(bank, lambda = 0.15, npcs = 30)
set.seed(1000)
bank <- ClusterBanksy(bank, lambda = 0.15, method = 'leiden',
                      k.neighbors = seq(30, 100, by = 10),
                      resolution = seq(0.3, 1.2, by = 0.1))
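
A side note on the clustering call: passing vectors for k.neighbors and resolution corresponds to a parameter sweep, so, assuming each combination of values is clustered separately, the call above amounts to 8 × 10 = 80 Leiden runs. A quick base R check of the grid size (no Banksy calls involved):

# Sizes of the parameter grids passed to ClusterBanksy above
k_grid   <- seq(30, 100, by = 10)    # 8 neighbourhood sizes
res_grid <- seq(0.3, 1.2, by = 0.1)  # 10 Leiden resolutions
nrow(expand.grid(k.neighbors = k_grid, resolution = res_grid))  # 80 combinations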

Hope this clarifies.

Thanks for your reply; it worked for me. @jleechung