smorabit/hdWGCNA

Question about K value in MetacellsByGroups

RafaellaFerraz opened this issue · 1 comments

Hi! Thank you for creating an awesome package!

I have the following situation in metacell construction: with k = 20, I get 1074 metacells in groupA and 242 in groupB. With k = 15, I get 2602 metacells in groupA and 1673 in groupB. For a consensus network, is it better to keep a closer balance of metacells between groups (k = 15) or to use the larger k (k = 20)?

Regarding the reduction, is it preferable to use the CCA reduction (which removes the batch effect) or PCA?

Are these parameters OK?

seuratobj_meta <- hdWGCNA::MetacellsByGroups(
 seurat_obj = seuratobj,
 group.by = c("recluster_0.6", "orig.ident", "source_name"),
 reduction = 'cca', 
 k = 15,
 ident.group = 'recluster_0.6',
 assay = "RNA",
 layer = "counts",
 min_cells = 50,
 max_shared = 10, 
 target_metacells = 1000,
 verbose = TRUE,
 wgcna_name = "wcgna_mCRPC")

Hi, thanks for your interest in hdWGCNA.

This is a good question about dataset imbalance, which I think applies broadly in data science and bioinformatics. I have not tested how imbalanced datasets affect the downstream hdWGCNA analysis, because in my own studies we have typically tried to balance our conditions from the start of the experimental design. However, I understand that for various reasons this will not always be possible. I don't know for sure, but I would guess that dataset imbalance would affect your downstream analysis with hdWGCNA.

As you suggested, you can select the k value that gives you a better balance. Alternatively, you can downsample your metacells after they have been constructed so that every group has the same number. In this example I construct metacells by biological Sample and then downsample so that each Sample has the same number of metacells.



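# load the packages used below (dplyr provides %>% and the grouping verbs)
library(Seurat)
library(hdWGCNA)
library(dplyr)

# compute metacells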
seurat_obj <- MetacellsByGroups(
    seurat_obj = seurat_obj,
    group.by = c("Timepoint", "Patient", "Sample"),
    k = 50, 
    max_shared=5,
    ident.group = "Sample",
    reduction='pca',
    target_metacells=1000,
    min_cells = 100
)

# get the metacell object from the Seurat obj
m_obj <- GetMetacellObject(seurat_obj)
m_obj$metacell_id <- colnames(m_obj)

# downsample every group to the size of the smallest group
n_downsample <- min(table(m_obj$Sample))

# get the list of metacells to keep by sampling each group separately
# (call set.seed() beforehand if you need the same sample on every run)
metacells_downsample <- m_obj@meta.data %>%
    group_by(Sample) %>%
    slice_sample(n = n_downsample) %>%
    ungroup() %>%
    pull(metacell_id)

# subset the metacell object
m_obj <- m_obj[,metacells_downsample]

# store the downsampled metacell object back in the Seurat object
seurat_obj <- SetMetacellObject(seurat_obj, m_obj)


You should be able to proceed with the rest of the hdWGCNA pipeline using the downsampled metacells.
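
If it helps to check the sampling logic on its own, here is a minimal base R sketch of the same per-group downsampling on a toy metadata table, with no Seurat object needed. The data frame `meta`, the group names, and the group sizes are made up for illustration; only the column names `metacell_id` and `Sample` mirror the example above.

```r
# Toy stand-in for m_obj@meta.data: 12 metacells in group A, 5 in group B.
# (Group names and sizes are hypothetical.)
meta <- data.frame(
  metacell_id = paste0("mc_", 1:17),
  Sample = rep(c("A", "B"), times = c(12, 5)),
  stringsAsFactors = FALSE
)

set.seed(1)  # make the random draw reproducible

# size of the smallest group
n_downsample <- min(table(meta$Sample))

# draw n_downsample metacell IDs from each group separately
keep <- unlist(
  lapply(split(meta$metacell_id, meta$Sample), sample, size = n_downsample),
  use.names = FALSE
)

# keep only the sampled metacells; every group now has the same count
balanced <- meta[meta$metacell_id %in% keep, ]
table(balanced$Sample)
```

With a real metacell object, `keep` plays the role of `metacells_downsample`. Setting a seed before sampling is optional, but it makes the downsampled set reproducible between runs.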