Doubt in terminology

Question

Closed this issue a year ago · 2 comments

Hey, thanks for the wonderful code and paper.

In the muscat paper and code, when we say "sample", do we mean different batches of the same underlying starting material? So when we simulate 2 samples and 4 clusters, do we get two different batches of PBMCs that contain 4 clusters each, i.e., data with batch effects? Or do we get, say, PBMCs as one sample and BM as another sample, with 4 clusters each?

Answer 1 · 2023-06-02T10:06:06.000Z


library(muscat)
data(example_sce)
sce_preped <- prepSim(example_sce, verbose = TRUE)

I am using this code example to generate the simulated data. What is the underlying true dataset? When I run this snippet, I see that the SCE object contains two samples labeled ctrl. Does that mean the data was simulated from the control group, and hence these are two technical replicates of the same starting material?

When I plot a PCA of the ground-truth log transcription quotients (ltqs) of each cluster, I see that some of them overlap. Is there an explanation for this, or is it expected behavior of the simulation method? For a simulation with two samples and 4 clusters, there are just 8 unique ltqs.

Answer 2 · 2023-07-26T13:33:32.000Z

I'm not sure I understand the "issue" completely, but will try to provide an answer as follows:

- The "starting material" is a set of reference samples that are assumed to be biological or technical replicates, but without differences in experimental condition. E.g., these could be from different patients or different days of measurement. So the interpretation of what type of batches they are depends on the input.
- The clusters are taken as is, i.e., according to the annotation provided.
- The simulation will "replicate" the samples and clusters by estimating gene/cell parameters, sampling, and simulating counts from a NB.
- The only "new" thing is the introduction of differential signal across groups for some genes. I.e., if there are 4 reference samples, 2 would be assigned to group A and 2 to group B. These would get library sizes and baseline expression values according to the reference, but some genes would get an additional "artificial" signal to introduce subpopulation-specific changes across groups.
- If there are too few reference samples, e.g., 2 reference samples to generate 4 simulated samples, they will be used multiple times (there's a parameter controlling this), which might introduce unwanted artefacts. Similarly, when paired = TRUE, one reference sample will be used for both groups in order to mimic a paired design.
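
To make these steps concrete, here is a minimal base-R sketch of the idea. It is not muscat's actual implementation: the toy reference parameters (`ref_means`, `dispersion`), the helper `simulate_cells`, and the choice of DE genes, cluster and fold change are all hypothetical and only illustrate how reference samples could be reused across groups and how an artificial signal could be added before drawing NB counts.

```r
set.seed(1)

# Hypothetical toy reference: 2 reference samples, 4 clusters, 50 genes.
n_genes <- 50
ref_samples <- c("ref1", "ref2")
clusters <- paste0("cluster", 1:4)

# Baseline mean per gene for each (reference sample, cluster) pair.
# These stand in for parameters that would be estimated from real data.
ref_means <- array(
  rgamma(n_genes * length(ref_samples) * length(clusters),
         shape = 2, rate = 0.5),
  dim = c(n_genes, length(ref_samples), length(clusters)),
  dimnames = list(paste0("gene", seq_len(n_genes)), ref_samples, clusters)
)
dispersion <- 0.2  # one shared NB dispersion, for simplicity (assumption)

# Simulated design: 2 groups with 2 samples each, i.e. 4 simulated samples.
groups <- c("A", "B")
n_per_group <- 2
paired <- FALSE

# Assign one reference sample to every simulated sample. With only 2
# reference samples and 4 simulated samples, references are reused.
if (paired) {
  # Paired design: sample i uses the same reference in both groups.
  picks <- sample(ref_samples, n_per_group,
                  replace = n_per_group > length(ref_samples))
  ref_for <- rep(picks, times = length(groups))
} else {
  n_total <- n_per_group * length(groups)
  ref_for <- sample(ref_samples, n_total,
                    replace = n_total > length(ref_samples))
}

group_id <- rep(groups, each = n_per_group)
design <- data.frame(
  sample_id = paste0("sample", rep(seq_len(n_per_group), times = 2),
                     ".", group_id),
  group_id = group_id,
  ref_id = ref_for,
  stringsAsFactors = FALSE
)

# Artificial differential signal: the first 5 genes of cluster1 get a
# log2 fold change of 2 in group B (arbitrary illustrative choice).
de_genes <- 1:5
lfc <- 2

simulate_cells <- function(ref_id, group_id, cluster, n_cells = 20) {
  mu <- ref_means[, ref_id, cluster]
  if (group_id == "B" && cluster == "cluster1") {
    mu[de_genes] <- mu[de_genes] * 2^lfc
  }
  # genes x cells matrix of NB counts; mu is recycled down each column,
  # so row g is always drawn with mean mu[g].
  matrix(rnbinom(n_genes * n_cells, mu = mu, size = 1 / dispersion),
         nrow = n_genes)
}

# Simulate cluster1 for every sample in the design.
counts <- lapply(seq_len(nrow(design)), function(i) {
  simulate_cells(design$ref_id[i], design$group_id[i], "cluster1")
})
names(counts) <- design$sample_id

print(design)
```

Setting `paired <- TRUE` in this sketch makes `sample1.A` and `sample1.B` share one reference sample, which mirrors the paired design described above.
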
...hope this somehow resolves the confusion, though I believe the methods section of the paper describes this quite well, probably better than I did here ;)