wmacnair/SampleQC

Feedback on SampleQC usage

Closed this issue · 8 comments

  • I noticed that make_qc_dt requires sample_id, while embed_sample_to_sample_MMDs looks for patient_id.

  • Running make_qc_dt gave this warning: found 0 mitochondrial genes, which is less than the expected 13 human genes; you may want to check your data.
    I see it looks for gene names in rownames(sce): I'd be nice to specify where gene names have to be stored in the manual.

  • Concerning counts/log_counts etc..., do you normalize counts/log_counts or use raw values?

  • concerning the html report:

  • it'd be great if users could click on plots to get the sample name, e.g., in 'αj PCA values' plots.

  • are QC1 and QC2 supposed to split samples according to their group? in my plot it simply splits samples in 2, but not according to group.
    Maybe that's the automatic clustering of samples, according to QC similarity, right ?
    If so, then what is the "condition" variable used for?

Code used:

rm(list =ls())
library('SampleQC')
library(muscData)

sce = Kang18_8vs8()

# specify the sample_id and condition
sce$sample_id = paste0(sce$ind, sce$stim, paste = "_")
sce$condition = sce$stim

qc_dt   = make_qc_dt(sce)
# gene names are in rownames(sce):
rownames(sce)

# which QC metrics do we want to use? (the most important bit)
qc_names        = c('log_counts', 'log_feats', 'logit_mito')
# which discrete-valued variables do we want to annotate the samples with?
annot_discrete  = c('well_id', 'patient_id', 'condition')
# which continuous-valued variables do we want to annotate the samples with?
annot_cont      = NULL

mmd_list    = calculate_sample_to_sample_MMDs(qc_dt, qc_names, subsample=200, n_times=20, n_cores=16)
mmd_list    = embed_sample_to_sample_MMDs(mmd_list, qc_dt, annot_discrete, annot_cont, n_nhbrs=5)

print(table(mmd_list$mmd_clusts))

set.seed(123)
system.time({
  em_list     = fit_sampleQC(mmd_list, qc_dt, qc_names, K_list=c(1,1,1,1))
})
# Once you've fit everything, you can render an html report and check whether it makes sense.

# define you project name and where you want to save these reports
proj_name   = 'Kang_ex'
save_dir    = '~/Desktop/Will SampleQC/Kang_ex/'
dir.create(save_dir)

# render the report!
make_SampleQC_report(mmd_list, em_list, save_dir, proj_name)

Thanks for these.

I'm slightly confused by the first one, regarding patient_id. Could you describe this a bit more? I've checked and I can't find patient_id anywhere in the package code.

On counts/log_counts, I don't normalize - typically normalization is a later step than QC.

You said "users could click on plots to get the sample name". Could you expand a bit here? Do you mean something like hovertext?

You asked about QC1 and QC2. Yes you're right, this is the automatic clustering of samples. The idea with condition (or whichever other annotation variable a user is interested in) is that you can check whether condition is associated with particular behaviour in the QC metrics, which might indicate presence of batch effects.

The Warning about patient_id comes from embed_sample_to_sample_MMDs (try to run the code I sent above):

mmd_list    = embed_sample_to_sample_MMDs(mmd_list, qc_dt, annot_discrete, annot_cont, n_nhbrs=5)
Warning message:
In embed_sample_to_sample_MMDs(mmd_list, qc_dt, annot_discrete,  :
  the following variables were requested as discrete annotations, but aren't present in qc_dt:
well_id, patient_id

Got it with QC1 and QC2, maybe just worth writing this in the output html (if it's not already there).

With "users could click on plots to get the sample name", I mean that the sample ids are not present in the plots from the "αj PCA values".
It'd be nice to have, either, the name of the samples in the plot (as for the MDS plots), or (fancier but probably less useful) an interactive plot where the name pops-up when clicking on the point.

Ok, thanks.

The patient_id, well_id warnings are because these are specified in the variable annot_discrete. So hopefully if you only annotation variables which are in your dataset, it should work. I'll update the readme to make this clearer.

Adding names of sample_id s to the plots is a good idea, I'll add that

A further comment: in the MMD plot, sample names are only present in the vertical axis, while absent on the horizontal one.
I'd be good if sample names were present in both axes (maybe at a 45 or 90 degree angle), facilitating matchings between pairs of samples (I currently get lost!).

Thanks for this. I'll add a flag for users to to switch the labels on / off.

  • User switch for axis labels in MMD distance matrix heatmp

Only a couple of these are still relevant:

  • I've removed the axis labels from the MMD heatmap
  • The mito issue is a duplicate of #10