Coding of target and control groups in DEA
Closed this issue · 5 comments
Thanks for this great package!
I noted, however, that the coding of groups for DEA is not what I would expect. Usually, comparisons labeled A vs B indicate that the fold change will be computed as A/B (or log2(A/B)). However, setup_contrasts() seems to code group A as 1 and group B as 2 (see below). This will be interpreted by limma lmFit in a way that computes log2FC = log2(B/A) (see also the limma user guide, chapter 9.2). I did not test the other methods, but for ebayes all logFC came out with the wrong sign compared to the average expression differences between groups.
mask = as.integer(tib$group %in% groups_a)
mask[tib$group %in% groups_b] = 2
Could this behavior be changed?
This was also confusing me, but I have seen in the userguide it is intended like that:
"typical wild-type vs knockout;
dataset = setup_contrasts(dataset, contrast_list = list(c("WT", "KO")))"
so I did not complain against that, but for me (and in most programs) also x vs y means y is denominator (compare against that), and we compare against WT or control as it is in MS-DAP.
I also have seen people using it the other way around, but I still feel that A vs B means FC=A/B.
However, at least it should be described in the manual/user guide that A vs B will compute the fold change as B/A.
Indeed, in MS-DAP a contrast for "A vs B" returns foldchanges for B/A. Good feedback, I will update the userguide to make this more clear.
Not sure if I'll add an option to change this behaviour soon though. If MS-DAP user X would use that option to flip from B/A to A/B, we must ensure that person Y who uses that output is aware of this change as well. We could e.g. annotate respective figures in report.pdf to indicate foldchanges have been flipped and perhaps add a keyword to the output Excel tables. But this would place a burden (/potential pitfall) on anyone using the MS-DAP output (e.g. some R script that operates on the DEA output Excel table, or dataset.RData file) to also check for "flipped foldchange". Better solution is probably to just setup your contrasts in reverse;
dataset = setup_contrasts(dataset, contrast_list = list(c("WT", "KO")))"
-->> change to ;
dataset = setup_contrasts(dataset, contrast_list = list(c("KO", "WT")))"
Here's some code to double-check this MS-DAP behaviour by processing a LFQbench DIA dataset included with the package and visualizing the input data and MS-DAP DEA results (perhaps these figures should go into the userguide?);
library(msdap)
### process Skyline DIA results from LFQbench study
f <- system.file("extdata", "Skyline_HYE124_TTOF5600_64var_it2.tsv.gz", package = "msdap")
dataset = import_dataset_skyline(f, confidence_threshold = 0.01, return_decoys = F, acquisition_mode = "dia")
# sample group definitions & setup contrast
dataset = sample_metadata_custom(dataset, group_regex_array = c(A = "007|009|011", B = "008|010|012") )
dataset = setup_contrasts(dataset, contrast_list = list(c("A", "B")))
# classify proteingroups per species
dataset$proteins$classification = regex_classification(dataset$proteins$fasta_headers, regex=c(human="_HUMA", yeast="_YEAS", ecoli="_ECOL"))
# feature selection: only peptides detected in 3+ replicates in each sample group, then apply normalization (vwmb algorithm)
dataset = filter_dataset(dataset,
filter_min_detect = 3,
norm_algorithm = c("vwmb", "modebetween_protein"),
by_group = F, all_group = T, by_contrast = F)
# DEA: apply limma's eBayes to each contrast and flag proteins as significant at 5% FDR and foldchange larger than a threshold estimated from bootstrap analyses (specified by parameter; fc_signif=NA)
dataset = dea(dataset, algo_de = "ebayes", qval_signif = 0.05, fc_signif = NA)
### plot peptide abundance distributions per sample as boxplots, split/facet by proteingroup species classifications
tib_plot = dataset$peptides %>%
# join peptide-level data with proteingroup classifications and sample metadata
left_join(dataset$proteins, by="protein_id") %>%
left_join(dataset$samples %>% select(sample_id, group), by="sample_id") %>%
# remove proteingroups that match multiple species
filter(classification != "discard") %>%
# sample labels for downstream ggplot
mutate(sample_label = paste0("group:", group, " ", sample_id),
sample_label = factor(sample_label, levels = sort(unique(sample_label), decreasing = T)) )
ggplot(tib_plot, aes(x = intensity, y = sample_label, fill = group)) +
geom_boxplot(show.legend = F) +
facet_wrap(.~classification, ncol = 1) +
labs(x = "log2 peptide intensity", y = "", title = "ecoli abundances are lower in group A, higher in group B\nyeast proteins are the opposite")
### plot DEA results; foldchange distributions, color-code by proteingroup species classifications
ggplot(dataset$de_proteins %>% left_join(dataset$proteins, by="protein_id"), aes(foldchange.log2, colour = classification)) +
geom_density() +
labs(title = "contrast: 'A vs B' -->> if log2 foldchanges are positive,\nabundance values are higher in group B as compared to A\ni.e. MS-DAP output foldchanges are log2(B/A).\ne.g. 'control vs disease' -->> positive log2 foldchange implies values are up in disease")
Thanks for your reply. I am fine with the current behavior as long as it is documented. I would appreciate if this could also be reflected in the function reference (e.g. ?setup_contrasts), since this is where I would look first and this is also potentially more long-lived than the user guides.
I don't believe changing control and target groups in the contrast definition would help, as this would also change the contrast label in the same direction.
For my purpose, I used the fact that setup_contrasts adds per-contrast columns to dataset$sample, and flipped the groups manually like that:
for (cc in dataset_contrasts(dataset)) {
dataset$samples[[cc]] <- ifelse(dataset$samples[[cc]]==1, 2, ifelse(dataset$samples[[cc]]==2, 1, dataset$samples[[cc]]))
}
This would keep the contrast labels but change the direction of the contrast. As I said, IMO there is no need to actually add this to the package.
Thanks again for the feedback and helpful discussions.
We've just released version 1.0.3 and documented the MS-DAP foldchange behaviour at;
- report.pdf , at the start of the "Differential abundance analysis" section
- differential_abundance_analysis.xlsx , in the table legends on sheet 1
- setup_contrasts() R function documentation
- dea() R function documentation