ftwkoopmans/msdap

Coding of target and control groups in DEA

Closed this issue · 5 comments

Thanks for this great package!

I noted, however, that the coding of groups for DEA is not what I would expect. Usually, comparisons labeled A vs B indicate that the fold change will be computed as A/B (or log2(A/B)). However, setup_contrasts() seems to code group A as 1 and group B as 2 (see below). This will be interpreted by limma lmFit in a way that computes log2FC = log2(B/A) (see also the limma user guide, chapter 9.2). I did not test the other methods, but for ebayes all logFC came out with the wrong sign compared to the average expression differences between groups.

mask = as.integer(tib$group %in% groups_a)
mask[tib$group %in% groups_b] = 2

Could this behavior be changed?

This was also confusing me, but I have seen in the userguide it is intended like that:
"typical wild-type vs knockout;
dataset = setup_contrasts(dataset, contrast_list = list(c("WT", "KO")))"
so I did not complain against that, but for me (and in most programs) also x vs y means y is denominator (compare against that), and we compare against WT or control as it is in MS-DAP.

I also have seen people using it the other way around, but I still feel that A vs B means FC=A/B.
However, at least it should be described in the manual/user guide that A vs B will compute the fold change as B/A.

Indeed, in MS-DAP a contrast for "A vs B" returns foldchanges for B/A. Good feedback, I will update the userguide to make this more clear.

Not sure if I'll add an option to change this behaviour soon though. If MS-DAP user X would use that option to flip from B/A to A/B, we must ensure that person Y who uses that output is aware of this change as well. We could e.g. annotate respective figures in report.pdf to indicate foldchanges have been flipped and perhaps add a keyword to the output Excel tables. But this would place a burden (/potential pitfall) on anyone using the MS-DAP output (e.g. some R script that operates on the DEA output Excel table, or dataset.RData file) to also check for "flipped foldchange". Better solution is probably to just setup your contrasts in reverse;
dataset = setup_contrasts(dataset, contrast_list = list(c("WT", "KO")))"
-->> change to ;
dataset = setup_contrasts(dataset, contrast_list = list(c("KO", "WT")))"

Here's some code to double-check this MS-DAP behaviour by processing a LFQbench DIA dataset included with the package and visualizing the input data and MS-DAP DEA results (perhaps these figures should go into the userguide?);

library(msdap)

### process Skyline DIA results from LFQbench study
f <- system.file("extdata", "Skyline_HYE124_TTOF5600_64var_it2.tsv.gz", package = "msdap")
dataset = import_dataset_skyline(f, confidence_threshold = 0.01, return_decoys = F, acquisition_mode = "dia")

# sample group definitions & setup contrast
dataset = sample_metadata_custom(dataset, group_regex_array = c(A = "007|009|011", B = "008|010|012") )
dataset = setup_contrasts(dataset, contrast_list = list(c("A", "B")))

# classify proteingroups per species
dataset$proteins$classification = regex_classification(dataset$proteins$fasta_headers, regex=c(human="_HUMA", yeast="_YEAS", ecoli="_ECOL"))

# feature selection: only peptides detected in 3+ replicates in each sample group, then apply normalization (vwmb algorithm)
dataset = filter_dataset(dataset,
                         filter_min_detect = 3,
                         norm_algorithm = c("vwmb", "modebetween_protein"),
                         by_group = F, all_group = T, by_contrast = F)

# DEA: apply limma's eBayes to each contrast and flag proteins as significant at 5% FDR and foldchange larger than a threshold estimated from bootstrap analyses (specified by parameter; fc_signif=NA)
dataset = dea(dataset, algo_de = "ebayes", qval_signif = 0.05, fc_signif = NA)


### plot peptide abundance distributions per sample as boxplots, split/facet by proteingroup species classifications
tib_plot = dataset$peptides %>% 
  # join peptide-level data with proteingroup classifications and sample metadata
  left_join(dataset$proteins, by="protein_id") %>% 
  left_join(dataset$samples %>% select(sample_id, group), by="sample_id") %>%
  # remove proteingroups that match multiple species
  filter(classification != "discard") %>%
  # sample labels for downstream ggplot
  mutate(sample_label = paste0("group:", group, " ", sample_id),
         sample_label = factor(sample_label, levels = sort(unique(sample_label), decreasing = T)) )

ggplot(tib_plot, aes(x = intensity, y = sample_label, fill = group)) + 
  geom_boxplot(show.legend = F) + 
  facet_wrap(.~classification, ncol = 1) + 
  labs(x = "log2 peptide intensity", y = "", title = "ecoli abundances are lower in group A, higher in group B\nyeast proteins are the opposite")


### plot DEA results; foldchange distributions, color-code by proteingroup species classifications
ggplot(dataset$de_proteins %>% left_join(dataset$proteins, by="protein_id"), aes(foldchange.log2, colour = classification)) + 
  geom_density() +
  labs(title = "contrast: 'A vs B' -->> if log2 foldchanges are positive,\nabundance values are higher in group B as compared to A\ni.e. MS-DAP output foldchanges are log2(B/A).\ne.g. 'control vs disease' -->> positive log2 foldchange implies values are up in disease")

Thanks for your reply. I am fine with the current behavior as long as it is documented. I would appreciate if this could also be reflected in the function reference (e.g. ?setup_contrasts), since this is where I would look first and this is also potentially more long-lived than the user guides.

I don't believe changing control and target groups in the contrast definition would help, as this would also change the contrast label in the same direction.

For my purpose, I used the fact that setup_contrasts adds per-contrast columns to dataset$sample, and flipped the groups manually like that:

for (cc in dataset_contrasts(dataset)) {
  dataset$samples[[cc]] <- ifelse(dataset$samples[[cc]]==1, 2, ifelse(dataset$samples[[cc]]==2, 1, dataset$samples[[cc]]))
}

This would keep the contrast labels but change the direction of the contrast. As I said, IMO there is no need to actually add this to the package.

Thanks again for the feedback and helpful discussions.

We've just released version 1.0.3 and documented the MS-DAP foldchange behaviour at;

  • report.pdf , at the start of the "Differential abundance analysis" section
  • differential_abundance_analysis.xlsx , in the table legends on sheet 1
  • setup_contrasts() R function documentation
  • dea() R function documentation