ftwkoopmans/msdap

test for interaction effects?

jgn301 opened this issue · 3 comments

Dear Frank,
thanks for your great package. I was wondering if it is possible to analyse interaction effects in differential expression analysis.
For example, suppose that I have a wildtype and a mutant organism, and I treat both with some chemical compound. I would like to know if the chemical compounds affects the mutant differently than the wildtype, e.g. if the mutation makes the organism resistant to the chemical compound.
Thank you and kind regards,
jgn301

Hi,

Currently, statistical analysis in MS-DAP is designed for A/B testing. I’ll preface with some background/limitations and then describe 2 possible solutions to your question / use-case.

MS-EmpiRe is a statistical model that works really well in our hands but it only supports A/B testing. DEqMS and MSqRob basically apply linear regression models, so with MS-DAP users can add random variables (columns in sample metadata table) to the linear regression model when A/B testing. The MSqRob implementation/code states that contrasts can only be constructed from variables that were specified as a random effect (i.e. no interaction). So we can only do A/B testing with the peptide-level statistical models. But with DEqMS, which is an extension of limma, we could run some custom code to apply any design matrix (second solution described below).

Simple solution;

1) use MS-DAP as per usual to test the following two independent contrasts using DEqMS or MS-EmpiRe;
A) “wildtype organism with control chemical” versus “wildtype organism with chemical X”
B) “mutant organism with control chemical” versus “mutant organism with chemical X”

2) compare the significant hits between both contrasts;

  • Scatterplot the subset of proteins (protein-groups really) that were significant at 5% FDR in either contrast, such that the x-axis and y-axis show the log2 foldchange (or effectsize) of contrast 1 and 2, respectively.
  • Color-code proteins by -log10 qvalue in contrast 1. Repeat for color-coding by contrast 2.
  • We did this for a study in our lab earlier, and it clearly highlighted proteins differentially affected by a chemical compound between wildtype and mutant mice (i.e. if there is no difference, all proteins would be shown as a diagonal line).

Solution 2, writing code;

  • import dataset, fasta, metadata as usual
  • apply peptide filtering and normalization
  • run custom DEqMS code with your design matrix of choice hardcoded

I’ve added an implementation below, which I tested on an in-house dataset. I’m quite pressed for time with a few deadlines coming up the next month so I’m unable to create a new MS-DAP release with this feature integrated atm, but I think this code should help you do the stats in any case. Do note the code comments (i.e. "!! adjust params") as you’ll have to tweak to your situation, especially to make sure your input sample metadata table matches the formula used to generate the design matrix!

You’ll find the statistics for all coefficients in the result variable. You’ll have to filter this table for the coefficient you’re interested in and then you can work with the results.

In this reference authors show various experimental designs and how to apply them using limma (same principles apply to DEqMS example code below); https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7873980/ . Check for instance Figure 9, which is the example I here implemented. But with this snippet of example code you should be able to run any of the described design matrices.

I hope this helps! Please let me know if everything worked out, and feel free to send a message if you have any questions/suggestions/find bugs.

Cheers,
Frank

R code

#### step 1; import data as per usual
library(msdap)
## !! adjust params; update file paths and import function for your dataset
dataset = import_dataset_spectronaut(filename="C:/DATA/.../report.tsv", confidence_threshold = 0.01, use_normalized_intensities = FALSE, do_plot = TRUE)
dataset = import_fasta(dataset, files = c("C:/.../UP000000589_10090.fasta","C:/DATA/.../UP000000589_10090_additional.fasta"))
dataset = import_sample_metadata(dataset, filename = "C:/DATA/.../sample_metadata.xlsx")
# print sample metadata matrix. Importantly, these column names are the variable names available in the DEqMS design matrix later on
print(dataset$samples, n=Inf)



#### step 2; apply filtering and normalization
## !! adjust params; these are settings I used when testing a DIA dataset with 5 replicates per phenotype*drug combination. if you have many more reps, increase.
# if you use DDA, reduce filter_min_detect (i.e. you don't need MS/MS identification of each peptide in all samples ofc.)
dataset = filter_dataset(dataset, filter_min_detect = 3, filter_min_quant = 3, norm_algorithm = c("vwmb","modebetween_protein"), by_group = F, by_contrast = F, all_group = T)



#### step 3; apply DEqMS
# subset of peptides that passed filtering
peptides_for_contrast = dataset$peptides %>%
  select(sample_id, protein_id, peptide_id, intensity = intensity_all_group) %>% # have to rename intensity column for downstream code
  filter(is.finite(intensity))
# protein-matrix
protein_matrix = rollup_pep2prot(peptides_for_contrast, intensity_is_log2 = T, algo_rollup = "maxlfq", return_as_matrix = T)
# aligned sample metadata
protein_matrix__metadata = tibble(sample_id = colnames(protein_matrix)) %>% left_join(dataset$samples, by = "sample_id")


## design matrix
# !! adjust params; here we define the experimental design (literature reference, see PubMed ID: 33604029)
# !! importantly, these must ofc match column names in the `dataset$samples` table  (which was parsed from your input metadata provided earlier @ import_sample_metadata(...) )
# This example includes interaction effect (see printed model matrix @ next line)
contr_design = stats::model.matrix(~phenotype*drug, data = protein_matrix__metadata)
print(contr_design)


## limma eBayes
fit = limma::eBayes(limma::lmFit(protein_matrix, contr_design))

## DEqMS
# gather peptide-per-protein counts
prot_pep_count = peptides_for_contrast %>% distinct(protein_id, peptide_id) %>% count(protein_id, name = "peptides_used_for_dea")
# an aligned vector of counts; use match() to enforce the count vector is sorted exactly as the data matrix used to fit limma::ebayes()
fit$count = prot_pep_count$peptides_used_for_dea[match(rownames(protein_matrix), prot_pep_count$protein_id)]
# overwrite the fit object with DEqMS's
fit = suppressWarnings(DEqMS::spectraCounteBayes(fit))
# suppressWarnings(DEqMS::VarianceBoxplot(fit, n=20, main = "DEqMS QC plot", xlab="#unique peptides per protein"))
# QC; you have to provide a design matrix that yields some coefficients (beyond intercept)
stopifnot(ncol(fit$coefficients) > 1)

## extract stats for all coefficients
result = NULL
cols_coeff = tail(colnames(fit$coefficients), -1) # coefficients to extract. The first is the intercept, which we skip
for(col_coeff in cols_coeff) {
  # !! sort.by="none" keeps the output table aligned with input matrix
  fit_table = suppressMessages(limma::topTable(fit, number = nrow(protein_matrix), coef = col_coeff, adjust.method = "fdr", sort.by = "none", confint = TRUE))
  stopifnot(rownames(fit_table) == rownames(fit)) # because we didn't sort, tables are aligned
  fit_table$tstatistic = fit$sca.t[,col_coeff]
  fit_table$pvalue = fit$sca.p[,col_coeff]
  fit_table$qvalue = p.adjust(fit_table$pvalue, method = "fdr")
  ## fit$coefficients and fit$stdev.unscaled contain the intercept, remove from ES and SE computation
  # eBayes effect size: Cohen's d in limma, according to Gordon Smyth   https://support.bioconductor.org/p/71747/#71781
  fit_table$effectsize = fit$coefficients[,col_coeff] / sqrt(fit$sca.postvar)
  # eBayes standard error, according to Gordon Smyth   https://support.bioconductor.org/p/70175/
  fit_table$standarderror = sqrt(fit$sca.postvar) * fit$stdev.unscaled[,col_coeff]
  fit_table$standarddeviation = sqrt(fit$sca.postvar)

  ## create a fit_table tibble that contains all columns required for downstream compatability with this pipeline; protein_id, pvalue, qvalue, foldchange.log2, algo_de
  result = bind_rows(
    result,
    as_tibble(fit_table) %>%
      mutate(protein_id = rownames(fit_table)) %>%
      select(protein_id, pvalue, qvalue, foldchange.log2 = logFC, effectsize, tstatistic, standarddeviation, standarderror) %>%
      add_column(algo_de = "deqms_custom",
                 coefficient = col_coeff)
  )
}


## finally, add protein metadata to the statistical results
result = dataset$proteins %>% select(protein_id, fasta_headers, gene_symbols_or_id) %>% right_join(result, by = "protein_id")


# example plot; print pvalue histograms for all coefficients
ggplot(result, aes(pvalue)) + geom_histogram(fill = "white", colour = "black", bins = 50) + facet_wrap(.~coefficient, ncol = 1) + theme_bw()

design matrix

For reference, this is what the design matrix for my test dataset looked like this. below is the output of the above code print(contr_design)

   (Intercept) phenotypeWT drugCompoundX phenotypeWT:drugCompoundX
1            1           0       0                   0
2            1           0       0                   0
3            1           0       0                   0
4            1           0       0                   0
5            1           0       0                   0
6            1           0       0                   0
7            1           0       1                   0
8            1           0       1                   0
9            1           0       1                   0
10           1           0       1                   0
11           1           0       1                   0
12           1           0       1                   0
13           1           1       0                   0
14           1           1       0                   0
15           1           1       0                   0
16           1           1       0                   0
17           1           1       0                   0
18           1           1       0                   0
19           1           1       1                   1
20           1           1       1                   1
21           1           1       1                   1
22           1           1       1                   1
23           1           1       1                   1
24           1           1       1                   1

Hi Frank,
solution 2 worked perfectly for me. Thanks alot!

On a side note: is it possible to put a continuous covariate as random variable in the "standard" DE analysis implemented in MSDAP? I tried and received a limma error about "no residual DFs", which made me think that the numerical covariate is interpreted as a factor.

Kind regards
jgn301

solution 2 worked perfectly for me. Thanks alot!

Ok great!

On a side note: is it possible to put a continuous covariate as random variable in the "standard" DE analysis implemented in MSDAP? I tried and received a limma error about "no residual DFs", which made me think that the numerical covariate is interpreted as a factor.

Yea that's right, in the current implementation all random variables from sample metadata are treated as factors. You can use the code above to apply alternative experimental designs for now. I'm planning for MS-DAP updates to make it easier to run custom design matrices and include continous covariates, but I'm a little pressed for time atm so please use above codebase in the meanwhile.

I'll close this issue. If you have other questions or feedback, please feel free to open another ticket!