SchlossLab/mikropml

Need function to compare models

Closed this issue · 5 comments

Would be nice to have a function that we can give two models and it outputs a p-value. @BTopcuoglu had something like this in her mBio paper repository. Here's what @courtneyarmour has modified for use with mikropml...

library(tidyverse)
library(mosaic)

# function to calculate pvalues
perm_p_value_cond <- function(data, cond_name_1, cond_name_2){
  # subset results to select AUC and condition columns and
  # filter to only the two conditions of interest
  test_subsample <- data %>%
    select(AUC,condition) %>%
    filter((condition == cond_name_1 | condition == cond_name_2))
  # quantify the absolute value of the difference in mean AUC between the two conditions (observed difference)
  obs <- abs(diff(mosaic::mean(AUC ~ condition, data = test_subsample)))
  # use mosaic to shuffle condition labels and calculate difference in mean AUC - repeat 10,000 times
  auc.null <- do(10000) * diff(mosaic::mean(AUC ~ shuffle(condition), data = test_subsample))
  n <- length(auc.null[,1]) #number of shuffled calculations
  # r = #replications at least as extreme as observed effect
  r <- sum(abs(auc.null[,1]) >= obs) #count number of shuffled differences as big or bigger than observed difference
  # compute Monte Carlo p-value with correction (Davison & Hinkley, 1997)
  p.value=(r+1)/(n+1)
  return(p.value)
}

# example input data
data <- read_csv("./example_data.csv")

# conditions to compare
conditions <- c("conditionA","conditionB","conditionC")

# generate table of all pairwise comparisons
p.table.condition <- NULL
p.table.condition <- expand_grid(x=1:length(conditions),
                                 y=1:length(conditions),
                                 ) %>%
  filter(x < y) %>%
  mutate(condition1 = conditions[x], condition2 = conditions[y]) %>%
  select(-x, -y) %>%
  group_by(condition1, condition2) %>%
  summarize(p_value = perm_p_value_cond(data,condition1,condition2),
            .groups="drop")
write_csv(p.table.condition,"./pvalues_by_condition.csv")

perm_p_value_cond is probably close to what we need. Is this in scope for mikropml?

Great, we should be able to adapt @courtneyarmour's code and incorporate it in the package. It sounds in-scope to me!

Do you want to take a stab at this @courtneyarmour? There are a few things that should probably be changed in this code before incorporation into mikropml:

  • If you write a few unit tests with input and expected output, that would make it easier for anyone to work on this.
  • It'd be ideal if it didn't add a new dependency (mosaic). In get_feature_importance(), we implemented a permutation test with future.apply functions, so this could use a similar approach to allow for parallel processing.
  • Need to use package::function() syntax when referring to functions except for those in mikropml or base R.
  • No magic numbers. There should probably be a parameter called nperms with a default value of 10000.
  • We mostly adhere to the tidyverse styleguide, so:
    • Underscores (_) are preferred over dots (.) to separate words in variable names.
    • Variable names should be descriptive. Only in rare cases are single-letter variable names okay.
    • Function names should start with a verb.
    • Use tidyverse functions instead of base R when working with data frames.
    • Avoid using reserved words like data.

sounds good, I'll give it a go

@kelly-sovacool should this go in a new R script or is there an existing script it can be added to?

@courtneyarmour I would probably put it in a new R script. Maybe compare_models.R? They're just named & organized topically.