use of `case.weights` versus `class.weights` in the case of a binary response?
mikoontz opened this issue · 0 comments
I noticed some unexpected behavior of the permutation importance in the case of a binary response variable when using a regression approach for the random forest model. Variables that were highly important based on other "importance" metrics (e.g., mean minimum tree depth, large differences in predicted value across a gradient of that variable, number of times a variable is the root of a tree, the cross-validated importance value I get by using `spatialRF::rf_importance()`) were showing up as strongly negative in the standard `$variable.importance`.
Some details
I built some {ranger} models directly to try to suss this out and think I've identified that this arises when treating a binary response as a regression problem.
My (naive) understanding is that the `class.weights` argument of `ranger()` is the best way to account for class imbalance given a binary (or other categorical) response. I believe that the {spatialRF} machinery (e.g., using `spatialRF::case_weights()`) passes that information along to `case.weights` instead of `class.weights`.
I am having a hard time understanding how `case.weights` and `class.weights` are being used in `ranger()`, but the permutation importance when building a {ranger} model directly, having a binary response, and treating it as a classification problem (rather than regression) seems to track much better with the other measures of variable importance I listed above. That makes me suspect this is a fundamental issue that comes up when (inappropriately??) treating a binary response as a regression problem and using `case.weights` to try to account for class imbalance.
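To make the distinction concrete: per the {ranger} documentation, `case.weights` are per-observation sampling weights (observations with larger weights are drawn more often into each tree's bootstrap sample), while `class.weights` are per-class weights applied in the splitting rule, given in the order of the factor levels, and only used for classification and probability forests. The sketch below uses a toy response (hypothetical data, not `plant_richness_df`) and assumes inverse-class-frequency weights, which may or may not match what `spatialRF::case_weights()` computes. It illustrates one thing to watch for: `unique()` on a case-weight vector returns values in order of first appearance in the data, which need not match the factor-level order that `class.weights` expects.

```r
# Toy imbalanced binary response (hypothetical data for illustration only)
y <- factor(c(1, 0, 0, 0, 1, 0, 0, 0))

# One weight per class, in factor-level order ("0" first, then "1").
# Assumption: inverse class frequency; spatialRF::case_weights() may differ.
class_wgts <- as.numeric(1 / table(y))
names(class_wgts) <- levels(y)

# One weight per observation, looked up from the class weights
case_wgts <- unname(class_wgts[as.character(y)])

# unique() follows order of appearance in the data, not factor-level order.
# Here the first observation is a "1", so the two orders disagree.
wgts_by_appearance <- unique(case_wgts)
wgts_by_level <- unname(class_wgts)
same_order <- isTRUE(all.equal(wgts_by_appearance, wgts_by_level))
print(same_order)

# Sketch of how each vector would be passed (not run here; x is hypothetical):
# ranger::ranger(x = x, y = y, probability = TRUE, class.weights = wgts_by_level)
# ranger::ranger(x = x, y = y, probability = TRUE, case.weights = case_wgts)
```

If the first row of the real data happens to belong to the first factor level, the two orderings coincide and the issue does not arise, so it is worth checking rather than assuming.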
Anyway, I'm still trying to read more to better understand the implications for building the model but I thought I'd flag it for now!
[edit: I'm pasting in some of my investigation code in case that's useful...]
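# Load the example data shipped with {spatialRF}
data(plant_richness_df, package = "spatialRF")
# Binary response: 1 if vascular species richness exceeds 5000, 0 otherwise
plant_richness_df$response_binomial <- ifelse(
  plant_richness_df$richness_species_vascular > 5000,
  1,
  0
)
case.wgts <- spatialRF::case_weights(data = plant_richness_df,
                                     dependent.variable.name = "response_binomial")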
predictor.variable.names <- colnames(plant_richness_df)[5:21]
# Regression problem with binary response and using case.weights
fm1 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = plant_richness_df[["response_binomial"]],
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = FALSE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)
fm1$variable.importance
# Classification problem with a factor as response variable, and using case.weights
fm2 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]),
                      data = plant_richness_df,
                      classification = TRUE,
                      probability = FALSE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)
fm2$variable.importance
# Probability estimation problem with a factor as response variable, and using case.weights
fm3 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]),
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      case.weights = case.wgts,
                      importance = "permutation",
                      seed = 1)
fm3$variable.importance
# Probability estimation with a factor as response variable, and using class.weights
fm4 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]),
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      class.weights = unique(case.wgts),
                      importance = "permutation",
                      seed = 1)
fm4$variable.importance
# Probability estimation with a factor as response variable, and using both class.weights and case.weights
fm5 <- ranger::ranger(x = plant_richness_df[, predictor.variable.names],
                      y = as.factor(plant_richness_df[["response_binomial"]]),
                      data = plant_richness_df,
                      classification = FALSE,
                      probability = TRUE,
                      case.weights = case.wgts,
                      class.weights = unique(case.wgts),
                      importance = "permutation",
                      seed = 1)
fm5$variable.importance
# spatialRF
fm6 <- spatialRF::rf(data = plant_richness_df,
                     dependent.variable.name = "response_binomial",
                     predictor.variable.names = predictor.variable.names,
                     seed = 1)
fm6$variable.importance
# spatialRF
fm7 <- spatialRF::rf(data = plant_richness_df,
                     dependent.variable.name = "response_binomial",
                     predictor.variable.names = predictor.variable.names,
                     seed = 1)
fm7$variable.importance # the {spatialRF} version creates the same model as fm1
fm1$variable.importance