tlverse/tmle3

CV-TMLE vs TMLE

olivierlabayle opened this issue · 0 comments

Hello,

I am following the tutorial and trying to look at the difference between CV-TMLE and TMLE with the perinatal dataset.

perinatal.csv

To keep things simple I only use a glm as the model for both the propensity score and the outcome mean. I am surprised to see that the output is exactly the same for both procedures. The CV-TMLE seems to complain about glm not being "CV-aware" which might be the reason. However I don't understand why that should be the case. My understanding of CV-TMLE is that:

  • The dataset should be splitted in V folds
  • The glm models (for both A and Y) should be fitted on each split, so we should have V instantiations of each glm each trained on a different split.
  • The targeting step is pooled from predictions of the V glm model pairs on their respective validation sets
  • The final estimate is the average of estimates across validation folds
  • The influence curve (I am not entirely sure if it is pooled across validation samples or if multiple variance estimates are made and averaged)

As I understand it, we could have used a Super Learning instead of a GLM which would have resulted in another nested cross-validation procedure but Super Learning is not a requirement of CV-TMLE. The code to reproduce is below: you can tweak the learner_list to change to a super learner and then 2 different outputs are returned and no "CV-aware" complaint is formulated.

I would appreciate some clarification on the procedure and why this is happening! Thanks!

library(data.table)
library(tmle3)
library(sl3)

data = read.csv("perinatal.csv")

node_list <- list(
  W = c(
    "apgar1", "apgar5", "gagebrth", "mage", "meducyrs", "sexn"
  ),
  A = "parity01",
  Y = "haz01"
)

glm = Lrnr_glm$new()
lrn_mean = Lrnr_mean$new()
sl <- Lrnr_sl$new(learners = Stack$new(glm, lrn_mean), metalearner = Lrnr_nnls$new())

learner_list <- list(A = glm, Y = glm)
# learner_list = list(A=sl, Y = sl)

ate_spec <- tmle_ATE(
  treatment_level = 1,
  control_level = 0
)

tmle_task <- ate_spec$make_tmle_task(data, node_list)
initial_likelihood <- ate_spec$make_initial_likelihood(
  tmle_task,
  learner_list
)


targeted_likelihood_cv <- Targeted_Likelihood$new(initial_likelihood)

targeted_likelihood_no_cv <-
  Targeted_Likelihood$new(initial_likelihood,
    updater = list(cvtmle = FALSE)
  )

tmle_params_cv <- ate_spec$make_params(tmle_task, targeted_likelihood_cv)
tmle_params_no_cv <- ate_spec$make_params(tmle_task, targeted_likelihood_no_cv)

tmle_no_cv <- fit_tmle3(
  tmle_task, targeted_likelihood_no_cv, tmle_params_no_cv,
  targeted_likelihood_no_cv$updater
)
tmle_no_cv
# -0.1855909

tmle_cv <- fit_tmle3(
  tmle_task, targeted_likelihood_cv, tmle_params_cv,
  targeted_likelihood_cv$updater
)
tmle_cv
# -0.1855909