sahirbhatnagar/casebase

Currently, the fast version of absoluterisk does not work with cv.glmnet

Closed this issue · 3 comments

i've done a little digging and its this particular line
output[,j + 1] <- trap_int(knots, exp(pred))[knots %in% c(0, time_ordered)]
in absoluteRisk that ends up giving inf for every integration. The reason for this is that glmnet has time as the second variable, and the positional information is important in matrix format. here,
newdata2[,object$timeVar := knots]

added time as a variable at the end of the data frame. the order between betas and coefficients no longer match and we get this straight line (like before, but this issue happens in the main function that all models call now.)

One fix is to change the data format in the fitting function (fitsmoothhazard.fit) to match the expected output. another option is to make a separate estimate_risk_newtime function. Not sure which one you'd prefer so figured I would post it here!

library(ggplot2)
library(casebase)
#> See example usage at http://sahirbhatnagar.com/casebase/
library(survival)
#> Warning: package 'survival' was built under R version 3.6.2

completeData <- casebase::support
workingCompleteData <- model.matrix(death ~ . - d.time, data = completeData)[, -c(1)] # remove intercept
# Create u and xCox, which will be used when fitting Cox with glmnet.
newData <- workingCompleteData
x <- workingCompleteData
# x and y will be used to fit the casebase model under glmnet.
# x=as.matrix(sparse.model.matrix(death~ .-d.time+0,data=workingData))
y <- data.matrix(completeData[, c(2, 5)])

fullDataCox <- as.data.frame(cbind(as.numeric(y[, 2]), as.numeric(y[, 1]), x))
train_index <- sample(1:nrow(fullDataCox), 0.8 * nrow(fullDataCox))
test_index <- setdiff(1:nrow(fullDataCox), train_index)

train <- fullDataCox[train_index, ]
test <- fullDataCox[test_index, ]



#ratio <- 100 # universal ratio for TRUE

# x and y will be used to fit the casebase model under glmnet.
y <- data.matrix(train[, c(2, 1)])
colnames(y) <- c("death", "d.time")
x <- data.matrix(train[, -c(2, 1)])
newData <- data.matrix(test[, -c(2, 1)])

cbFull <- casebase::fitSmoothHazard.fit(x, y, family = "glmnet", time = "d.time", event = "death", formula_time = ~ log(d.time), alpha = 1, ratio = 1, standardize = TRUE)

# Estimating the absolute risk curve using the newData parameter.
abCbFull <- casebase::absoluteRisk(cbFull, time = seq(1, max(support$d.time), 25), newdata = newData, s = "lambda.1se", method = c("numerical"))

abCB <- data.frame(abCbFull[, 1], rowMeans(abCbFull[, -c(1)]))

colnames(abCB) <- c("Time", "Absolute risk")

plot(x=abCB$Time,y=abCB$`Absolute risk`,type="l")

Created on 2020-03-06 by the reprex package (v0.3.0)

I'll look into this. I'm not quite sure I agree with your comments:

i've done a little digging and its this particular line
output[,j + 1] <- trap_int(knots, exp(pred))[knots %in% c(0, time_ordered)]
in absoluteRisk that ends up giving inf for every integration. The reason for this is that glmnet has time as the second variable, and the positional information is important in matrix format. here,
newdata2[,object$timeVar := knots]

I don't think the problem can be in the line you mention, it must happen before, probably when we get the predictions pred. But your reprex definitely shows there is a problem, thanks!

For the record, here's a minimal reproducible example:

library(casebase)
#> See example usage at http://sahirbhatnagar.com/casebase/

x <- model.matrix(death ~ . - d.time - 1, data = support)
y <- with(support, cbind(death, d.time))

# x and y will be used to fit the casebase model under glmnet.
fit_cb <- casebase::fitSmoothHazard.fit(x, y, family = "glmnet", 
                                        time = "d.time", event = "death", 
                                        formula_time = ~ log(d.time), 
                                        alpha = 1, ratio = 5, standardize = TRUE)

# Estimating the absolute risk curve using the newData parameter.
abs_cb <- casebase::absoluteRisk(fit_cb, time = seq(1, max(support$d.time), 25), 
                                 newdata = x[1,,drop=FALSE], s = "lambda.1se", 
                                 method = "numerical")
plot(x = abs_cb[,"time"],
     y = abs_cb[,2],
     type = "l")

Created on 2020-03-06 by the reprex package (v0.3.0)

I've narrowed it down to how we deal with time. As @Jesse-Islam pointed out, we had a similar issue before and we fixed it, but now it's come back again:

library(casebase)
#> See example usage at http://sahirbhatnagar.com/casebase/

# With linear time----
fit_cb_lin <- casebase::fitSmoothHazard(death ~ d.time,
                                        time = "d.time", event = "death",
                                        data = support, ratio = 5)

abs_cb_lin <- casebase::absoluteRisk(fit_cb_lin, time = seq(1, max(support$d.time), 25),
                                     newdata = support[1,,drop=FALSE],
                                     method = "numerical")
plot(abs_cb_lin,
     type = "l")

# With log time----
fit_cb_log <- casebase::fitSmoothHazard(death ~ log(d.time),
                                        time = "d.time", event = "death",
                                        data = support, ratio = 5)

abs_cb_log <- casebase::absoluteRisk(fit_cb_log, time = seq(1, max(support$d.time), 25),
                                     newdata = support[1,,drop=FALSE], 
                                     method = "numerical")
plot(abs_cb_log,
     type = "l")

Created on 2020-03-06 by the reprex package (v0.3.0)

Note that the code above doesn't use glmnet, so the problem is much more general. And oddly enough, it seems restricted to log transformations (e.g. it works with splines).