therneau/survival

potential issue with survfit function when using weights and offset


Hi,

I built a model using weighted Cox regression and I am recalibrating the model after shrinkage. For that, I need to recalibrate the baseline survival of the model, which requires running the survfit function with an offset and weights. Everything works as expected if I don't use any weights, but when I use weights the baseline survival does not seem to be computed correctly.

Here's an example of the problem, using the ovarian dataset:

Unweighted Cox (works as expected)

library(survival)
# 1. Fit a cox model:
fit_cox <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data = ovarian)
lp <- predict(fit_cox, type = "lp")

# 2.1) Fit a cox model using the linear predictor directly. As expected, coefficient is 1
fit_lp <- coxph(Surv(futime, fustat) ~ lp, data = ovarian)
# 2.2) Now, use linear predictor as offset
fit_offsetlp <- coxph(Surv(futime, fustat) ~ offset(lp), data = ovarian)

# 3. Check that baseline survival from survfit is the same
summary(survfit(fit_lp), times = 800)$surv
summary(survfit(fit_offsetlp), times = 800)$surv

In this case, the baseline survival is 0.5212506 for both models, which is expected: the coefficient of the linear predictor is 1, so fitting it as a covariate is equivalent to supplying it as an offset.
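As a quick sanity check on that reasoning (reusing the objects defined above):

coef(fit_lp)  # ~1, so the covariate model and the offset model imply the same baseline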

Repeat the exercise, but with a weighted Cox:

# 0. Compute weights for cox model
set.seed(1)  # so that weights can be reproduced
weights_cox_model <- sample(c(1, 20, 15), size = nrow(ovarian), replace = TRUE)

# 1. Fit cox model
fit_wgh_cox <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data = ovarian, weights = weights_cox_model)
lp <- predict(fit_wgh_cox, type = "lp")

# 2.1) Fit a cox model using the linear predictor directly. As expected, coefficient is 1
fit_lp <- coxph(Surv(futime, fustat) ~ lp, data = ovarian, weights = weights_cox_model)
# 2.2) Now, use linear predictor as offset
fit_offsetlp <- coxph(Surv(futime, fustat) ~ offset(lp), data = ovarian, weights = weights_cox_model)

# 3) Check that baseline survival from survfit is the same -> THEY ARE NOT!
summary(survfit(fit_lp), times = 800)$surv
summary(survfit(fit_offsetlp), times = 800)$surv

Here, the baseline survival of the model directly fitted with the linear predictor is 0.81, while the baseline survival for the model with the offset is 0.71.
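A quick check (reusing the objects above) shows that the coefficient of the linear predictor is still essentially 1, so the gap is not caused by the refitted coefficient:

coef(fit_lp)  # ~1 in the weighted case as well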

Is this the expected behaviour or is it a bug?

Thanks in advance!

I understand now what is happening, but I still think it is not the expected behaviour.

The difference between the two examples in the ovarian dataset is that the mean of the linear predictor in the unweighted Cox regression is 0, while in the weighted Cox it is 0.475.

In the survival package, the coefficients of numerical variables are always applied to centered values of those variables, and models fitted with offsets are no exception. So if we subtract the mean of the linear predictor, we get the expected baseline survival.

mean_lp <- mean(lp)
# Baseline survival with the linear predictor as a covariate:
summary(survfit(fit_lp), times = 800)$surv  # 0.8075316
# Baseline survival with the linear predictor as an offset, without correcting for centering:
baseline_surv_no_centering <- summary(survfit(fit_offsetlp), times = 800)$surv  # 0.7090529
# Baseline survival with the linear predictor as an offset, after correcting for centering:
baseline_surv_no_centering^exp(-mean_lp)  # 0.8075316

What makes this confusing is that in weighted Cox regressions without offsets, the numerical variables are centered using the weighted mean. For some reason, it seems that when an offset is used, the package uses the unweighted mean instead.
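For concreteness, a quick comparison of the two candidate reference values, assuming the weighted example above is still in the workspace:

weighted.mean(lp, weights_cox_model)  # ~0: the value survfit appears to center on for fit_lp
mean(lp)                              # ~0.475: the value survfit appears to center on for fit_offsetlp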

Note that I have been deeply embroiled in some home renovation over the last weeks and have spent minimal time monitoring or thinking about the survival package. However, I will say that I don't consider "THE" baseline survival to be a concept that has any real utility. What is important, though, is correct predicted values when you specify the newdata argument. Do you have a case where that goes wrong?

Hi,

thanks for the update!

I ran into this problem because I wanted to report the coefficients of the recalibrated model together with its baseline survival, so that people can compute survival probabilities for a patient at a specific timepoint t themselves (using the formula S0(t)^exp(shrunk linear predictor)).
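To illustrate, this is roughly what a reader of the report would do with the published numbers (all values and names below are made up for the sketch, not taken from my model):

S0_800      <- 0.80                                # reported baseline survival at t = 800 (made up)
beta_shrunk <- c(age = 0.10, ecog.ps = 0.35)       # reported shrunk coefficients (made up)
x_ref       <- c(age = 56, ecog.ps = 1.5)          # reported reference covariate values (made up)
x_new       <- c(age = 70, ecog.ps = 2)            # the new patient
S0_800 ^ exp(sum(beta_shrunk * (x_new - x_ref)))   # predicted survival probability at t = 800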

I thought that computing the baseline survival at time t was the most straightforward way to report the model, but perhaps you have an alternative suggestion then?

I didn't use the predict function in this case, but to try to answer your question about using the predict function with the newdata argument:

If I run:

predict(fit_lp, newdata = data.frame(lp = c(0, 2), futime = c(500, 800), fustat = c(0, 0)), type = "survival")

It runs for both the weighted and the unweighted Cox regression.

The same command using the model with the offset:

predict(fit_offsetlp, newdata = data.frame(lp = c(0, 2), futime = c(500, 800), fustat = c(0, 0)), type = "survival")

throws the error:
Error in newx %*% object$coef :
requires numeric/complex matrix/vector arguments

If type "lp" is used, all commands run fine. But this would be content for another github issue.

Thanks for the further input. An underlying issue which needs to be answered first is to decide what the predicted survival SHOULD be for a model that has an offset term. A problem is that offsets are used for several different things -- case-cohort designs, for instance.

Can you explain what you mean by recalibration, and how that leads to: "For that, I need to recalibrate the baseline survival of the model, which requires running the survfit function with an offset and weights." What do you mean by recalibrate?

With respect to new predictions, as long as you don't need a standard error then log(S(t, z)) = log(S(t,x))* exp(beta (z-x)), i.e., you can use absolutely any x vector as the starting point; no need to correctly guess what survfit.coxph will do if you don't give it a covariate vector.
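For instance, a minimal sketch of that identity using the weighted ovarian fit from earlier in the thread (fit_wgh_cox is assumed to still be available; the reference subject is arbitrary):

ref <- data.frame(age = 60, ecog.ps = 1)          # any reference covariate vector x
new <- data.frame(age = 70, ecog.ps = 2)          # the subject z of interest
s_ref  <- summary(survfit(fit_wgh_cox, newdata = ref), times = 800)$surv
lp_ref <- predict(fit_wgh_cox, newdata = ref, type = "lp")
lp_new <- predict(fit_wgh_cox, newdata = new, type = "lp")
s_ref ^ exp(lp_new - lp_ref)                      # S(800, z) via the identity
summary(survfit(fit_wgh_cox, newdata = new), times = 800)$surv  # same value, computed directly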

Prediction for a Cox model leads to some interesting cases when there is a newdata argument, namely what to do with strata, offset, and cluster (or id). An offset term can be used for many things, e.g., re-weighting of risk sets in a case-cohort design, for incorporating known coefficients from a prior model, and you are stating a third usage which I am unfamiliar with. For the first of these, any offset should be completely ignored when doing prediction, either linear predictor or survival curve. For the second, the new data will need to contain an appropriate per-subject offset. I need to understand what you are attempting to do, much more clearly, in order to go further in helping.

A second issue is a bad decision, made long ago, to allow a user to omit the newdata argument for a predicted survival curve and give them the prediction for some "default" subject. The chosen default is (a) rarely useful and (b) something people assume far too much about, especially since (c) precisely because it isn't useful, I have not worried much about the special cases (offset, weights, strata, cluster). If I could go back in time this would change, as it has done more harm than good. But I had a lot less experience then.

Thanks for the clarification!

Indeed, offset terms are used for different things.
In my case, what I mean by recalibrate is the following:

  1. I fitted a weighted Cox regression model on a case-control dataset, because I'm interested in obtaining absolute risk predictions from the model:
$$S(t,z) = S_0(t,x)^{\exp\{\beta(z-x)\}}$$
  2. I performed bootstrapping to obtain the optimism-corrected calibration slope. Since the calibration slope was a bit lower than 1, I used it in step 3 as the shrinkage factor (see section 2.1.3 of reference 1).
  3. I multiplied each beta coefficient from the model fitted in step 1 by the shrinkage factor.
  4. To report the final recalibrated model, I re-estimated the baseline survival ($S_{0,\mathrm{new}}$) by running another weighted Cox regression on the same case-control dataset, keeping the shrunk linear predictor fixed as an offset:
$$S(t,z) = S_{0,\mathrm{new}}(t,x)^{\exp\{\mathrm{shrink}\cdot\beta(z-x)\}}$$

What I found out after opening this GitHub issue is that, in terms of implementation, I have to further correct the last equation by subtracting the mean of the shrunk linear predictor:

$$S(t,z) = S_{0,\mathrm{new}}(t,x)^{\exp\{\mathrm{shrink}\cdot\beta(z-x) - \operatorname{mean}[\mathrm{shrink}\cdot\beta(z-x)]\}}$$

I assumed that the new "x" vector used as the starting point in step 4 would be the weighted mean of the shrunk linear predictor (0), because that is what happens when we run a weighted Cox model without offsets. As I explained in my second comment in this thread, this is not the case: survfit with an offset uses the unweighted mean instead of the weighted mean as the starting point.
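Putting steps 1, 3 and 4 together on the ovarian data, just to make the procedure concrete (the shrinkage factor below is made up, since the bootstrap of step 2 is omitted):

shrink <- 0.9                                         # made-up optimism-corrected calibration slope
fit_step1 <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data = ovarian,
                   weights = weights_cox_model)       # step 1: weighted Cox model
lp_shrunk <- shrink * predict(fit_step1, type = "lp") # step 3: shrunk linear predictor
fit_recal <- coxph(Surv(futime, fustat) ~ offset(lp_shrunk), data = ovarian,
                   weights = weights_cox_model)       # step 4: re-estimate the baseline
summary(survfit(fit_recal), times = 800)$surv         # new baseline survival at t = 800
# (this last quantity is the one affected by the centering behaviour discussed above)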

I hope this answers your question!

Refs:

1. Van Calster B, van Smeden M, De Cock B, Steyerberg EW. Regression shrinkage methods for clinical prediction models do not guarantee improved performance: Simulation study. Statistical Methods in Medical Research. 2020;29(11):3166-3178. doi:10.1177/0962280220921415

I think that the most reliable behavior is to do the following.
Say that fit1 <- coxph(Surv(time, status) ~ x1 + x2 + x3 + x4, data = mine) is the starting model, and that shrink is your shrinkage factor, some value < 1.

Then do fit2 <- coxph(Surv(time, status) ~ x1 + x2 + x3 + x4, data = mine, iter = 0, init = shrink * coef(fit1)).
Now get predicted survival curves from fit2.
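For instance, applied to the weighted ovarian example from this thread (a sketch; shrink is again a made-up value, and fit_wgh_cox is the weighted fit from earlier):

shrink <- 0.9                                           # made-up shrinkage factor
fit2 <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data = ovarian,
              weights = weights_cox_model, iter = 0,
              init = shrink * coef(fit_wgh_cox))
# ask for the curve of an explicit reference subject, so there is no guessing about the default
summary(survfit(fit2, newdata = data.frame(age = 60, ecog.ps = 1)), times = 800)$surv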


Nevertheless, your examples make a good point. I've added a new test case to verify that a coxph model with + offset(zed) can include 'zed' in the newdata argument, and exercised some more care in what the default will be when someone flies blind by not including one. (And fixed one error that occurred when newdata contained only an offset.) I still maintain that "the" baseline hazard is a meaningless term; you need to specify a reference curve of your choice.
I'll also update the vignette.

Test added, help pages updated, small fix to function.
The final message: when using the formula H(t; x) = H(t; z) * exp(beta * (x - z)), where H is the cumulative hazard, i.e. H = -log(survival), always create your "baseline" curve H(t; z) using a survfit call with newdata. Then you know what the baseline is. If the model had an offset term, include that variable in newdata too. Do not assume that you will correctly guess what comes back when you don't use the newdata argument --- documentation may sometimes be imperfect or incomplete.
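A minimal sketch of that pattern for the offset model in this thread (it assumes a survival version that includes the fix above, so that the offset variable lp can be supplied in newdata):

ref <- data.frame(lp = 0)                          # explicit reference subject z; the offset variable is included
H_ref <- -log(summary(survfit(fit_offsetlp, newdata = ref), times = 800)$surv)  # H(800; z)
exp(-H_ref * exp(2 - 0))                           # S(800; x) for a subject whose offset lp equals 2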