[BUG] `AalenAdditive` regressor predicts improper survival function
fkiraly opened this issue · 5 comments
The `AalenAdditive` regressor predicts improper survival functions, i.e., functions that are not monotonically decreasing or that do not stay in the expected range [0, 1]. Observed with lifelines 0.28.0.
To reproduce:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True, as_frame=True)
df = pd.concat([X, y], axis=1)
from lifelines.fitters.aalen_additive_fitter import AalenAdditiveFitter
aaf = AalenAdditiveFitter()
aaf.fit(df, duration_col="target")
y_pred_surv = aaf.predict_survival_function(df)
# not monotonically decreasing
np.sum(y_pred_surv.diff() > 0)  # counts entries with an increasing step, should all be 0
# outside expected range [0, 1]
np.sum(y_pred_surv > 1)  # counts entries strictly above 1, should all be 0
This is mentioned here as an artifact of the model. My intuition is that it comes from the additive form of the hazard function, rather than a multiplicative, exponentiated form as in Cox.
Hm, I wouldn't agree that this is a valid explanation, @bachnguyen-tomo.
I see the box in the documentation that makes a claim in line with your statement, but I'm not sure I agree with it. Why:
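Any non-negative, integrable function with infinite integral is a valid hazard function. This can be seen from writing the survival function as

$$S(t) = \exp\left(-\int_0^t h(s)\,\mathrm{d}s\right)$$

(this is a well-known proposition that relates the survival function and the hazard function/distribution)
So, no matter what the above equates to, as long as the hazard $h$ is non-negative, the resulting $S$ is a proper survival function: non-increasing and within $[0, 1]$.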
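For reference, a short sketch of why the sign of the hazard is what matters in this argument. The symbol $H$ for the cumulative hazard is introduced here for illustration only, and the derivative identity assumes $h$ is continuous at $t$:

$$
H(t) = \int_0^t h(s)\,\mathrm{d}s, \qquad S(t) = e^{-H(t)}, \qquad S(0) = 1, \qquad S'(t) = -h(t)\,S(t).
$$

Since $S(t) > 0$, the sign of $S'(t)$ is the opposite of the sign of $h(t)$. With $h \ge 0$ everywhere, $S$ is non-increasing and takes values in $(0, 1]$. If $h(t) < 0$ on some interval, $S$ increases there, and wherever $H(t) < 0$ we get $S(t) > 1$.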
In consequence of this theorem, there might be a bug?
But, I suppose this answers the more pragmatic question sufficiently, on whether this is something that people would expect to happen.
Given the note in the documentation, it seems that this is expected (in the social sense) behaviour of the algorithm, and in that sense, we could close this issue.
PS @bachnguyen-tomo, in case you have some input on what models that produce full distributions should do in this case, contribution here would be appreciated: sktime/skpro#249
@fkiraly The equation above assumes that the hazard function is non-negative, though. That is the main drawback of the regressor: it doesn't guarantee a non-negative hazard. First page.
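To make the point about negative hazards concrete, here is a minimal sketch in plain Python. It does not call lifelines, and the increments are made-up toy numbers. It assumes the survival curve is obtained as exp(-cumulative hazard). The helper names `survival_from_hazard_increments` and `count_violations` are hypothetical and only serve this illustration.

```python
import math


def survival_from_hazard_increments(increments):
    """Return S(t_k) = exp(-H(t_k)), where H is the running sum of the increments."""
    cumulative = 0.0
    survival = []
    for dh in increments:
        cumulative += dh
        survival.append(math.exp(-cumulative))
    return survival


def count_violations(survival):
    """Return (number of upward steps, number of values strictly above 1)."""
    increasing = sum(1 for a, b in zip(survival, survival[1:]) if b > a)
    above_one = sum(1 for s in survival if s > 1)
    return increasing, above_one


# Toy cumulative-hazard increments (assumed values, not lifelines output).
nonneg_increments = [0.125, 0.0, 0.25, 0.5]   # hazard never negative
mixed_increments = [-0.25, 0.5, -0.125, 0.5]  # hazard negative on two steps

proper = survival_from_hazard_increments(nonneg_increments)
improper = survival_from_hazard_increments(mixed_increments)

print(count_violations(proper))    # (0, 0)
print(count_violations(improper))  # (1, 1)
```

With non-negative increments both counts are zero. As soon as some increments are negative, the same two checks as in the reproduction script above flag an upward step and a value above 1.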