BiomedSciAI/causallib

ipw.estimate_population_outcome(Data_X, Data_a, Data_y) has some error.

Closed this issue · 2 comments

Hello.
I want to check the causal effect.

My code is

from causallib.estimation import IPW
from sklearn.linear_model import LogisticRegression

learner = LogisticRegression(penalty="l1", C= 1, max_iter=500, solver='liblinear')
ipw = IPW(learner)
ipw.fit(Data_X, Data_a)
potential_outcomes = ipw.estimate_population_outcome(Data_X, Data_a, Data_y) 
causal_effect = ipw.estimate_effect(potential_outcomes[1], potential_outcomes[0])
causal_effect

until ipw.fit(Data_X, Data_a), there is no problem.
But when I apply
potential_outcomes = ipw.estimate_population_outcome(Data_X, Data_a, Data_y)
there is an error "### /root/miniforge3/envs/python_3_9/lib/python3.9/site-packages/numpy/lib/function_base.py:550: RuntimeWarning: invalid value encountered in double_scalars
avg = np.multiply(a, wgt,"

and I check the ipw.compute_weights.
w = ipw.compute_weights(Data_X, Data_a)
This code doesn't have an error, but the values of w are all NaN.
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..

What is the problem?

Hi, thanks for raising this question.
The code looks ok, so my guess is that there is some problem with the data and/or modeling. The error you're getting is probably due to the IP-weights being NaNs. In turn, this could result because you have very extreme propensity scores (close to 0 or 1), so when you take their inverse you essentially divide by zero - for which Numpy returns a nan.

So first thing I would do is to see if your data is too separable, resulting in extreme probabilities to be treated. Note that if that's the case you might have a positivity/overlap violation, that might render your resulted associations not causal.
You can use causallib's evaluation tools for that:

from causallib.evaluation import evaluate
ipw_evaluation = evaluate(ipw, Data_X, Data_a, Data_y)
ipw_evaluation.plot_weight_distribution()

Alternatively, if you are confident in your data, you can probably solve the extreme weights by clipping/trimming the propensity scores.
You can do that by adding, for example, IPW(learner, clip_min=0.001, clip_max=0.999).
Note that if you'll also add verbose=True, you'll get a print saying the proportion of observations that were clipped. Keep an eye on that. Ideally, this number should be low - the larger it is the more observations are affected by the caliper value and not their model-estimated probability, and if you clip too many of your dataset than your actual modeling (here LASSO) has less of an impact.

ehudkr commented

I'll take this nonresponsiveness as a good sign :-)
I'm doing some housekeeping so I'm closing this issue for now.
Feel free to re-open if needed.