DoubleML/doubleml-for-py

[Bug]: Unexpected shapes for "ml_l" predictions and feature importances with multiple treatment variables

Nolan3036 opened this issue · 2 comments

Describe the bug

When I use my own data (three variables in D, four variables in X), the predictions for both "ml_l" and "ml_m" have shape (n_obs, iteration, number of variables in D). Shouldn't it be (n_obs, iteration, 1) for "ml_l"?
Furthermore, the feature importance scores of the fitted models for both "ml_l" and "ml_m" have shape (6,). Shouldn't they be (4,) in my case, since X has only four variables?
Your provided example works fine, but it has only one variable in D, so it is hard to debug from it; you can reproduce the issue with my code below.

I hope I am not missing anything, but if I am, please let me know. Thanks!

Minimum reproducible code snippet

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

import doubleml as dml
from doubleml import DoubleMLData

test1 = pd.DataFrame({
    'd1': np.random.randn(100),
    'd2': np.random.randn(100),
    'd3': np.random.randn(100),
    'x1': np.random.randn(100),
    'x2': np.random.randn(100),
    'x3': np.random.randn(100),
    'x4': np.random.randn(100),
    'y': np.random.randn(100)
})

# Three treatment variables; the remaining columns x1-x4 are used as X.
obj_dml_data_from_df = DoubleMLData(test1, 'y', ['d1', 'd2', 'd3'])

ml_l = XGBRegressor(random_state=0)
ml_m = XGBRegressor(random_state=0)

dml_plr_obj = dml.DoubleMLPLR(obj_dml_data_from_df, ml_l, ml_m).fit(store_models=True)

print(dml_plr_obj.predictions['ml_l'].shape)
print(dml_plr_obj.predictions['ml_m'].shape)
print(dml_plr_obj.models['ml_l']['d1'][0][0].feature_importances_.shape)
print(dml_plr_obj.models['ml_m']['d1'][0][0].feature_importances_.shape)

Expected Result

(100, 1, 1)
(100, 1, 3)
(4,)
(4,)

Actual Result

(100, 1, 3)
(100, 1, 3)
(6,)
(6,)

Versions

Linux-5.4.0-150-generic-x86_64-with-glibc2.27
Python 3.10.9 (main, Jan 11 2023, 15:21:40) [GCC 11.2.0]
DoubleML 0.7.1
Scikit-Learn 1.0.2

This is intended behavior: with multiple treatments, the model treats the remaining treatment variables as additional features when estimating each coefficient.
The partially linear model assumes the following form for a single treatment
$$Y=\theta_0 D + g_0(X) + \epsilon$$
which would generally extend to
$$Y=\theta_{0,1} D_1 + \theta_{0,2} D_2 + \theta_{0,3} D_3 + g_0(X) + \epsilon$$
for three treatments. Considering only the estimation of $\theta_{0,1}$, one could rewrite this as
$$Y=\theta_{0,1} D_1 + \tilde{g}_0(\tilde{X}) + \epsilon$$

with
$$\theta_{0,2} D_2 + \theta_{0,3} D_3 + g_0(X) =: \tilde{g}_0(\tilde{X}).$$

Then we have to fit the conditional expectation $\mathbb{E}[Y|\tilde{X}]$ for the learner ml_l, where $\tilde{X} = (D_2, D_3, X)$.
Therefore ml_l depends on $6$ features (the two remaining treatments plus the four variables in $X$) instead of $4$. The same holds true for the other treatments, so ml_l is fit once per treatment and the prediction array has a third dimension of size $3$.
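
To make this concrete, here is a minimal sketch (not the library's internal code, just an illustration using the column layout from the snippet above) that fits ml_l by hand for the first treatment: the remaining treatments d2 and d3 are stacked with the four controls, so the learner sees $6$ features.

import numpy as np
import pandas as pd
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 8)),
                  columns=['d1', 'd2', 'd3', 'x1', 'x2', 'x3', 'x4', 'y'])

# For theta_{0,1}, the remaining treatments d2, d3 join the controls:
# X_tilde = (d2, d3, x1, x2, x3, x4), i.e. 6 features in total.
X_tilde = df[['d2', 'd3', 'x1', 'x2', 'x3', 'x4']]

ml_l = XGBRegressor(random_state=0)
ml_l.fit(X_tilde, df['y'])  # approximates E[Y | X_tilde]

print(ml_l.feature_importances_.shape)  # (6,), matching the reported shape

Repeating this with (d1, d3) or (d1, d2) as the extra columns gives the analogous nuisance fit for d2 or d3, which is why the prediction arrays carry one slice per treatment in their last dimension.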

I will close this issue since this is intended behavior.