[Bug]: Multi-treatment data creation bug
AlejandroTL opened this issue · 2 comments
Describe the bug
If one wants to create a data object for a multi-treatment problem, in which a 1-dimensional parameter theta_j for treatment j is estimated one at a time while the remaining treatments /j are included together with the set of covariates X, DoubleMLData raises an error suggesting the option use_other_treat_as_covariate, even though it is True by default.
Minimum reproducible code snippet
import numpy as np
import pandas as pd
import doubleml as dml
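from doubleml.datasets import fetch_401K
# Load the 401(k) data set; without this line `data` is undefined below
data = fetch_401K(return_type='DataFrame')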
dtypes = data.dtypes
dtypes['nifa'] = 'float64'
dtypes['net_tfa'] = 'float64'
dtypes['tw'] = 'float64'
dtypes['inc'] = 'float64'
data = data.astype(dtypes)
features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
                 'twoearn', 'db', 'pira', 'hown']
# Initialize DoubleMLData (data-backend of DoubleML)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='net_tfa',
                                 d_cols=['e401', 'pira'],
                                 x_cols=features_base,
                                 use_other_treat_as_covariate=True)
Expected Result
I would expect a successful creation of the data object.
Actual Result
ValueError Traceback (most recent call last)
Cell In[6], line 5
1 features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
2 'twoearn', 'db', 'pira', 'hown']
4 # Initialize DoubleMLData (data-backend of DoubleML)
----> 5 data_dml_base = dml.DoubleMLData(data,
6 y_col='net_tfa',
7 d_cols=['e401', 'pira'],
8 x_cols=features_base,
9 use_other_treat_as_covariate=True)
File ~/first_env/lib/python3.8/site-packages/doubleml/double_ml_data.py:151, in DoubleMLData.__init__(self, data, y_col, d_cols, x_cols, z_cols, t_col, use_other_treat_as_covariate, force_all_x_finite)
149 self.t_col = t_col
150 self.x_cols = x_cols
--> 151 self._check_disjoint_sets_y_d_x_z_t()
152 self.use_other_treat_as_covariate = use_other_treat_as_covariate
153 self.force_all_x_finite = force_all_x_finite
File ~/first_env/lib/python3.8/site-packages/doubleml/double_ml_data.py:634, in DoubleMLData._check_disjoint_sets_y_d_x_z_t(self)
631 # note that the line xd_list = self.x_cols + self.d_cols in method set_x_d needs adaption if an intersection of
632 # x_cols and d_cols as allowed (see https://github.com/DoubleML/doubleml-for-py/issues/83)
633 if not d_cols_set.isdisjoint(x_cols_set):
--> 634 raise ValueError('At least one variable/column is set as treatment variable (d_cols) and as covariate'
635 '(x_cols). Consider using parameter use_other_treat_as_covariate.')
637 if self.z_cols is not None:
638 z_cols_set = set(self.z_cols)
ValueError: At least one variable/column is set as treatment variable (d_cols) and as covariate(x_cols). Consider using parameter use_other_treat_as_covariate.
Versions
Linux-4.15.0-194-generic-x86_64-with-glibc2.17
Python 3.8.2 (default, Feb 26 2020, 14:31:49)
[GCC 6.3.0 20170516]
DoubleML 0.6.1
Scikit-Learn 1.2.2
Thank you for reporting this.
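To fix your issue, you simply have to exclude all treatment variables from the covariates. In your snippet, 'pira' is listed in both d_cols and x_cols, so it has to be dropped from features_base: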
import numpy as np
import pandas as pd
import doubleml as dml
from doubleml.datasets import fetch_401K
data = fetch_401K(return_type='DataFrame')
dtypes = data.dtypes
dtypes['nifa'] = 'float64'
dtypes['net_tfa'] = 'float64'
dtypes['tw'] = 'float64'
dtypes['inc'] = 'float64'
data = data.astype(dtypes)
features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
                 'twoearn', 'db', 'hown']
# Initialize DoubleMLData (data-backend of DoubleML)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='net_tfa',
                                 d_cols=['e401', 'pira'],
                                 x_cols=features_base,
                                 use_other_treat_as_covariate=True)
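If the covariate list is built programmatically, the same fix can be sketched as a small filter applied before the data object is created. The helper name drop_treatments_from_covariates is hypothetical and not part of DoubleML; it only enforces the requirement from the error message that d_cols and x_cols be disjoint.

```python
def drop_treatments_from_covariates(x_cols, d_cols):
    """Return x_cols without any column that is also a treatment (order preserved)."""
    d_cols_set = set(d_cols)
    return [col for col in x_cols if col not in d_cols_set]


# Column lists from the original report: 'pira' appears in both.
d_cols = ['e401', 'pira']
features_raw = ['age', 'inc', 'educ', 'fsize', 'marr',
                'twoearn', 'db', 'pira', 'hown']

# Columns that would trigger the ValueError
overlap = sorted(set(d_cols) & set(features_raw))
# Covariates that are safe to pass as x_cols
features_base = drop_treatments_from_covariates(features_raw, d_cols)

print(overlap)
print(features_base)
```

The filtered list can then be passed as x_cols, with d_cols left unchanged.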
The option use_other_treat_as_covariate is used internally to define the treatment and the covariates. For each treatment, the covariates are adjusted so that the remaining treatment variables are added to them; see
doubleml-for-py/doubleml/double_ml.py, line 514 in 4ced5e3
doubleml-for-py/doubleml/double_ml_data.py, line 570 in 4ced5e3
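The per-treatment adjustment can be pictured roughly as follows. This is a simplified illustration and not the library's actual code: the function covariates_for_treatment is a hypothetical name, and the column order (base covariates first, then the other treatments) is an assumption based on the xd_list = self.x_cols + self.d_cols comment visible in the traceback.

```python
def covariates_for_treatment(x_cols, d_cols, treatment,
                             use_other_treat_as_covariate=True):
    """Sketch: columns used as covariates while `treatment` is the active treatment."""
    if treatment not in d_cols:
        raise ValueError(f"{treatment!r} is not one of the treatment columns")
    if not use_other_treat_as_covariate:
        # Assumed behaviour: only the base covariates are used
        return list(x_cols)
    # All other treatments are appended to the base covariates
    other_treatments = [d for d in d_cols if d != treatment]
    return list(x_cols) + other_treatments


x_cols = ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'hown']
d_cols = ['e401', 'pira']

for treatment in d_cols:
    cols = covariates_for_treatment(x_cols, d_cols, treatment)
    print(treatment, len(cols), cols)
```

With eight base covariates and two treatments, each treatment is paired with nine covariate columns in this sketch, which is why x_cols itself must not contain any treatment column.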
To show that this is working as intended, you can extend your example with
from sklearn.linear_model import LinearRegression
dml_plr_obj = dml.DoubleMLPLR(data_dml_base, ml_l=LinearRegression(), ml_m=LinearRegression())
dml_plr_obj.fit(store_models=True)
# consider the model for the outcome with treatment 'e401', first repetition, first fold
single_linear_model = dml_plr_obj.models['ml_l']['e401'][0][0]
print(single_linear_model.n_features_in_)
This should return 9: the eight base features plus 'pira', which is used as a covariate for the treatment 'e401'.
Further, if you want to check this exactly, you can verify the predictions (note that you can only compare the out-of-sample predictions from the cross-fitting):
# create the corresponding subsample
subsample = dml_plr_obj.smpls[0][0][1]
df_subsample = data[features_base + ['pira']].iloc[subsample]
print(single_linear_model.predict(df_subsample))
print(dml_plr_obj.predictions['ml_l'][:, 0, 0][subsample])
Both predictions should be equal.
I hope this solves your issue. I will close this issue now.