DoubleML/doubleml-for-py

[Bug]: Multi-treatment data creation bug

AlejandroTL opened this issue · 2 comments

Describe the bug

If one wants to create a data object for a multi-treatment problem, in which a one-dimensional parameter theta_j is estimated for each treatment j while the remaining treatments (all except j) are included with the set of covariates X, an error is raised that suggests using the option use_other_treat_as_covariate, even though it is True by default.

Minimum reproducible code snippet

import numpy as np
import pandas as pd
import doubleml as dml

from doubleml.datasets import fetch_401K
data = fetch_401K(return_type='DataFrame')

dtypes = data.dtypes
dtypes['nifa'] = 'float64'
dtypes['net_tfa'] = 'float64'
dtypes['tw'] = 'float64'
dtypes['inc'] = 'float64'
data = data.astype(dtypes)

features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
                 'twoearn', 'db', 'pira', 'hown']

# Initialize DoubleMLData (data-backend of DoubleML)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='net_tfa',
                                 d_cols=['e401', 'pira'],
                                 x_cols=features_base,
                                 use_other_treat_as_covariate=True)

Expected Result

I would expect a successful creation of the data object.

Actual Result


ValueError Traceback (most recent call last)
Cell In[6], line 5
1 features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
2 'twoearn', 'db', 'pira', 'hown']
4 # Initialize DoubleMLData (data-backend of DoubleML)
----> 5 data_dml_base = dml.DoubleMLData(data,
6 y_col='net_tfa',
7 d_cols=['e401', 'pira'],
8 x_cols=features_base,
9 use_other_treat_as_covariate=True)

File ~/first_env/lib/python3.8/site-packages/doubleml/double_ml_data.py:151, in DoubleMLData.__init__(self, data, y_col, d_cols, x_cols, z_cols, t_col, use_other_treat_as_covariate, force_all_x_finite)
149 self.t_col = t_col
150 self.x_cols = x_cols
--> 151 self._check_disjoint_sets_y_d_x_z_t()
152 self.use_other_treat_as_covariate = use_other_treat_as_covariate
153 self.force_all_x_finite = force_all_x_finite

File ~/first_env/lib/python3.8/site-packages/doubleml/double_ml_data.py:634, in DoubleMLData._check_disjoint_sets_y_d_x_z_t(self)
631 # note that the line xd_list = self.x_cols + self.d_cols in method set_x_d needs adaption if an intersection of
632 # x_cols and d_cols as allowed (see https://github.com/DoubleML/doubleml-for-py/issues/83)
633 if not d_cols_set.isdisjoint(x_cols_set):
--> 634 raise ValueError('At least one variable/column is set as treatment variable (d_cols) and as covariate'
635 '(x_cols). Consider using parameter use_other_treat_as_covariate.')
637 if self.z_cols is not None:
638 z_cols_set = set(self.z_cols)

ValueError: At least one variable/column is set as treatment variable (d_cols) and as covariate(x_cols). Consider using parameter use_other_treat_as_covariate.

Versions

Linux-4.15.0-194-generic-x86_64-with-glibc2.17
Python 3.8.2 (default, Feb 26 2020, 14:31:49)
[GCC 6.3.0 20170516]
DoubleML 0.6.1
Scikit-Learn 1.2.2

Thank you for reporting this.
To fix your issue, you simply have to exclude all treatment variables from the covariates. Here that means removing 'pira' from features_base:

import numpy as np
import pandas as pd
import doubleml as dml

from doubleml.datasets import fetch_401K
data = fetch_401K(return_type='DataFrame')

dtypes = data.dtypes
dtypes['nifa'] = 'float64'
dtypes['net_tfa'] = 'float64'
dtypes['tw'] = 'float64'
dtypes['inc'] = 'float64'
data = data.astype(dtypes)

features_base = ['age', 'inc', 'educ', 'fsize', 'marr',
                 'twoearn', 'db', 'hown']

# Initialize DoubleMLData (data-backend of DoubleML)
data_dml_base = dml.DoubleMLData(data,
                                 y_col='net_tfa',
                                 d_cols=['e401', 'pira'],
                                 x_cols=features_base,
                                 use_other_treat_as_covariate=True)
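
If the covariate list is built elsewhere and may contain treatment columns, the exclusion can also be done programmatically before constructing the data object. The following is a minimal sketch in plain Python; the helper name drop_treatments is hypothetical and not part of DoubleML.

```python
# Hypothetical helper (not part of DoubleML): remove every treatment
# column from a candidate covariate list, preserving the original order.
def drop_treatments(candidate_x_cols, d_cols):
    d_set = set(d_cols)
    return [col for col in candidate_x_cols if col not in d_set]


# Covariate list from the original report, which still contains 'pira'
features_with_overlap = ['age', 'inc', 'educ', 'fsize', 'marr',
                         'twoearn', 'db', 'pira', 'hown']
d_cols = ['e401', 'pira']

features_base = drop_treatments(features_with_overlap, d_cols)
print(features_base)
```

The resulting features_base can then be passed as x_cols together with d_cols, so that the two sets are disjoint.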

The option use_other_treat_as_covariate controls how treatment and covariates are defined internally: for each treatment variable, the covariates are adjusted so that the remaining treatments are included. See

self._dml_data.set_x_d(self._dml_data.d_cols[i_d])
and
def set_x_d(self, treatment_var):
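
As a rough illustration of the idea, the per-treatment covariate sets can be sketched in plain Python. This is a simplified sketch based on the description in this thread and on the source comment quoted in the traceback (xd_list = self.x_cols + self.d_cols); it is not the library's actual implementation, and the function name covariates_for_treatment is hypothetical.

```python
# Simplified sketch, NOT DoubleML's implementation: which columns act as
# covariates when a given treatment variable is the active one.
def covariates_for_treatment(x_cols, d_cols, treatment_var,
                             use_other_treat_as_covariate=True):
    if treatment_var not in d_cols:
        raise ValueError(f'{treatment_var} is not one of the treatment columns')
    if use_other_treat_as_covariate:
        # the remaining treatments are appended to the base covariates
        candidates = x_cols + d_cols
    else:
        candidates = list(x_cols)
    return [col for col in candidates if col != treatment_var]


x_cols = ['age', 'inc', 'educ', 'fsize', 'marr', 'twoearn', 'db', 'hown']
d_cols = ['e401', 'pira']

for d in d_cols:
    print(d, covariates_for_treatment(x_cols, d_cols, d))
```

Under this sketch, the active treatment 'e401' gets the eight base features plus 'pira' as covariates, and vice versa for 'pira'. This is why x_cols and d_cols have to be disjoint in the first place.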

To show that this is working as intended, you can extend your example with

from sklearn.linear_model import LinearRegression
dml_plr_obj = dml.DoubleMLPLR(data_dml_base, ml_l=LinearRegression(), ml_m=LinearRegression())
dml_plr_obj.fit(store_models=True)

# consider the model for the outcome with treatment 'e401', first repetition, first fold
single_linear_model = dml_plr_obj.models['ml_l']['e401'][0][0]
print(single_linear_model.n_features_in_)

This should return 9: the eight base features plus 'pira', which is used as a covariate for the treatment 'e401'.
Further, if you want to check this exactly, you can verify the predictions (note that you can only compare the out-of-sample predictions from the cross-fitting):

# create the corresponding subsample
subsample = dml_plr_obj.smpls[0][0][1]
df_subsample = data[features_base + ['pira']].iloc[subsample]
print(single_linear_model.predict(df_subsample))
print(dml_plr_obj.predictions['ml_l'][:, 0, 0][subsample])

Both predictions should be equal.

I hope this solves your issue. I will close this issue now.