Error of crossfit folds splits with DynamicDML
juandavidgutier opened this issue · 2 comments
Hi,
I am estimating the effect of high levels of particulate matter (PM2.5) on excess deaths, using daily panel data for 25 municipalities. My treatment is binary: T=1 when the PM2.5 level is high and T=0 when it is low. The outcome is also binary: Y=1 for excess deaths and Y=0 otherwise.
I am using the class DynamicDML to fit my model, but I get this error message: "AttributeError: Provided crossfit folds contain training splits that don't contain all treatments". However, 50% of the observations have T=1, which I expected to be enough to obtain balanced crossfit folds.
Here is my code with econml version 0.15 and dowhy version 0.10.1
dataset_pm_deaths.csv
```python
import dowhy
import econml
from dowhy import CausalModel
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
import scipy.stats as stats
from itertools import product
from econml.utilities import WeightedModelWrapper
from sklearn.model_selection import train_test_split
from econml.panel.dml import DynamicDML

data_all = pd.read_csv("D:/dataset_pm_deaths.csv")
# .copy() so the column assignments below modify an independent frame
data = data_all[data_all['Year'] >= 2009].copy()

# Binarize the treatment at the median: 1 = high PM2.5, 0 = low PM2.5
median_pm25 = data['PM25'].median()
data['PM25'] = (data['PM25'] >= median_pm25).astype(int)

# Standardize the pollutant covariates
for col in ['BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4']:
    data[col] = stats.zscore(data[col], nan_policy='omit')

data0 = data[['excess', 'PM25', 'cod_munici',
              'BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature', 'lead1_PM25']]
data0 = data0.dropna()

Y = data0.excess.to_numpy()
T = data0.PM25.to_numpy()
percentage_high_PM25 = np.mean(T == 1) * 100  # share of rows with T=1
W = data0[['BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature']].to_numpy().reshape(-1, 7)
X = data0[['Temperature', 'lead1_PM25']].to_numpy().reshape(-1, 2)
groups = data0.cod_munici.to_numpy()  # one group per municipality

estimate0 = DynamicDML(discrete_treatment=True,
                       featurizer=PolynomialFeatures(degree=3),
                       linear_first_stages=False, cv=3, random_state=123)
estimate0.fit(Y=Y, T=T, X=X, W=W, inference='auto', groups=groups)  # HERE IS THE ERROR
```
Have you tried passing a `StratifiedKFold` object as `cv`, or creating your own CV splitter? That could help you out in the meantime.
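As a rough illustration of the "own splitter" idea, here is a minimal plain-Python sketch that keeps each group in a single fold and then reports any fold whose training split is missing a treatment level. The helper names `group_folds` and `folds_missing_treatments` and the toy data are hypothetical, not part of econml. Whether the resulting list of `(train, test)` index pairs can be passed directly as `cv=` to `DynamicDML`, and whether the class imposes further requirements on how rows are ordered within groups, are assumptions to check against the econml documentation.

```python
from collections import defaultdict


def group_folds(groups, n_splits=3):
    """Assign whole groups to folds round-robin; return [(train_idx, test_idx), ...]."""
    unique_groups = sorted(set(groups))
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}

    # Collect the row indices that land in each fold's test split
    test_sets = defaultdict(list)
    for idx, g in enumerate(groups):
        test_sets[fold_of_group[g]].append(idx)

    all_idx = set(range(len(groups)))
    folds = []
    for k in range(n_splits):
        test_idx = test_sets[k]
        train_idx = sorted(all_idx - set(test_idx))
        folds.append((train_idx, test_idx))
    return folds


def folds_missing_treatments(folds, T):
    """Return the fold numbers whose training split lacks some treatment level."""
    levels = set(T)
    return [k for k, (train_idx, _) in enumerate(folds)
            if {T[i] for i in train_idx} != levels]


# Toy panel (hypothetical): 6 units observed for 4 periods each
groups = [u for u in range(6) for _ in range(4)]
T = [(u + t) % 2 for u in range(6) for t in range(4)]

folds = group_folds(groups, n_splits=3)
bad = folds_missing_treatments(folds, T)

# Degenerate case: only units 0 and 3 are ever treated, and both sit in fold 0
T_degenerate = [1 if u in (0, 3) else 0 for u in range(6) for _ in range(4)]
bad_degenerate = folds_missing_treatments(folds, T_degenerate)
```

Running the same check on the real `T` and `groups` arrays would show whether some training split really does end up with a single treatment level, even though the overall share of T=1 is 50%.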
Hi @TimCosemans
Thanks for your suggestions!