Synthetic control has problems when the time series of interest is higher than the others
We'll build off the Synthetic Control example. Using the "sc" dataset, you can create a new "actual" column by adding 20 to the existing one:
import causalpy as cp

df_example = cp.load_data("sc")
treatment_time = 70
df_example["actualplus20"] = df_example["actual"] + 20
So this new series has a greater value than any other series at every point in the dataset.
Following the example:
result_ex = cp.pymc_experiments.SyntheticControl(
    df_example,
    treatment_time=treatment_time,
    formula="actualplus20 ~ 0 + a + b + c + d + e + f + g",
    model=cp.pymc_models.WeightedSumFitter(
        sample_kwargs={"target_accept": 0.95}
    ),
)
This yields a model fit that is consistently below the actuals. The scikit-learn version gives a similar result.
Hi @vishalthatsme. Yes, this is known / partially intentional behaviour. Because the synthetic control is modelled as a weighted sum of the control units, and the weights are non-negative and sum to 1, that particular model is constrained to 'interpolate'. That is, this kind of weighted sum can only produce synthetic controls that lie within the bounds of the control units.
When the target unit is outside the bounds of the control units (above or below) then we would want the sum of the weights to be either above or below 1, respectively.
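To illustrate the interpolation constraint, here is a minimal NumPy sketch. It uses made-up control series rather than the "sc" dataset, and it assumes the weights are non-negative as well as summing to 1; the names `controls`, `weights`, and `target` are hypothetical and not part of CausalPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical control units: 100 time points, 7 units (stand-ins for a..g).
controls = rng.normal(loc=10.0, scale=1.0, size=(100, 7))

# One set of weights that are non-negative and sum to 1.
weights = rng.dirichlet(np.ones(7))

# The synthetic control is the weighted sum of the control units.
synthetic = controls @ weights

# Such a weighted sum stays within the range of the controls at every time point.
lower = controls.min(axis=1)
upper = controls.max(axis=1)
tol = 1e-12  # guard against floating-point rounding
inside = bool(np.all((synthetic >= lower - tol) & (synthetic <= upper + tol)))

# A target that sits 20 above the highest control can never be reached.
target = upper + 20.0
gap = target - synthetic

print("synthetic within control bounds:", inside)
print("smallest shortfall:", gap.min())
```

Whatever weights are drawn, the shortfall is at least 20, which matches the fit sitting consistently below the shifted series. Closing that gap would need weights that sum to more than 1, or an intercept term.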
I believe the recommendation is that caution should be used when extrapolating beyond the bounds of the control units, but perhaps we should enable this functionality anyway.
I'll create a specific feature request issue.
Thanks @drbenvincent! I see that now here (https://matheusfacure.github.io/python-causality-handbook/15-Synthetic-Control.html#don-t-extrapolate), I'll need to spend more time reading up on this.