pymc-labs/CausalPy

Get model results data

Opened this issue ยท 13 comments

Is there a way to access the synthetic control fitted values as a pandas series or list? I need to create my own custom plots in a specific style, so I need access to the data for the fitted values and the confidence interval.

Hi. Yes this is definitely possible. You can access the raw MCMC samples in result.idata. Sounds like you'll want the posterior predictive samples. You can then calculate the mean and use arviz to calculate the HDI's then export to a csv for example.

Is that enough to go on, or are you requesting this as a utility function?

Thank you for the answer. I'm sorry for asking basic questions, I'm not very experienced in programming. From what I understand, the fitted model data is in posterior_predictive.y_hat? But then I get all the MCMC samples? The one that is plotted by default is the last one? What about the credible interval that is also plotted? How can I access it? Essentially I'm trying to recreate the first plot that appears in the default result.plot method, which is the only one I need, but I need to change the style for a specific one, which is why i'm trying to get the data.

No problem. I'll consider this a feature request and see if I (or someone else) can have a go at implementing it.

Thank you very much. That would be greatly helpful

Hello there! I have been lurking around here for some time now and I feel this could be my time to shine! I would very much like to contribute and, if I'm not mistaken, this could be an easy enough task to tackle as a first one.

In any case, I'll try and think about the proper way to do so but please let me know should you have any suggestion @drbenvincent. Thank you!

Sounds great @lpoug

I think it would be really simple to do on an experiment by experiment basic. But it's worth thinking if we can come up with a generic method that would work across all experiments. If it's doable we should try to do that.

What do you think? Happy to discuss more, but thought I'd reply quickly to keep the ball rolling.

Hey @drbenvincent!

I have given some thoughts about this, here they are.

I'm wondering if there is a use for such a functionality for all experiments. I can clearly see it for the prepostfit experiments as I have encountered the same issue when doing synthetic control (i.e. to reproduce the graphs in the Making Inference part of this chapter of Causal Inference for the Brave and True. I'm not so sure about the other ones and, to be honest, my knowledge of the other experiments is not as sharp as the one for SC.

With that being said:

  • If there is only a use for the PrePost Fit experiment, is it worth implementing it instead of writing a couple lines of code to retrieve the data?
    Knowing that the data already exists in self.pre_pred, self.post_pred, self.pre_impact, self.post_impact, etc.
  • If yes, should the function only exists in the PrePostFit class, with a conditional execution based on isinstance(self.model, PyMCModel) and isinstance(self.model, RegressorMixin)? Or...
  • Should it be implemented following the same global logic that applies to the .plot() function? I.e. .plot() that dispaches to either .bayesian_plot() or ols_plot() in base.py, then specific functions in experiments classes.

Hope this was clear enough, let me know what you think! In the mean time, I'll try to dive deeper into the other experiments.

Thanks @lpoug
I think because we've only had one request for this so far, we can probably just implement for the PrePostFit class, so synthetic control and interrupted time series.

And perhaps right now we could just create a utility function (that lives in plot_utils.py) rather than adding it in as a method. So the API would be something like:

from causalpy import export_plot_data


export_plot_data(result, target_filepath)

I guess we can add some checks on the inputs being provided. Looks like all the data needed for reproducing those plots could be saved in a csv file. The only exception would be the intervention date. So it could be worth considering exporting a yaml file which can do nested data and lists etc. But totally happy to leave those kind of details to you - that's just a thought I had.

Or, rather than saving, the API could be

intervention_date, plot_data = export_plot_data(result)

Maybe that's a simpler place to start and addresses @Guilherme-dL 's original request.

Thanks for the suggestions @drbenvincent!

I have done all the steps in the CONTRIBUTING.md to set up my local environment without any trouble, which is already a success for me ๐Ÿ˜„

Some thoughts on your suggestions:
I do think that the better usage would be to assign the function to a plot_data variable rather than directly saving it to a .csv. At least this has been my case. The function could return a pandas dataframe that is basically the dataframe provided in the data argument + new assigned columns predictions and impact. I might be biased so happy to be challengend on this!

With respect to the intervention date, I think that in most cases it already exists in a variable within the notebook or script (and if not, it is quite easily accessible in self.treatment_time, so it might be redundant to return it through the function.

Finally, regarding the location of the function, tbh I'm happy to follow your suggestion at this point. That being said, I'll start working on it to see if my opinion evolves.

Sounds good. All I'd say is that the function name shouldn't start with plot, it should really be a verb that describes what the function does. So something like get_plot_data?

lpoug commented

Hi @drbenvincent!

You can find below what I have done so far. I do have a couple more questions before opening a PR if that's ok with you. In fact, I have tried implementing the function in plot_utils.py and directly within the PrePostFit class. In both cases, the function returns the dataset initially provided in the experiment (available through self.datapre and self.datapost) with 2 additional columns prediction and impact.

  1. In plot_utils.py
def get_prepostfit_data(result) -> pd.DataFrame:
    """
    Utility function to recover the data of a PrePostFit experiment along with the prediction and causal impact information.

    :param result:
        The result of a PrePostFit experiment
    """

    from causalpy.experiments.prepostfit import PrePostFit
    from causalpy.pymc_models import PyMCModel

    if isinstance(result, PrePostFit):
        pre_data = result.datapre.copy()
        post_data = result.datapost.copy()

        if isinstance(result.model, PyMCModel):
            pre_data["prediction"] = (
                az.extract(
                    result.pre_pred, group="posterior_predictive", var_names="mu"
                )
                .mean("sample")
                .values
            )
            post_data["prediction"] = (
                az.extract(
                    result.post_pred, group="posterior_predictive", var_names="mu"
                )
                .mean("sample")
                .values
            )
            pre_data["impact"] = result.pre_impact.mean(dim=["chain", "draw"]).values
            post_data["impact"] = result.post_impact.mean(dim=["chain", "draw"]).values

        elif isinstance(result.model, RegressorMixin):
            pre_data["prediction"] = result.pre_pred
            post_data["prediction"] = result.post_pred
            pre_data["impact"] = result.pre_impact
            post_data["impact"] = result.post_impact

        else:
            raise ValueError("Other model types are not supported")

        ppf_data = pd.concat([pre_data, post_data])

    else:
        raise ValueError("Other experiments are not supported")

    return ppf_data
  1. Within the PrePostFit class
    def get_plot_data(self) -> pd.DataFrame:
        """Recover the data of a PrePostFit experiment along with the prediction and causal impact information.

        Internally, this function dispatches to either `get_plot_data_bayesian` or `get_plot_data_ols`
        depending on the model type.
        """
        if isinstance(self.model, PyMCModel):
            return self.get_plot_data_bayesian()
        elif isinstance(self.model, RegressorMixin):
            return self.get_plot_data_ols()
        else:
            raise ValueError("Unsupported model type")

    def get_plot_data_bayesian(self) -> pd.DataFrame:
        """
        Recover the data of a PrePostFit experiment along with the prediction and causal impact information.
        """
        if isinstance(self.model, PyMCModel):
            pre_data = self.datapre.copy()
            post_data = self.datapost.copy()
            pre_data["prediction"] = (
                az.extract(
                    self.pre_pred, group="posterior_predictive", var_names="mu"
                )
                .mean("sample")
                .values
            )
            post_data["prediction"] = (
                az.extract(
                    self.post_pred, group="posterior_predictive", var_names="mu"
                )
                .mean("sample")
                .values
            )
            pre_data["impact"] = self.pre_impact.mean(dim=["chain", "draw"]).values
            post_data["impact"] = self.post_impact.mean(dim=["chain", "draw"]).values
            
            self.data_plot = pd.concat([pre_data, post_data])

            return self.data_plot
        else:
            raise ValueError("Unsupported model type")
    
    def get_plot_data_ols(self) -> pd.DataFrame:
        """
        Recover the data of a PrePostFit experiment along with the prediction and causal impact information.
        """
        pre_data = self.datapre.copy()
        post_data = self.datapost.copy()
        pre_data["prediction"] = self.pre_pred
        post_data["prediction"] = self.post_pred
        pre_data["impact"] = self.pre_impact
        post_data["impact"] = self.post_impact
        self.data_plot = pd.concat([pre_data, post_data])

        return self.data_plot

I think there is a point to be made to have the function within the PrePostFit class in order to avoid any confusion as to its potential use for other experiments' classes.
What do you think?
If so, I tried to reproduce the logic of a global function that dispatches to either a bayesian or ols version (similar to plot()).
Maybe it's too much and a single function get_plot_data() that handles both cases would be sufficient for that particular purpose?

Let me know what you think when you have some time ๐Ÿ˜ƒ

Hi @lpoug

I think having separate get_plot_data_ols and get_plot_data_bayesian functions make sense. The underlying data is stored/represented in different ways, so has to be handled differently.

The get_plot_data_bayesian might also want to get the HDI's, but maybe that was on your roadmap.

I think I agree, having get_plot_data in the PrePostFit class might be better. It could be good to have a corresponding method in the BaseExperiment class which by default gives a NotImplementedError. That would then be overridden by what you define in the PrePostFit.