arviz-devs/arviz

Add unconstrained/transformed variables to InferenceData

ColCarroll opened this issue · 12 comments

@ahartikainen suggested adding unconstrained variables to the schema somewhere. It seems like they could go

  1. Make new unconstrained_posterior group
    Good:
    • Explicit is better than implicit: it is clear what this group is for
      Bad:
    • Should there also be unconstrained_prior?
  2. Add them to the existing posterior group
    Good:
    • Can reuse existing coords/dims
    • Can add to prior in the same way
      Bad:
    • Would need a naming convention, which we've managed to avoid so far
    • PyMC3 currently does this, and it is kind of a pain to have a flag everywhere to include or exclude these variables.
  3. Don't include them at all
    Good:
    • No work :)
      Bad:
    • Bounds and domain information are not captured anywhere currently

I like option 3! :-)

If the main purpose of including unconstrained variables is to explore things like divergences, then option 1 becomes the best option as there is no motivation (other than consistency) to add unconstrained_prior.

BTW, shouldn't the bounds and domain information be part of a variable type (or something similar), currently we only have this for discrete and continuous variables (because we have integers and floats types). This will be super useful for circular variables as we can automatically compute proper statistics and plots for such variables.

I think option 2 could be the easiest. If we add an option to transform everything to unconstrained space if a user wants it. That way we do not really need to add anything to the schema. Unless there is a notion on metadata somewhere.

I don't think users normally would need these values.

Vote for option 2 for the convenience of backward compatibility with pymc3 😬

dfm commented

I just came across this because I often find myself wanting access to the transformed variables in a PyMC3 trace. I often need this in order to evaluate model components after sampling (for example, I often want to plot the model prediction on a fine grid that is too large/memory intensive to save as a deterministic, but is fine to compute at a few points in a thinned chain). Right now, it looks like it would be easy enough to allow opt-in support for PyMC3 here:

var_names = self.pymc3.util.get_default_varnames(
self.trace.varnames, include_transformed=False
)

since include_transformed is hard-coded to False. I'd be keen to make this a parameter of the PyMC3Converter to at least have it as an option (this would be option 2 from above). Any feelings about this?

hey dan! I think I'm generally on team option 3 still, but also on team "lets help out science". So I'm totally on board with you adding a little backdoor option to allow this on pymc3, and if you do, maybe I can open a second issue requesting help to add it to various other backends (or even rethinking the design)?

Do we want transformed vars into the main idata or a new idata containing transformed data?

Or new groups?

I think if present they should be in the posterior, I have done this for now whenever I need to transform variables, and we have some examples on idata methods like map that transform variables post sampling and store them in the same group the original variable was (some transforms are also rescaling observations+posterior predictive at the same time, not only posterior vars)

I would also like to see a way to get the transformed variables in the trace. The fact that this isn't possible right now is pretty much the only reason I still use the original trace object of pymc3 quite a bit. I think I might even prefer if one of option 1 or 2 was the default behavior.
The most common use cases I can see would be

  • When diagnosing convergence issues the untransformed space often doesn't tell you much. If the transformation is just a log transform, that is easy to fix, but say for a Dirichlet this isn't the case anymore. I think we should always think in the transformed space when making scatter plots of the posteriors, looking for funnels, correlations etc.
  • Anything that uses the logp has to also take into account which space we are talking about. The logp that is stored in the trace is with respect to the unconstrained space, so it doesn't even make much sense when you don't know what this is.
  • If we interact with the model a bit more, because for example we want to also look at the gradients of the logp (this can tell us something about correlations), or if we want to compute sensitivities with respect to some other variable, we need to interact with the aesara graph of the logp function. This is defined in terms of the transformed variables though. Getting at the gradients is easy with the original trace:
func = model.logp_dlogp_function()
func.set_extra_values({})
for point in trace.points():
    logp, grad = func(func.dict_to_array(point))

This is a lot more work if if we do not have the pymc3 trace, because we first have to manually map the values to the transformed space.

I don't generally use unconstrained variables, but I think we should reach a decision on this and imo option 3 which is what we defaulted after not reaching a consensus is generating too much friction and extra work (like the complications in #2040 cc @karink520) to be a viable option.

I think someone should write a proposal and call for a vote. I offer to write the proposal at some point next week if nobody can do it before then.

If I do have to write the proposal, as I am not very familiar with the issue it would be great to know what PyMC and Stan do on this and what are the general expectations on this:

  • Do both Stan and PyMC allow returning transformed variables in the output with some "addendum" to the variable name?
  • I think most users don't care about this. So we'd want to exclude those by default. Is that something that you generally do on a model/run basis (i.e. we'd default to exclude those in the conversion directly unless requested) or something you generally decide on a plot/summary basis?

also cc @sethaxen and @mitzimorris who might be interested in the discussion

regarding the question, does Stan return variables on the unconstrained scale, in CmdStan, sampler argument 'diagnostic_file' is used to report the model variables on the unconstrained scale.
because this is its own CSV file, the variable names remain the same.

does this answer the above question?

having access to the parameters on the unconstrained scale is useful if you're interested in the posterior geometry and how well an algorithm can deal with it.

One can access the unconstrained samples with pystan, but it needs to be done explicitly.

the underlying mechanism here is calling the Stan model class member function log_prob to get the unconstrained parameter values.