google-research/torchsde

example/latent_sde_lorenz different from what's written in the paper?

stebuechho opened this issue · 7 comments

Hi there,

first of all thank you for your work! This is great!

Disclaimer: I don't know much about SDEs, but I would love to make use of neural SDEs, since I have some use cases that I think could work with them. Mostly, though, I find this highly interesting! So I hope I haven't gotten it all wrong.

I think there are quite a few differences between your code example and what's described in your paper.

The paper says that you have a 1-layer GRU to recover the dynamics, and that f_net and h_net are both 1-layer MLPs, while f_net takes an additional context variable of size 1. Then there's a decoder to map back from latent space to feature space.
So if I understand correctly, this model would work just as shown in figure 4 of your paper:
The GRU consumes a (time-reversed) sequence of inputs in observation space and outputs a sequence in a 4D latent space. The final output (at t0) is then your initial condition (z0) for the SDE, which is integrated through time, producing another sequence in latent space. So the dynamics would happen in latent space. Then the decoder maps back to observation space. What's a little unclear to me is what the context would be. Just one (the last) latent variable from each of the GRU outputs?
Anyway, this describes an actual latent SDE, since all the dynamics happen in latent space.
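For concreteness, here's a minimal sketch of how I picture that setup (the module choices, sizes, and the usage comments at the bottom are all my own guesses, not the actual example code):

```python
import torch
import torchsde


class PaperStyleLatentSDE(torch.nn.Module):
    # Sketch only: layer choices and sizes are my guesses, not the example's code.
    noise_type = "diagonal"
    sde_type = "ito"

    def __init__(self, obs_size=3, latent_size=4, context_size=1):
        super().__init__()
        # Encoder: consumes the (time-reversed) observations and produces both
        # the initial latent state z0 and a context for the posterior drift.
        self.gru = torch.nn.GRU(obs_size, latent_size, num_layers=1)
        self.f_net = torch.nn.Linear(latent_size + context_size, latent_size)  # posterior drift
        self.h_net = torch.nn.Linear(latent_size, latent_size)                 # prior drift
        self.g_net = torch.nn.Linear(latent_size, latent_size)                 # diffusion
        self.decoder = torch.nn.Linear(latent_size, obs_size)                  # latent -> observation
        self._ctx = None  # set from the GRU outputs before calling sdeint

    def f(self, t, z):  # posterior drift, conditioned on the context
        return self.f_net(torch.cat([z, self._ctx], dim=-1))

    def h(self, t, z):  # prior drift
        return self.h_net(z)

    def g(self, t, z):  # diagonal diffusion, kept positive
        return torch.sigmoid(self.g_net(z))


# Rough usage, with xs of shape (T, batch, obs_size) and ts of shape (T,):
# out, _ = model.gru(torch.flip(xs, dims=(0,)))  # run the GRU in reverse time
# z0 = out[-1]                                   # last output (at t0) -> initial latent state
# model._ctx = some_context_from(out)            # hypothetical: context built from GRU outputs
# zs = torchsde.sdeint(model, z0, ts)            # integrate in latent space
# xs_hat = model.decoder(zs)                     # decode back to observation space
```

The piece I'm unsure about is exactly how the context is formed from the GRU outputs, hence the question above.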

However, in your implementation in the example, all the dynamics happen in observation space directly. f_net and h_net both map from observation space to drifts in observation space (well, f_net again sees an additional context, which is of size 64 here). So the GRU encoder "only" provides the context, not the initial latent state, meaning it does not recover the dynamics, if I understood it right.
So this is not an actual "Latent SDE" model, is it? More like a "Latent informed/controlled SDE"?

It's a similar story for the latent_sde example. The dynamics are learned in observation space as well, so there is no latent space involved. One could of course claim that the latent space is equal to the observation space here. Or do I simply have the wrong idea of what "latent SDE/ODE" actually means?

In general, I think this package could greatly benefit from more in-depth documentation on training models with it and/or better-explained examples. One or two examples of standard use cases in Jupyter notebooks, with detailed explanations, could go a long way toward making this more accessible to people who don't have much background in SDEs (like me).

Thanks for your interest!

The GRU consumes a (time-reversed) sequence of inputs in observation space and outputs a sequence in a 4D latent space.

You're absolutely right!

What's a little unclear to me is what the context would be. Just one (the last) latent variable from each of the GRU outputs?

The GRU outputs at intermediate times can also be used for practical performance benefits (e.g. check this out). The searchsorted operation here is used to find the "right" context that's produced only using future observations.
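Here's a stripped-down sketch of that lookup; the variable names are placeholders, not what the example actually uses:

```python
import torch

# ts are the observation timestamps (sorted, forward in time);
# ctxs holds one context vector per timestamp.
ts = torch.tensor([0.0, 0.5, 1.0, 1.5])
ctxs = torch.randn(len(ts), 64)

def context_at(t):
    # Index of the first timestamp strictly greater than t (clamped to the last
    # index), i.e. the context computed from observations in the future of t.
    i = min(int(torch.searchsorted(ts, t.reshape(1), right=True)), len(ts) - 1)
    return ctxs[i]

print(context_at(torch.tensor(0.7)).shape)  # torch.Size([64])
```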

However, in your implementation in the example, all the dynamics happen in observation space directly.

I think it's fair to put it that way, and to say that points in the latent space are mapped to the observed space via an identity transform.

So this is not an actual "Latent SDE" model, is it? More like a "Latent informed/controlled SDE"?

I think you have a fair point. I agree that I was somewhat sloppy with the re-implementation. Essentially, the things that are simplified are 1) variational inference at time t0 (I didn't include the KL penalty, or a prior), and 2) an actual non-trivial decoder that maps points in the latent space back to the observed space.

To properly do 1), one would first select a good prior (say N(mu, sigma); mu and sigma can be optimized during training). To compute the KL penalty, one would also need the variational distribution given by the encoder (e.g. dependent on the last output of the GRU or the return value of some other encoder).
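As a rough sketch of what 1) could look like (names are hypothetical, and the encoder outputs are faked with random tensors here):

```python
import torch
import torch.distributions as dist

latent_size, batch_size = 4, 16

# Learnable prior p(z0) = N(mu, sigma); mu and sigma trained along with the model.
prior_mu = torch.nn.Parameter(torch.zeros(latent_size))
prior_logstd = torch.nn.Parameter(torch.zeros(latent_size))

# Stand-ins for the encoder output (e.g. the last GRU state) that would
# parameterize the approximate posterior q(z0 | x).
qz0_mu = torch.randn(batch_size, latent_size)
qz0_logstd = torch.randn(batch_size, latent_size)

qz0 = dist.Normal(qz0_mu, qz0_logstd.exp())
pz0 = dist.Normal(prior_mu, prior_logstd.exp())

z0 = qz0.rsample()  # reparameterized sample used as the SDE's initial state
kl_t0 = dist.kl_divergence(qz0, pz0).sum(dim=-1).mean()
# The training loss would then combine the reconstruction term, this kl_t0 term,
# and the path-wise KL coming from the difference between the posterior drift f
# and the prior drift h.
```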

In general, I think this package could greatly benefit from more in-depth documentation on training models with it and/or better-explained examples. One or two examples of standard use cases in Jupyter notebooks, with detailed explanations, could go a long way toward making this more accessible to people who don't have much background in SDEs (like me).

Thanks for the suggestion, and I totally agree! Happy to spend time documenting the model better in the future, though unfortunately my schedule in the near term seems quite packed.

Thanks for answering! That cleared things up!

What's a little unclear to me is what the context would be. Just one (the last) latent variable from each of the GRU outputs?

The GRU outputs at intermediate times can also be used for practical performance benefits (e.g. check this out). The searchsorted operation here is used to find the "right" context that's produced only using future observations.

What I meant was: the paper mentions a context size of 1, while the latent space is 4-dimensional. I assumed the context size of 1 referred to the latent dimension, so that you took only one of the 4 latent variables at the corresponding timestep of the GRU output sequence as the context. But it refers to the time dimension, so all four latent variables, got it!

In general, I think this package could greatly benefit from more in-depth documentation on training models with it and/or better-explained examples. One or two examples of standard use cases in Jupyter notebooks, with detailed explanations, could go a long way toward making this more accessible to people who don't have much background in SDEs (like me).

Thanks for the suggestion, and I totally agree! Happy to spend time documenting the model better in the future, though unfortunately my schedule in the near term seems quite packed.

Awesome! I am looking forward to it, whenever it may be!

Even though this is probably not the right place, as GitHub issues are not meant as a forum for asking for help, I'll still try to sneak in a few practical questions; I hope you don't mind. I would be thrilled if you could give some answers! However, if that is not appropriate, feel free to shoot me down and close.

In my potential use case, I have a bunch of time series data. I would like to fit one SDE model to it to make predictions into the future, based on a given portion of timesteps. So it is pretty similar to the latent_sde_lorenz example, and even more so to the geometric Brownian motion example in your paper, I think.

  1. I am a little confused about the roles of the prior h and the approximate posterior f in this setting. It's probably due to my lack of knowledge in the field; I think I know what prior and posterior distributions are, but I am not entirely clear about their roles here: is the approximate posterior's role only to condition the prior, so that we can make good predictions with it? So inference would always happen with the prior drift? In the geometric Brownian motion example, you make future predictions with both. But for the model setup as described in the last posts, that would mean that the context for the approximate posterior drift would remain static for all t > 1, as the GRU didn't give outputs there, is that correct? So I guess that would deteriorate predictions the farther out in the future they are, right?

  2. Instead of the output of one discrete path, a more relevant prediction would be the expected value and its variance at a (future) timestep. Computing those for constant drift and diffusion seems easily possible. But in the case of a neural SDE, do you know if it is possible to compute them directly? In the latent_sde example, I think you kind of compute them by sampling a bunch of paths in order to show something like the pdf at each timestep (color-coded in blue). Having an option to make sdeint compute the expectation and variance natively might actually be very useful!

  3. Since I think this is how the adjoint backward pass is implemented anyway, having an option to integrate backward through time might be very useful too! But to implement it myself, this is a good starting point, I guess? (I only need to figure out why g**2 is scaled by the score and how that translates to my use case?)

What I meant was: the paper mentions a context size of 1, while the latent space is 4-dimensional. I assumed the context size of 1 referred to the latent dimension, so that you took only one of the 4 latent variables at the corresponding timestep of the GRU output sequence as the context. But it refers to the time dimension, so all four latent variables, got it!

The context vector is an extra piece of information that isn't really related to the latent space. It doesn't really refer to the timestamps either. On the other hand, the timestamp is used to select which context vector we should use for integrating the SDE in a specific time interval.

Even though this is probably not the right place, as GitHub issues are not meant as a forum for asking for help, I'll still try to sneak in a few practical questions; I hope you don't mind. I would be thrilled if you could give some answers! However, if that is not appropriate, feel free to shoot me down and close.

I'm closing the issue for now after the fixes, but I'm also happy to keep chatting here if that may be helpful.

I'm splitting this reply into several segments, as one giant wall of text may seem intimidating.

I am a little confused about the roles of the prior h and the approximate posterior f in this setting.

If you're familiar with Gaussian processes, I'd say that it's reasonable to think that the prior here is analogous to the prior there when you're only fitting a single time series sequence. In fact, the OU process is a Gaussian process, and this is a special case where the two model classes somewhat coincide. Notably, things are a bit different when one is fitting multiple time series sequences and trying to do interpolation/extrapolation for each sequence individually.

But for the model setup as described in the last posts, that would mean that the context for the approximate posterior drift would remain static for all t > 1, as the GRU didn't give outputs there, is that correct?

If the goal is extrapolation based on observations of a single time series sequence, I'd recommend using the posterior drift.
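As a toy sketch of what that looks like (hypothetical module, not the example's code): past the last observation the encoder produces no new output, so whichever context is selected there just keeps being fed to the posterior drift as you integrate further into the future.

```python
import torch
import torchsde


class ToyPosteriorSDE(torch.nn.Module):
    """Toy stand-in for a posterior SDE whose drift is conditioned on a context."""
    noise_type = "diagonal"
    sde_type = "ito"

    def __init__(self, latent_size=4, context_size=64):
        super().__init__()
        self.f_net = torch.nn.Linear(latent_size + context_size, latent_size)
        # Pretend this is the last context the encoder produced; for all times
        # past the final observation it simply stays fixed.
        self.register_buffer("last_ctx", torch.zeros(1, context_size))

    def f(self, t, z):
        ctx = self.last_ctx.expand(z.size(0), -1)
        return self.f_net(torch.cat([z, ctx], dim=-1))

    def g(self, t, z):
        return 0.1 * torch.ones_like(z)


sde = ToyPosteriorSDE()
z0 = torch.zeros(8, 4)                          # batch of initial latent states
ts = torch.linspace(0.0, 2.0, steps=50)         # observed window plus the future
with torch.no_grad():
    zs = torchsde.sdeint(sde, z0, ts, dt=0.05)  # (50, 8, 4) extrapolated paths
```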

Instead of the output of one discrete path, a more relevant prediction would be the expected value and its variance at a (future) timestep. Computing those for constant drift and diffusion seems easily possible. But in the case of a neural SDE, do you know if it is possible to compute them directly? In the latent_sde example, I think you kind of compute them by sampling a bunch of paths in order to show something like the pdf at each timestep (color-coded in blue). Having an option to make sdeint compute the expectation and variance natively might actually be very useful!

I'd agree the naive method would be to estimate the statistics with samples. I'm aware of works that intend to approximately simulate SDEs by only simulating the marginal mean and covariance ODEs. I may not be up-to-date on the latest developments there, but I haven't seen a paper that convincingly demonstrated that such a method is consistently accurate and leads to models of good utility.
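For completeness, the sample-based estimate is only a few lines on top of sdeint. Here's a self-contained toy with geometric Brownian motion (so the estimates can be checked against closed forms); it's just a sketch, not part of the examples:

```python
import torch
import torchsde


class GBM(torch.nn.Module):
    """Toy geometric Brownian motion: dY = mu * Y dt + sigma * Y dW."""
    noise_type = "diagonal"
    sde_type = "ito"

    def __init__(self, mu=0.5, sigma=0.3):
        super().__init__()
        self.mu, self.sigma = mu, sigma

    def f(self, t, y):
        return self.mu * y

    def g(self, t, y):
        return self.sigma * y


# Many paths from the same initial condition; reduce over the sample dimension
# to get per-timestep estimates of the mean and variance.
num_samples = 4096
y0 = torch.ones(num_samples, 1)
ts = torch.linspace(0.0, 1.0, steps=20)

with torch.no_grad():
    ys = torchsde.sdeint(GBM(), y0, ts, dt=1e-2)  # (20, num_samples, 1)

mean = ys.mean(dim=1)  # estimate of E[Y_t] at each timestep
var = ys.var(dim=1)    # estimate of Var[Y_t] at each timestep
# For GBM with Y_0 = 1 these can be sanity-checked against the closed forms
# E[Y_t] = exp(mu * t) and Var[Y_t] = exp(2 * mu * t) * (exp(sigma**2 * t) - 1).
```

The obvious downside is the usual Monte Carlo one: the estimates only improve at a 1/sqrt(num_samples) rate.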

More generally, the problem is related to solving the Fokker-Planck equation, which is known to be difficult outside of special cases.
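For reference, by the Fokker-Planck equation I mean the PDE governing the marginal density $p(x, t)$ of $dX_t = f(X_t, t)\,dt + g(X_t, t)\,dW_t$:

$$
\partial_t p(x, t) = -\sum_i \partial_{x_i}\!\big[f_i(x, t)\, p(x, t)\big] + \tfrac{1}{2}\sum_{i,j} \partial_{x_i}\partial_{x_j}\!\big[(g g^\top)_{ij}(x, t)\, p(x, t)\big].
$$

Since it's a PDE over the whole state space, solving it directly is only tractable in special cases or very low dimensions.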

Since I think this is how the adjoint backward pass is implemented anyway, having an option to integrate backward through time might be very useful too! But to implement it myself, this is a good starting point, I guess? (I only need to figure out why g**2 is scaled by the score and how that translates to my use case?)

Our adjoint implementation is in this file. The core functions of interest are the reverse drift and diffusions (e.g. here and here).

What you listed in the example is something very different, and comes from another paper. I implemented it for MNIST before they released their codebase, and it was mostly for fun for myself.

Notably, the reverse SDE formulation in that paper is totally different from ours. The backward/time-reverse-SDE formulation in our paper ensures that individual sample paths can be reversed given a fixed Brownian motion sample. The time-reverse SDE in their paper only ensures that the marginal distributions can be reconstructed. Note that one could get the same marginals even with different sample paths.

For their purposes and applications, their reverse-time SDE formulation was sufficient.
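To connect this to the "why is g**2 scaled by the score" part of your question: their forward SDE is $dx = f(x, t)\,dt + g(t)\,dw$, and the associated reverse-time SDE (Anderson's classical result, as far as I recall, which is what that paper uses) is

$$
dx = \big[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\big]\,dt + g(t)\,d\bar{w},
$$

so $g(t)^2$ times the score of the marginal density is exactly the drift correction that makes the backward marginals match the forward ones. Since it only matches marginals, not individual sample paths under a fixed Brownian motion, it's a different object from the path-wise reversal used in our adjoint.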

Thank you for the answers! I think I'll have to read a little further into the topic and try to set up a small model for my use case when I have the time (it's just a little side project right now, out of interest) before asking any more questions here. Thanks for offering to keep chatting!