Question/feature: perform posterior inference
itsdfish opened this issue · 26 comments
Hello,
I am interested in performing posterior interference with various simulation models, such as agent based models. I was wondering whether that is possible with your package. One example in Python is BayesFlow. If this is not possible currently, I think it would be a very useful feature to add.
As a simple example, suppose I have a Gaussian model and I want to update the prior on mu and sigma after observing 50 observations. Would that be possible? Here is partial code:
using Distributions
function sample_prior()
μ = rand(Normal(0, 1))
σ = rand(Uniform(0, 10))
return [μ,σ]
end
function generate_data()
θ = sample_prior()
y = rand(Normal(θ...))
return [θ...,y]
end
# rows: μ, σ, y
training_data = mapreduce(x -> generate_data(), hcat, 1:10000)
# train some data
observed_data = rand(Normal(0, 1), 50)
# obtain posterior distribution of μ and σ
Yes, our main use for this library is posterior inference and it currently works very well by training on samples on the joint distribution as in your setup.
Im in the middle of travel, but tomorrow I will send a script on how to train on the training pairs you made.
Awesome. I would very much appreciate an example.
Also, does it support multivariate distributions?
An example would be great. I've been looking into doing Variational Inference with Normalizing Flows but I have yet to find a complete Julia example.
Started a PR so that we can add some more examples. Started with the setup that @itsdfish gave. The basic script is here https://github.com/slimgroup/InvertibleNetworks.jl/pull/84/files#diff-231dfb51cc0355965e85e0fedb61956dd6e52e86eb974de512ede26feb55d8b6
using Distributions
function sample_prior()
μ = rand(Normal(0, 1))
σ = rand(Uniform(0, 10))
return [μ,σ]
end
function generate_data(num_obs)
θ = sample_prior()
y = rand(Normal(θ...), num_obs)
return [θ...,y...]
end
# train using samples from joint distribution x,y ~ p(x,y) where x=[μ, σ] -> y = N(μ, σ)
# rows: μ, σ, y
num_params = 2; num_obs = 50; n_train = 10000
training_data = mapreduce(x -> generate_data(num_obs), hcat, 1:n_train);
############ train some data
X_train = reshape(training_data[1:num_params,:], (1,1,num_params,:));
Y_train = reshape(training_data[(num_params+1):end,:], (1,1,num_obs,:));
n_epochs = 2
batch_size = 200
n_batches = div(n_train,batch_size)
# make conditional normalizing flow
using InvertibleNetworks, LinearAlgebra, Flux
L = 3 # RealNVP multiscale layers
K = 4 # Coupling layers per scale
n_hidden = 32 # Hidden channels in coupling layers' neural network
G = NetworkConditionalGlow(num_params, num_obs, n_hidden, L, K;);
opt = ADAM(4f-3)
# Training logs
loss_l2 = []; logdet_train = [];
for e=1:n_epochs # epoch loop
idx_e = reshape(1:n_train, batch_size, n_batches)
for b = 1:n_batches # batch loop
X = X_train[:, :, :, idx_e[:,b]];
Y = Y_train[:, :, :, idx_e[:,b]];
# Forward pass of normalizing flow
Zx, Zy, lgdet = G.forward(X, Y)
# Loss function is l2 norm - logdet
append!(loss_l2, norm(Zx)^2 / prod(size(X))) # normalize by image size and batch size
append!(logdet_train, -lgdet / prod(size(X)[1:end-1])) # logdet is already normalized by batch size
# Set gradients of flow
G.backward(Zx / batch_size, Zx, Zy)
# Update parameters of flow
for p in get_params(G)
Flux.update!(opt,p.data,p.grad)
end;
clear_grad!(G)
print("Iter: epoch=", e, "/", n_epochs, ", batch=", b, "/", n_batches, "; f l2 = ", loss_l2[end],
"; lgdet = ", logdet_train[end], "; full objective = ", loss_l2[end] + logdet_train[end], "\n")
end
end
# posterior inference on unseen observation
x_ground_truth = [1,1] # mu=1, sigma=1
observed_data = reshape(rand(Normal(x_ground_truth[1], x_ground_truth[2]), num_obs), (1,1,num_obs,:))
# posterior sampling with conditional normalizing flow
num_post_samps = 1000
ZX_noise = randn(1,1,num_params,num_post_samps)
Y_forward = repeat(observed_data, 1, 1, 1, num_post_samps)
_, Zy_fixed_train, _ = G.forward(ZX_noise, Y_forward); #needed to set the proper transforms on inverse
X_post = G.inverse(ZX_noise, Zy_fixed_train );
# plot results
using PyPlot
X_prior = mapreduce(x -> sample_prior(), hcat, 1:num_post_samps)
X_prior = reshape(X_prior[1:num_params,:], (1,1,num_params,:))
fig = figure()
subplot(1,2,1)
hist(X_prior[1,1,1,:];alpha=0.7,density=true,label="Prior")
hist(X_post[1,1,1,:];alpha=0.7,density=true,label="Posterior")
axvline(x_ground_truth[1], color="k", linewidth=1,label="Ground truth")
xlabel(L"\mu"); ylabel("Density");
legend()
subplot(1,2,2)
hist(X_prior[1,1,2,:]; alpha=0.7,density=true,label="Prior")
hist(X_post[1,1,2,:]; alpha=0.7,density=true,label="Posterior")
axvline(x_ground_truth[2], color="k", linewidth=1,label="Ground truth")
xlabel(L"\sigma"); ylabel("Density");
legend()
tight_layout()
And after two epochs, the posterior samples look like this for mu=1 std=1
@itsdfish yes it is quite good for multivariate distributions. I actually think that the networks we have shine best on posterior distributions over images because they are currently mostly conv net based. Although, we are working on dense network for 1d inputs in #77
Will be happy to help anyone get these working for their specific applications and then we can mold them to nice examples.
Thank you for providing an example. This is very helpful. While looking through the code, I thought of a few high level questions. I hope you don't mind me asking some basic questions (machine learning is outside of my expertise).
- Is it possible to perform inference on a data set of any size, as is possible in BayesFlow? In the example you provided, the training size and the observed size for inference is
num_obs
. An error is thrown if they are different. Unfortunately, this limits the usefulness of the network. - Why do you use 4 dimensional arrays (e.g., X_train etc)? What are the first two dimensions?
- Can this example serve as a template for models with more parameters and data dimensions?
- What kind of neural network are you using? Is it a conditional auto encoder?
- Are there any types of models that this package is not suitable for? For example, time series data, or non iid?
Thanks again!
Dont mind at all! thank you for the interest and good questions:
1.) Yes it is possible. In general you need a summary statistic that transorms arbitrary sized observations to a single fixed size. Either learned summary statistic (summary network) like in BayesFlow or a transformation related to the forward operator like in:
https://doi.org/10.1117/12.2651691
and
https://openreview.net/pdf?id=LoJG-lUIlk
The learned summary networks will be merged soon in #82
2.) This isn't anything inherent. The 4d arrays are just a result of our lab applications being mostly on structured images that have width height dimension and 5d tensors for volume inputs. So to hack in 1d inputs with the current code we put the data into the channel dimension and the rest are singletons. It looks like the community is also interested in data with 1 dimension so we are adding dense layers and logic to handle 1 dimension in the PR of dense layers #77
3.) Yes, currently it should work for any parameter and data dimension. I played around with a few without errors but if you run into anything let me know and I might be able to point to a solution quicker.
4.) I cant think of anything that doesnt work off of the top of my head. There is some technical details on the limits of normalizing flows for learning distributions. It has to do with there existing a diffeomorphism between the data distribution and the base distribution but in my experience this is mostly a technical detail and they have been practically used on essentially every type of data application. so AFAIK, the package should work for any marginal distribution and any conditional distribution. The marginal you just need samples x~p(x). But the conditional distributions can be trained with different types of samples depending on the scenario. For the above example (amortized posterior) you need samples from the joint distribution. But you can also do non-amortized Variational inference that only requires a single observation and access to the forward model and its adjoint (if that was part of the process used to make the observation). This method is used in tandem with traditional Amortized variational inference in one of the papers in this package readme:
https://arxiv.org/abs/2207.11640
https://github.com/slimgroup/ReliableAVI.jl
Awesome! Thanks for your answers. The approaches you described in your answer to my first question will be very useful to the Julia community because there are so many types of models for which Bayesian inference is challenging.
I would like to take you up on your offer to develop some tutorial examples. If you would be willing to adapt your gaussian example to use the summary statistic technique once #82 is merged, I can use it to develop tutorials for different types of models. Here are some initial thoughts:
-
Multivariate Gaussian as a simple demonstration of multivariate distributions
-
A decision making model called the Linear Ballistic Accumulator, which describes how people choose between multiple options. The distribution is multivariate where one dimension (choice) is discrete, and the other is continuous (response time). The LBA has a tractable likelihood function, which would allow us to potentially compare the invertible network to results from Turing.jl. However, there are many decision making models that are only simulation based. So this would be of broad interest to that community.
-
An agent based model from Agents.jl. The developers of Agents.jl have several example models available. I think this could become very popular with that modeling framework. In fact, there was recently a question on discourse which prompted me to look for possible solutions.
-
I think it would be a good idea to demonstrate a few things, such as saving trained networks (presumably with BSON.jl), how to generate posterior predictive distributions, and perform quality checks. I would be happy to help with these.
When all is said and done, it might be worth wrapping these procedures into useful functions and putting them in a separate package in the slimgroup organization. If that turns out to be valuable, I can also assist with that.
@rafaelorozco, I would like to make an tutorial example for model of discrete data. The constructor for NetworkConditionalGlow
thew an "inexact" error. So I was wondering if it is even possible. I added the code here in case its possible and you would be willing to have a look.
Update
It seems like the problem that NetworkConditionalGlow
fails when num_params = 1
. If I duplicate the parameter theta so that there are two parameters, it seems to work correctly and agree with the analytic posterior distribution of theta.
I came to answer the question but you arrived to exactly the correct conclusion in the update! It is also cool that you came up with the hack that I typically use when doing invertible layers on arrays of length 1.
Thank you so much for the interest and the cool ideas for examples. Im working with @mloubout to merge the summary network PR soon. We have been using them successfully for a while, just want to make the interface a little bit more user friendly.
The network saving and loading is super important so I will make that example right now for the PR.
Awesome! I look forward to digging in and making some tutorial examples once the summary network is complete.
One of my goals is to define a user-friendly interface for training and sampling. Maybe with a bit more thought, it can be flexible, general and extendable. Here is the first iteration. See train!
and sample_posterior
, which wrap around the code you provided. The training info is displayed with a progress bar using ProgressMeter.jl
I spent some time looking at the code but had a hard time coming up with suggestions. Looks good! I think it is a matter of increasing complexity of examples and seeing where things break and need to be more generalizable.
I added an example of loading and saving networks here in the PR:
There was an error with BSON so you need to be on that PR for it to work.
Awesome. Thanks for adding functionality for saving and loading trained networks. This will be very useful!
Thanks for looking over the code. I did encounter an issue while trying to develop a multivariate gaussian model. The constructor for NetworkConditionalGlow
threw a similar inexact error as before. It appears that it throws an error anytime n_parms
is odd. Perhaps the dimensions of arrays must be 2^n, or even. Is this something were we need some logic to add a redundant parameter and remove it later?
I do have another question. When extending the y_train
to a multivariate gaussian with 2 variables, are the dimensions supposed to be (1,2,n_obs,n_train)? That's my best guess, but I anticipate that might be the next problem.
update
I forgot to add a link above to the training data: https://github.com/itsdfish/NormalizingFlowsTutorials.jl/blob/066c32a2030bc8c7a67f6d434a1b92a57d1b15f6/sandbox/mv_gaussian.jl#L46
aah great catch thank you for bringing that up. I thought I had fixed it but I missed it in that layer. It is a simple problem related to the invertible coupling layer. It is an easy fix but I forgot to also do it for the conditional layers.
With regards to y_train, we will need to go to proper 1dim implementation that looks like (nx, n_chan, n_train). Where nx is the main size of the variable (in your case n_obs). and n_chan would be used for multiple channels or 2 variables in this case. Again, this is working for non-conditional layer, just need to port it to conditional layers. Should be able to get it done by tomorrow.
Thank you for the very useful input!
No problem. It's part of the process. Thanks for looking into that bug!
I tried restructuring y_train
as a (n_obs, n_dim, n_train) array, but that did not work. Here is the updated code. When you say 1dim, do you mean one column vector with n_obs * n_dim * n_train elements?
In either case, I was wondering whether there is a general representation that can be used. For example, in the univariate Gaussian example you made, y_train
had four following dimensions: (1,1,n_obs,n_trian). The univariate case is a special where the number of dimensions of the data is one. So if we could use a 3D array with dimensions (n_obs, n_dim, n_train), it would cover most use cases. I don't have a good mental of the package. So I am not sure what a good design approach would aside from aiming for simplicity and consistency if possible. Do you have any suggestions? Thanks!
To get (n_obs, n_dim, n_train) working You might need to change some of the book keeping that happens in train! and how you make the network (hopefully just adding the kwarg ndim=1 since default is ndim=2). I will take a look and maybe make a PR with the needed changes.
The way we implemented it is with 5d, 4d and 3d tensors.
In all cases, the last two dimensions are for channel and batch size.
5d is for volumes (nx,ny,nz,n_channel,nbatch)
4d is for 2d images (nx,ny,n_channel,nbatch)
3d is for 1d arrays (nx,n_channel,nbatch)
So yes, we agree with your conclusion for 1d arrays. We just call n_dim, n_channel -> (n_obs, n_dim, n_train) -> (nx, n_channel, n_train).
We need the 5d and 4d tensors because for structured volume/image data we want the object to maintain its shape and pixel structure as it goes through the network.
I am slowly starting to understand. Keyword slowly. My understanding is that:
- inputs for
x_train
andy_train
will each be 3D, 4D or 5D arrays. - The last two dimensions represent channel and n_train.
- From what I gather, it seems like array slicing for batches always occurs along the last dimension.
If these are true, then I think the primary train!
function could be generic and receive 3D, 4D, or 5D training inputs, and we can slice x_train
and y_train
for batches along the last dimension. I am not aware of a general way approach to slice along the last dimension for an array of arbitrary size. However, one workaround might be to define a get_batch
function for the 3D, 4D, and 5D case, e.g.,
function get_batch(x::AbstractArray{T,3}, index_range) where {T}
return x[:,:,index_range]
end
There would be two other methods---one for the 4D case and one for the 5D case.
I think this could work, unless one of my assumptions is wrong, or something changes when the summary network is added.
If what I have described seems viable, I think the missing piece in my understanding is how the dimensions in x_train
and y_train
work for the 3D, 4D and 5D cases. My apologies in advance if I am totally off. In either case, I think this demonstrates the need to include thorough explanation in the doc strings and examples. Thanks again for your help!
I am not aware of a JuliaLang/julia#5405 approach to slice along the last dimension for an array of arbitrary size.
You can yes, get_batch(x:: AbstractArray{T, N}, slice) = selectdim(x, N, slice)
will always slice thew last dimension
Thanks! That is a very elegant solution. This is my first time coming across selectdim
. I updated my code accordingly. The main sticking point I am having is understanding the rule for defining the dimensions for x_train
and y_train
for the 3D, 4D, and 5D cases. For the 3D case with a multivariate normal, I have (1,n_parms,n_train)
for 'x_trainand
(n_obs, n_dim, n_trian)for
y_train`, where n_parms is the number of parameters, n_dim is the number of dimensions of the data, n_obs is the number of observations, and n_train is the number of training examples. If anyone could explain that to me, I would greatly appreciate it and can move forward with some examples.
For coupling layers to work, In general, you want x and y to agree on their main dimensionality. For example, if x and y are images you want the images to be of the same size. With summary networks you can do away with this assumption because you can learn a transformation that brings them to the same size (a great advantage of summary networks that isnt talked about as much because not as flashy). With the hacky x_train (1,1,nparams,ntrain) y_train (1,1,nobs,ntrain) things worked because the main dimensionality of the first two dimensions is the same (number of channels can be different without a problem). So when moving to the correct implementation of arrays with 3D tensors
(1,n_parms,n_train) (n_obs, n_dim,n_train) things get complicated because we cant guarantee n_obs to be 1.
So in this case, Im pretty sure we need a summary network that will bring n_obs to be the same size as either n_params or 1 so that you can have dimensions work.
So we are back to waiting on me to finish the summary network PR. I will work on it today and should be ready pretty soon!
Got it. Thank you for explaining how this works. I'm starting to wrap my head around it. I agree with you. It's better to wait until the summary network is finished since it will have different requirements.
Awesome! I look forward to the PR whenever it's ready.
@rafaelorozco, I noticed that you merged the PR for the summary network a few weeks ago. I was wondering if you would be able to help me adapt the Gaussian example (or other example) to be compatible with the summary network? Thanks!
Hey! Sorry for the late reply I have been busy with a conference.
Yes I would be happy to help getting it setup. I tried implementing the invariant layer as a summary network as described in Bayesflow but was not getting good results. Maybe it needs to be a fancier layer like the deepset that they have in their code.
To have the amortizednet you need to be on the newest master branch but this is the updated code.
using Distributions
function sample_prior()
μ = rand(Normal(0, 1))
σ = rand(Uniform(0, 10))
return [μ,σ]
end
function generate_data(num_obs)
θ = sample_prior()
y = rand(Normal(θ...), num_obs)
return [θ...,y...]
end
# train using samples from joint distribution x,y ~ p(x,y) where x=[μ, σ] -> y = N(μ, σ)
# rows: μ, σ, y
train_obs_low = 50
train_obs_high = 100
num_params = 2; num_obs = 100; n_train = 10000
#training_data = mapreduce(x -> generate_data(num_obs), hcat, 1:n_train);
############ train some data
#X_train = reshape(training_data[1:num_params,:], (1,1,num_params,:));
#Y_train = reshape(training_data[(num_params+1):end,:], (1,1,num_obs,:));
n_epochs = 4
batch_size = 200
n_batches = div(n_train,batch_size)
# make conditional normalizing flow
using InvertibleNetworks, LinearAlgebra, Flux
struct invariantmodel
h1::Chain
h2::Chain
num_summary
end
function (m::invariantmodel)(x)
x_sum = m.h1(x[1,1,1:1,:])
for i in 2:size(x)[1]
x_sum += m.h1(x[1,1,i:i,:])
end
reshape(m.h2(x_sum),1,1,num_summary,:)
end
Flux.@functor invariantmodel
num_summary = num_params # size of learned summary
n_hidden_summary = 64 # size hidden channels in summary networks
h1 = Chain(Dense(1,n_hidden_summary),Dense(n_hidden_summary,num_summary))
h2 = Chain(Dense(num_summary,n_hidden_summary),Dense(n_hidden_summary,num_summary))
learned_summary = FluxBlock(Chain(invariantmodel(h1,h2,num_summary)))
L = 3 # RealNVP multiscale layers
K = 4 # Coupling layers per scale
n_hidden = 32 # Hidden channels in coupling layers' neural network
cond_net = NetworkConditionalGlow(num_params, num_summary, n_hidden, L, K;);
G = SummarizedNet(cond_net, learned_summary)
opt = ADAM(4f-3)
# Training logs
loss_l2 = []; logdet_train = [];
loss_l2_test = []; logdet_test = [];
for e=1:n_epochs # epoch loop
idx_e = reshape(1:n_train, batch_size, n_batches)
for b = 1:n_batches # batch loop
num_obs = rand(train_obs_low:train_obs_high) # sample different number of training samples during training
training_data = mapreduce(x -> generate_data(num_obs), hcat, 1:batch_size);
X = reshape(training_data[1:num_params,:], (1,1,num_params,:));
Y = reshape(training_data[(num_params+1):end,:], (1,1,num_obs,:));
# Forward pass of normalizing flow
Zx, Zy, lgdet = G.forward(X, Y)
# Loss function is l2 norm - logdet
append!(loss_l2, norm(Zx)^2 / prod(size(X))) # normalize by image size and batch size
append!(logdet_train, -lgdet / prod(size(X)[1:end-1])) # logdet is already normalized by batch size
# Set gradients of flow
G.backward(Zx / batch_size, Zx, Zy; Y_save=Y)
# Update parameters of flow
for p in get_params(G)
Flux.update!(opt,p.data,p.grad)
end;
clear_grad!(G)
print("Iter: epoch=", e, "/", n_epochs, ", batch=", b, "/", n_batches, "; f l2 = ", loss_l2[end],
"; lgdet = ", logdet_train[end], "; full objective = ", loss_l2[end] + logdet_train[end], "\n")
end
end
# posterior inference on unseen observation
num_obs_test = 75
x_ground_truth = [1,1] # mu=1, sigma=1
observed_data = reshape(rand(Normal(x_ground_truth[1], x_ground_truth[2]), num_obs_test), (1,1,num_obs_test,:))
# posterior sampling with conditional normalizing flow
num_post_samps = 1000
ZX_noise = randn(1,1,num_params,num_post_samps)
Y_forward = repeat(observed_data, 1, 1, 1, num_post_samps)
_, Zy_fixed_train, _ = G.forward(ZX_noise, Y_forward); #needed to set the proper transforms on inverse
X_post = G.inverse(ZX_noise, Zy_fixed_train );
# plot results
using PyPlot
X_prior = mapreduce(x -> sample_prior(), hcat, 1:num_post_samps)
X_prior = reshape(X_prior[1:num_params,:], (1,1,num_params,:))
fig = figure()
subplot(1,2,1)
hist(X_prior[1,1,1,:];alpha=0.7,density=true,label="Prior")
hist(X_post[1,1,1,:];alpha=0.7,density=true,label="Posterior")
axvline(x_ground_truth[1], color="k", linewidth=1,label="Ground truth")
xlabel(L"\mu"); ylabel("Density");
legend()
subplot(1,2,2)
hist(X_prior[1,1,2,:]; alpha=0.7,density=true,label="Prior")
hist(X_post[1,1,2,:]; alpha=0.7,density=true,label="Posterior")
axvline(x_ground_truth[2], color="k", linewidth=1,label="Ground truth")
xlabel(L"\sigma"); ylabel("Density");
legend()
tight_layout()
# Look at training curve
fig = figure("training logs ", figsize=(10,12))
subplot(3,1,1); title("L2 Term")
plot(loss_l2, label="train");
axhline(y=1,color="red",linestyle="--",label="Standard Normal")
xlabel("Parameter Update"); legend();
subplot(3,1,2); title("Logdet Term")
plot(logdet_train);
xlabel("Parameter Update") ;
subplot(3,1,3); title("Total Objective")
plot(loss_l2 + logdet_train);
xlabel("Parameter Update") ;
tight_layout()
Note that during training we change the number of observations so that it can hopefully learn to generalize over that.
@rafaelorozco, no problem at all. I completely understand and appreciate your time. Thanks for the example above. I should have time to look through it later today or tomorrow. By the way, I enjoyed reading your pre-print Refining Amortized Posterior Approximations using Gradient-Based Summary Statistics . It looks like you are getting a nice improvement with your new method.
I can see the problem you mentioned. The posterior distribution does not change much from the prior distribution, even if the sample size is at max (100). I didn't see much information about DeepSets or SetTransformer (the new approach used in BayesFlow) in Julia. I'm not sure if DeepSets and SetTransforms are variations on existing nn architectures in Julia. Do have a sense for whether this is even possible?
Oh thank you for reading the paper! We will put out a full length journal related to that technique which will bring together a lot of the things we have been working on using this package.
I think it is definitely possible to implement the set based NN layers. I looked again at the bayesflow code and the papers they cite for deepset. They seem to be the same in principal to what I was attempting in that code. I suspect that it is a matter of playing with hyperparameters and the architecture. I might ask some people in the lab to look into it and will report back if we have success.