dfm/emcee

Saving non-arrayifiable data at each likelihood evaluation (either to disk or in memory)

jcblemai opened this issue · 4 comments

General information:

  • emcee version: 3.1.4
  • platform: OS X, Linux
  • installation method (pip/conda/source/other?): conda or pip

Problem description:
We have an epidemic modeling pipeline (flepimop.org) with a custom MCMC inference engine, and in our trials emcee seems to outperform it. So we are really excited to include it in our package.

However, our pipeline saves some outputs (in 6 parquet files) that we need at each evaluation. The filenames follow a convention: for example, the spar metadata for the 1000th iteration of worker 17 is stored as model_output/rsv_CA_SIR_v6_none_IHRadj/CA_v6/spar/global/intermediate/000000017.000001000.CA_v6.spar.parquet.
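(For illustration only, a sketch of the naming convention described above as I read it: a zero-padded worker id, a zero-padded iteration number, a run id, and a file type. The helper name is hypothetical, not part of the flepiMoP code.)

```python
def intermediate_filename(worker: int, iteration: int, run_id: str, file_type: str) -> str:
    # Zero-pad worker and iteration to 9 digits, matching the example path above
    return f"{worker:09d}.{iteration:09d}.{run_id}.{file_type}.parquet"

print(intermediate_filename(17, 1000, "CA_v6", "spar"))
# 000000017.000001000.CA_v6.spar.parquet
```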

My question is: is it possible for workers to know their number, and to know which iteration is currently being run, so that I can write these files? And if not, can I use blobs to save non-arrayifiable data?

This is probably a very common problem: as our simulations are expensive, we cannot just rerun them at the end for the values stored in the chain.

I have tried various hacks (such as calling sample in a loop that rebuilds the log prob function with a lambda capturing the iteration number), but nothing worked correctly, so I would appreciate any help on this :). It seems like it may not be possible without modifying emcee/ensemble.py?

Best regards, and thanks for emcee, which I have been happily using since 2018! Always so performant!

dfm commented

Good question! I'm not sure that there's anything that will work out of the box.

Just so that I'm sure I understand: The idea here is that within your evaluation of the log prob function, you need to be able to pass the iteration number and the walker id to your computational model? Something like:

def log_prob_func(params):
    # Note: these functions don't exist
    walker_id = get_walker_id()
    mcmc_iteration_number = get_mcmc_iteration_number()
    
    log_prob = expensive_pipeline_that_writes_files(params, walker_id, mcmc_iteration_number)

    return log_prob

right?

To get the walker id number you could set vectorize=True in your EnsembleSampler, although even in this case it would be a little awkward because only half of the walkers are passed on each call, so you don't have direct access to the global walker indices.

But, if you have some control over how the files are saved, here's an idea for a workaround. Instead of tracking the walker and step number, and handling the bookkeeping yourself, you could instead get a unique id for each step and use that in the filenames. Here's a simple example:

import numpy as np
import emcee

import hashlib
import uuid
import pathlib

def log_prob(x):
    # Hash the input parameters
    id = hashlib.sha256(x.data).hexdigest()

    # OR

    # A random unique id
    id = uuid.uuid4().hex

    # Build a filename based on the unique id
    filename = f"samples/{id[:6]}/{id[6:]}"

    # A placeholder for writing the appropriate file
    pathlib.Path(filename).parent.mkdir(parents=True, exist_ok=True)
    with open(filename, "w") as f:
        f.write(str(x))

    # Return the filename as a "blob"
    return -0.5 * np.sum(x ** 2), filename

ndim, nwalkers = 2, 10
p0 = np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 12)
sampler.get_blobs()

Here we're indexing the filename for each step by either:

  1. Hashing the buffer behind the parameters, or
  2. Generating a random unique id using the uuid module.

Then, we track this unique id using blobs. At the end of the run, the blobs array contains the appropriate filename that corresponds to each sample, taking into account whether or not it was accepted.

Hope this helps!

jcblemai commented

This is a truly great idea, and thanks for providing an example implementation. It works like a charm. (I had come up with a much dirtier solution, based on running in a loop with different function arguments at each iteration, but it misses the walker identification, which is important to me.) We are rolling out emcee in flepiMoP! Let me know if you'd like me to add your answer to emcee's documentation.

Thank you so much. I have a small clarification question that I did not get from the docs. I believe the blobs are updated on "accept" (not at every iteration), e.g.:

  • If iteration 4 proposes parameter p_4 and it is accepted, then my blob/files are the ones obtained by running with p_4. All good.
  • If p_5 is proposed but rejected, does my blob correspond to the run from p_5 or from p_4?

I believe it is the blob corresponding to p_4 (from reading the code; I'm not at my laptop to test right now), which is great for my use case. Is this correct?

dfm commented

if p_5 is proposed but gets rejected, does my blob correspond to the run from p_5 or p_4?

You're right: the blob will correspond to the run for p_4 (the previously accepted sample) rather than p_5 (the rejected sample). In other words, the blobs array will generally contain duplicates. I think this is typically what you would want, and it would be a bit tricky to get the other behavior.

jcblemai commented

Great, that's indeed my expected behaviour. Thank you, Dan, for everything! Implementing this neat solution now :)