Multiprocessing the Python module 'emcee', but not all CPU cores on the machine are being used
Maryamtajalli opened this issue · 6 comments
Hello,
I am parallelizing emcee using the multiprocessing module, as described in the emcee documentation. However, htop shows that the program only ever uses a limited number of cores (26-27). The computer I am running my code on has 80 cores, and I would like to make use of all of them to speed up the code. Could you please help me with this? I did not find a solution in the documentation. This only happens with emcee; when I use the multiprocessing module in other programs, htop shows all 80 cores being used.
This is my code:
```python
import sys

import emcee
import h5py
import numpy as np
from multiprocessing import Pool
from scipy.stats import truncnorm

with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, npars, logpfunc, pool=pool)
    start = []
    if len(sys.argv) > 1:
        print('using last spread of %s to initialize walkers' % sys.argv[1])
        startfile = h5py.File('%s' % sys.argv[1], 'r')
        for i in range(nwalkers):
            tmp = np.zeros(npars)
            for nn in range(npars):
                tmp[nn] = startfile[pars[nn]['name']][i, -1]
            start.append(tmp)
        startfile.close()
    else:
        for i in range(nwalkers):
            tmp = np.zeros(npars)
            for j in range(npars):
                a, b = (bounds[j][0] - pars[j]['guess'])/pars[j]['spread'], (bounds[j][1] - pars[j]['guess'])/pars[j]['spread']
                p0 = truncnorm.rvs(a, b, size=1)*pars[j]['spread'] + pars[j]['guess']
                tmp[j] = p0
            start.append(tmp)
    print('Sampling')
    sampler.run_mcmc(start, nstep, progress=True)
```
emcee will use a number of cores equal to half the number of walkers. So in your case, you might want to use 160 walkers to take advantage of the number of CPUs you have. Sometimes it can also be worth parallelizing your model evaluation so that it executes in parallel too, and then using a smaller number of walkers.
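To see why the core count saturates at about half the walker count: emcee's default stretch move updates the ensemble in two halves, so each parallel map covers only `nwalkers // 2` log-probability calls at a time. A minimal sketch of that pattern, with a toy Gaussian log-probability standing in for an expensive model (the numbers here are illustrative, not from the issue):

```python
import multiprocessing as mp

import numpy as np

def log_prob(theta):
    # toy stand-in for an expensive log-posterior
    return -0.5 * np.sum(theta**2)

nwalkers, npars = 160, 5          # 160 walkers -> up to 80 concurrent tasks
half = np.random.randn(nwalkers // 2, npars)

with mp.Pool() as pool:
    # emcee issues one task per walker in the current half-ensemble,
    # so at most nwalkers // 2 cores can be busy at once
    vals = pool.map(log_prob, half)

print(len(vals))  # 80
```

With 80 cores, `nwalkers = 160` makes the half-ensemble exactly fill the machine.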
I see, thank you very much. I once tried to parallelize my model evaluation, since there is an inevitable for loop in my log-posterior function which is the main reason my posterior is computationally expensive, but I ran into the error:

```
daemonic processes are not allowed to have children
```
Yeah, I think it's right that you can't directly use multiprocessing for the model parallelization. I normally parallelize at a lower level (using numpy or hand-written C code), but that certainly isn't always straightforward, so perhaps using more walkers will do the trick for you!
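Since the workers in the outer pool can't spawn their own child processes, the usual fallback is exactly what's suggested above: push the inner loop down into numpy. A hypothetical example (the exponential model here is made up purely for illustration):

```python
import numpy as np

def model_loop(theta, x):
    # the "inevitable" inner loop, evaluated point by point in Python
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = theta[0] * np.exp(-theta[1] * x[i])
    return out

def model_vectorised(theta, x):
    # same computation, but numpy runs the loop in compiled code
    return theta[0] * np.exp(-theta[1] * x)

x = np.linspace(0.0, 1.0, 1000)
theta = (2.0, 0.5)

# both versions agree; the vectorised one avoids the Python-level loop
assert np.allclose(model_loop(theta, x), model_vectorised(theta, x))
```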
Hello, I was thinking about one of your suggestions: writing C code for the model evaluation. My log_posterior function contains an integral inside a for loop, and using scipy.integrate makes my code very slow but the values of log_posterior much more accurate. I have tried other numerical integration methods (such as Monte Carlo integration) to speed up the calculation, but, as I said, the results are not as precise as scipy.integrate's. I was wondering whether it would be possible to write my log_posterior function in C and then somehow wrap this computationally expensive function into my main Python code and pass it to emcee.

Also, do you think using async / await to run the log_posterior function and emcee in parallel could improve performance?

I would appreciate it if you could let me know whether these are good ways to speed up emcee.
Speeding up emcee doesn't really seem like the right phrase to use here, given that the rate-limiting step seems to be your log-posterior function.

You may have already followed these steps for improved efficiency, forgive me if you have:

1. Try and eliminate as many loops as you can using numpy broadcasting and vectorisation. It's often possible to remove nested loops that are performance killers. I would focus on the `func` you're passing to `scipy.integrate` first.
2. Can you vectorise `func`? For example, `quadrature` would speed up if you can pass a `func` that receives and returns vector arrays.
3. Alternatively, since you mention that you call a `scipy.integrate` function in a for loop, could you eliminate that loop by using [`quad_vec`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.quad_vec.html#scipy.integrate.quad_vec)? Note that the difference between 2 and 3 is that `quadrature` returns the result of a single integration, while `quad_vec` returns results from multiple integrations.
4. You can often get good speedups by cythonising `func` (either cython or wrapped C), but that's a more involved process. This process becomes much easier if you've already tried point 1.
5. If you carry out 4, then you can improve yet again by generating a `LowLevelCallable` from the cython function and passing it to `scipy.integrate.quad`. The overhead from the external QUADPACK library calling Python functions repeatedly can be significant.

From my own experience (having experienced similar performance issues), it's not uncommon to get orders of magnitude of speedup. I went from a naive Python implementation with nested for loops to C and gained at least an order of magnitude. I then understood my algorithm better, which enabled me to vectorise (à la step 1), which brought the Python code to within a factor of 2 of the C.
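Point 3 maps directly onto the "integral inside a for loop" pattern. A sketch of the transformation, using a made-up Gaussian-style integrand with a vector of scale parameters (the integrand and values are assumptions for illustration):

```python
import numpy as np
from scipy.integrate import quad, quad_vec

scales = np.linspace(0.5, 2.0, 8)   # hypothetical per-datum parameters

# loop version: one quad call per scale parameter
loop_results = np.array(
    [quad(lambda x, s=s: np.exp(-x**2 / s), 0.0, 1.0)[0] for s in scales]
)

# quad_vec version: the integrand returns a vector, so a single
# adaptive pass evaluates all eight integrals at once
vec_results, _ = quad_vec(lambda x: np.exp(-x**2 / scales), 0.0, 1.0)

assert np.allclose(loop_results, vec_results)
```

Because `quad_vec` shares the subdivision of the integration interval across all components, it typically needs far fewer integrand evaluations than the loop of `quad` calls.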
Dear Andrew, thank you very much for your great suggestions. I vectorized `func`, and using `quadrature` the code sped up by a factor of almost 2, although the results from `quadrature` were not as accurate as `quad`'s. I will also try to cythonize `func` and see how things improve further.
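One more knob worth checking before cythonising (assuming here that `quad` is being run at its default tolerances): `quad` accepts `epsabs`/`epsrel` arguments, and loosening them can buy speed while staying far more accurate than a Monte Carlo estimate. A sketch with a toy Gaussian integrand:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x**2)

tight, _ = quad(f, 0.0, 5.0)                            # default tolerances
loose, _ = quad(f, 0.0, 5.0, epsabs=1e-6, epsrel=1e-6)  # cheaper target

# the loosened result stays close to the tight one
assert abs(tight - loose) < 1e-5
```

Whether this helps depends on how smooth the real integrand is; for badly behaved integrands the adaptive subdivision, not the tolerance, dominates the cost.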