Multiprocessing the Python module 'emcee', but not all CPU cores on the machine are being used
Maryamtajalli opened this issue · 6 comments
Hello,
I am parallelizing emcee using the multiprocessing module, as described in the emcee documentation. However, htop shows that the program only ever uses a limited number of cores (26-27). The computer I am running my code on has 80 cores, and I would like to make use of all of them to speed up the code. Could you please help me with this? I did not find a solution in the documentation. This only happens with emcee; when I use the multiprocessing module in other programs, htop shows all 80 cores being used.
This is my code:
```python
import sys

import emcee
import h5py
import numpy as np
from multiprocessing import Pool
from scipy.stats import truncnorm

with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, npars, logpfunc, pool=pool)
    start = []
    if len(sys.argv) > 1:
        print('using last spread of %s to initialize walkers' % sys.argv[1])
        startfile = h5py.File('%s' % sys.argv[1], 'r')
        for i in range(nwalkers):
            tmp = np.zeros(npars)
            for nn in range(npars):
                tmp[nn] = startfile[pars[nn]['name']][i, -1]
            start.append(tmp)
        startfile.close()
    else:
        for i in range(nwalkers):
            tmp = np.zeros(npars)
            for j in range(npars):
                a, b = (bounds[j][0] - pars[j]['guess'])/pars[j]['spread'], (bounds[j][1] - pars[j]['guess'])/pars[j]['spread']
                p0 = truncnorm.rvs(a, b, size=1)*pars[j]['spread'] + pars[j]['guess']
                tmp[j] = p0
            start.append(tmp)
    print('Sampling')
    sampler.run_mcmc(start, nstep, progress=True)
```
emcee will use a number of cores equal to half the number of walkers. So in your case, you might want to use 160 walkers to take advantage of the number of CPUs you have. Sometimes it can also be worth parallelizing your model evaluation so that it executes in parallel too, and then using a smaller number of walkers.
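To see why the core count saturates at about half the walker count: emcee's default stretch move updates the ensemble in two halves, so each parallel map covers only `nwalkers // 2` log-probability calls at a time. A minimal sketch of that pattern, with a toy Gaussian log-probability standing in for an expensive model (the numbers here are illustrative, not from the issue):

```python
import multiprocessing as mp

import numpy as np

def log_prob(theta):
    # toy stand-in for an expensive log-posterior
    return -0.5 * np.sum(theta**2)

nwalkers, npars = 160, 5          # 160 walkers -> up to 80 concurrent tasks
half = np.random.randn(nwalkers // 2, npars)

with mp.Pool() as pool:
    # emcee issues one task per walker in the current half-ensemble,
    # so at most nwalkers // 2 cores can be busy at once
    vals = pool.map(log_prob, half)

print(len(vals))  # 80
```

With 80 cores, `nwalkers = 160` makes the half-ensemble exactly fill the machine.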
I see, thank you very much. I once tried to parallelize my model evaluation, since there is an inevitable for loop in my log-posterior function which is the main reason my posterior is computationally expensive, but I ran into the error:

```
daemonic processes are not allowed to have children
```
Yeah, I think it's right that you can't directly use multiprocessing for the model parallelization. I normally parallelize at a lower level (using numpy or hand-written C code), but that certainly isn't always straightforward, so perhaps using more walkers will do the trick for you!
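Since the workers in the outer pool can't spawn their own child processes, the usual fallback is exactly what's suggested above: push the inner loop down into numpy. A hypothetical example (the exponential model here is made up purely for illustration):

```python
import numpy as np

def model_loop(theta, x):
    # the "inevitable" inner loop, evaluated point by point in Python
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = theta[0] * np.exp(-theta[1] * x[i])
    return out

def model_vectorised(theta, x):
    # same computation, but numpy runs the loop in compiled code
    return theta[0] * np.exp(-theta[1] * x)

x = np.linspace(0.0, 1.0, 1000)
theta = (2.0, 0.5)

# both versions agree; the vectorised one avoids the Python-level loop
assert np.allclose(model_loop(theta, x), model_vectorised(theta, x))
```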
Hello, I was thinking about one of your suggestions: writing C code for the model evaluation. My log_posterior function contains an integral inside a for loop, and using scipy.integrate makes my code very slow but the values of log_posterior much more accurate. I have tried other numerical integration methods (such as Monte Carlo integration) to speed up the calculation, but, as I said, the results are not as precise as scipy.integrate's. I was wondering whether it would be possible to write my log_posterior function in C and then somehow wrap this computationally expensive function into my main Python code and pass it to emcee.

Also, do you think using async / await to run the log_posterior function and emcee in parallel could improve performance?

I would appreciate it if you could let me know whether these are good ways to speed up emcee.
Speeding up emcee doesn't really seem like the right phrase to use here, given that the rate-limiting step seems to be your log-posterior function.

You may have already followed these steps for improved efficiency, forgive me if you have:

1. Try and eliminate as many loops as you can using numpy broadcasting and vectorisation. It's often possible to remove nested loops that are performance killers. I would focus on the `func` you're passing to `scipy.integrate` first.
2. Can you vectorise `func`? For example, `quadrature` would speed up if you can pass a `func` that receives and returns vector arrays.
3. Alternatively, since you mention that you call a `scipy.integrate` function in a for loop, could you eliminate that loop by using [`quad_vec`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.quad_vec.html#scipy.integrate.quad_vec)? Note that the difference between 2 and 3 is that `quadrature` returns the result of a single integration, while `quad_vec` returns results from multiple integrations.
4. You can often get good speedups by cythonising `func` (either cython or wrapped C), but that's a more involved process. This process becomes much easier if you've already tried point 1.
5. If you carry out 4, then you can improve yet again by generating a `LowLevelCallable` from the cython function and passing it to `scipy.integrate.quad`. The overhead from the external QUADPACK library calling Python functions repeatedly can be significant.

From my own experience (having experienced similar performance issues), it's not uncommon to get orders of magnitude of speedup. I went from a naive Python implementation with nested for loops to C and gained at least an order of magnitude. I then understood my algorithm better, which enabled me to vectorise (à la step 1), which brought the Python code to within a factor of 2 of the C.
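Point 3 maps directly onto the "integral inside a for loop" pattern. A sketch of the transformation, using a made-up Gaussian-style integrand with a vector of scale parameters (the integrand and values are assumptions for illustration):

```python
import numpy as np
from scipy.integrate import quad, quad_vec

scales = np.linspace(0.5, 2.0, 8)   # hypothetical per-datum parameters

# loop version: one quad call per scale parameter
loop_results = np.array(
    [quad(lambda x, s=s: np.exp(-x**2 / s), 0.0, 1.0)[0] for s in scales]
)

# quad_vec version: the integrand returns a vector, so a single
# adaptive pass evaluates all eight integrals at once
vec_results, _ = quad_vec(lambda x: np.exp(-x**2 / scales), 0.0, 1.0)

assert np.allclose(loop_results, vec_results)
```

Because `quad_vec` shares the subdivision of the integration interval across all components, it typically needs far fewer integrand evaluations than the loop of `quad` calls.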
Dear Andrew, thank you very much for your great suggestions. I vectorized `func`, and using `quadrature` the code sped up by a factor of almost 2, although the results from `quadrature` were not as accurate as `quad`'s. I will also try to cythonize `func` and see how things improve further.
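One more knob worth checking before cythonising (assuming here that `quad` is being run at its default tolerances): `quad` accepts `epsabs`/`epsrel` arguments, and loosening them can buy speed while staying far more accurate than a Monte Carlo estimate. A sketch with a toy Gaussian integrand:

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x**2)

tight, _ = quad(f, 0.0, 5.0)                            # default tolerances
loose, _ = quad(f, 0.0, 5.0, epsabs=1e-6, epsrel=1e-6)  # cheaper target

# the loosened result stays close to the tight one
assert abs(tight - loose) < 1e-5
```

Whether this helps depends on how smooth the real integrand is; for badly behaved integrands the adaptive subdivision, not the tolerance, dominates the cost.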