Add OpenMP support
Closed this issue · 11 comments
Should be easy to add OpenMP statements like:
omp_set_num_threads(nthreads);
#pragma omp parallel for reduction(-:nll)
before the for loops computing the negative log-likelihood in the C++ code. Some other clauses like `shared` might be necessary. Would require an `nthreads` argument for each fitting function.
OpenMP should result in huge speed improvements on Linux. My understanding is that OpenMP support is limited, but getting better, on Windows and macOS.
I experimented with this a bit:
library(unmarked)
library(microbenchmark)
n <- 10000 # number of sites
T <- 4 # number of primary periods
J <- 3 # number of secondary periods
lam <- 3
phi <- 0.5
p <- 0.3
y <- array(NA, c(n, T, J))
M <- rpois(n, lam) # Local population size
N <- matrix(NA, n, T) # Individuals available for detection
for(i in 1:n) {
N[i,] <- rbinom(T, M[i], phi)
y[i,,1] <- rbinom(T, N[i,], p) # Observe some
Nleft1 <- N[i,] - y[i,,1] # Remove them
y[i,,2] <- rbinom(T, Nleft1, p) # ...
Nleft2 <- Nleft1 - y[i,,2]
y[i,,3] <- rbinom(T, Nleft2, p)
}
y.ijt <- cbind(y[,1,], y[,2,], y[,3,], y[,4,])
umf1 <- unmarkedFrameGMM(y=y.ijt, numPrimary=T, type="removal")
microbenchmark(
m1 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=1),
m2 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=2),
m4 = gmultmix(~1,~1,~1, data=umf1, K=30, nthreads=4),
times=5)
Unit: seconds
expr min lq mean median uq max neval cld
m1 9.099484 9.111422 9.372926 9.418311 9.614503 9.620908 5 b
m2 5.414858 5.548022 5.629184 5.631519 5.680126 5.871395 5 a
m4 5.349892 5.388611 5.446901 5.482736 5.503174 5.510094 5 a
Looks like there is definitely a performance improvement. However, 2 threads and 4 threads were the same for me. My laptop has 2 cores with 2 threads per core, so that might be related. My CPU usage did spike to 100% for the 4-thread test.
I had to change the indexing around a bit for `gmultmix`, but otherwise didn't have to change any code. I also had to add the `-fopenmp` compiler flag for GCC, although I think there is a way to have Rcpp do that automatically.
Sounds promising. You may have had some other programs tying up the other two threads. What OS were you using? I've had great luck lately with OpenMP on Ubuntu.
BTW1: https://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-basics.html
BTW2: I'm thinking `threads` would be better than `nthreads`. Also, we might want to set it up so that `threads=1` disables all OpenMP stuff so that people can get the old behavior. I worry that adding OpenMP could open up a can of worms on certain platforms. Perhaps we should keep it out of the master branch until we've had a lot of time to test it out.
I'm also on Ubuntu. I have a pretty limited understanding of cores, threads, and hyperthreading, but it seems like going beyond 2 threads on my CPU gets into hyperthreading, and you aren't expected to see much performance boost from that. I have access to another Linux machine with many more cores, so I'll compile it on that and see if performance scales further.
Looks like you can enable OpenMP conditionally pretty easily:
#pragma omp parallel for if(condition_holds)
for(...) {
}
The other issue is whether this code would compile without problems on CRAN/Windows/Mac. For example, Dirk warns about this in section 2.10.3 here:
http://dirk.eddelbuettel.com/code/rcpp/Rcpp-FAQ.pdf
This might be a little out of date regarding macOS OpenMP support, but even if it's possible to compile on macOS, that doesn't mean it will work easily for R packages. Unfortunately, macOS is the one platform I have no way to test on.
I agree this is best kept in a branch for the time being. Maybe it could be worked on together with modernizing the interface with Rcpp that I mentioned a while back.
On a machine with 4 cores and 2 threads per core, testing 1, 2, 4, and 8 threads:
Unit: milliseconds
expr min lq mean median uq max neval
m1 637.3505 637.6977 639.8118 640.2461 640.2783 643.4865 5
m2 367.2343 369.1187 370.4815 369.8476 370.8641 375.3429 5
m4 231.8589 233.7292 237.0996 236.9478 239.3805 243.5815 5
m8 218.4185 219.7414 231.8048 220.5617 221.1067 279.1956 5
Continued speed-up with more cores, and as before, going beyond the number of cores doesn't speed things up as much.
I'm working on this here. For example, code for pcount. Even after reading the Writing R Extensions guidelines, it's not totally clear to me whether the `#pragma` statement itself needs to be protected in an `_OPENMP` conditional block. The `#pragma` should be ignored when compiling if OpenMP is unavailable, perhaps with a warning. We could always duplicate the loop code, with one loop inside the block, but it would be nice to avoid that.
I looked at the source code for a few of the distribution functions and I think they're relatively thread-safe. Technically, `Rf_dpois` and the like can throw warnings under unusual circumstances, for example if `lambda < 0`, but if we've written the likelihood correctly this should never happen. If it's still a concern, we could use the `Rf_*_raw` versions of the distribution functions, which generally don't do input sanity checks. As further evidence: after asking this question, it seems Murray Efford is still using distribution functions from the R API in parallel C++ code in secr.
On the other hand, the random number generators in the R API are definitely NOT thread-safe from what I've seen, which makes sense.
The `pcount` code looks good to me, and based on what you said, it's fine with me if we go ahead and assume the density functions are thread-safe. It would be nice to find someone with a Mac to test it. Do you see any warnings from `R CMD check --as-cran`?
One thing I wonder about is the importance of setting `OMP_PLACES` and `OMP_PROC_BIND`, especially when running OpenMP code on multiple cores with the parallel package. I've been doing this on another project, and it seems like more than one OpenMP chunk can land on the same thread. This might happen in unmarked if people are using `parboot` in parallel. To be safe, I've been setting:
OMP_PLACES="threads"
OMP_PROC_BIND="spread"
but I can't say for sure if this is necessary. More info here and here.
Setting `OMP_PROC_BIND="spread"` has a pretty big negative impact on performance for me when I set threads to greater than the number of cores. I'm setting it in the pragma statement with `proc_bind(spread)` (see here). I'm not sure how to properly set the `OMP_PLACES` variable from Rcpp; it seems complicated.
Regardless of where we land on the above, I wonder if we should just force `threads=1` for any parallel operations in unmarked, e.g. in `parboot`. Intuitively, it doesn't seem like using more threads in the likelihood calculation will help if multiple models are already being run in parallel.
Also worth considering: Armadillo uses OpenMP automatically for some computations when it's available (see http://arma.sourceforge.net/faq.html). I haven't been able to determine exactly what conditions trigger this, but it appears to happen consistently for `occuMulti`. If you try to layer additional parallelization on top of this, either by using OpenMP in C++ code or by using `parallel` in R, you get much slower runtimes than if you had avoided running in parallel in the first place.
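If that automatic parallelism turns out to conflict with ours, Armadillo documents a config macro to switch it off. A sketch, assuming the macro behaves as documented (it has to be defined before the first Armadillo include):

```cpp
// Disable Armadillo's own OpenMP use; must come before the first
// Armadillo header (for RcppArmadillo, before RcppArmadillo.h) so
// Armadillo's config sees it.
#define ARMA_DONT_USE_OPENMP
#include <RcppArmadillo.h>
```

That would leave our own explicit pragmas as the only source of threading in the package.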