r-spatialecology/landscapemetrics

Parallel rcpp code

Closed this issue · 29 comments

hey y'all,

just had a quick look at how to parallelize the Rcpp stuff and discussed it with Sebastian.

I think the only thing we would have to change is a single line in the Rcpp code of most of the functions, and maybe provide another argument to the function (the number of cores to be used, which could default to 1).

Have a look at rcpp_get_coocurrence_matrix and tell me what you think (the only change is the #pragma ... line):

#include "rcpp_get_coocurrence_matrix.h"
#include "rcpp_create_neighborhood.h"
#include "rcpp_get_unique_values.h"
#include "get_class_index_map.h"
#include <omp.h>

// [[Rcpp::export]]
IntegerMatrix rcpp_get_coocurrence_matrix(const IntegerMatrix x,
                                          const arma::imat directions,
                                          const int num_cores) {
 
    const int na = NA_INTEGER;
    const unsigned ncols = x.ncol();
    const unsigned nrows = x.nrow();

    std::vector<int> classes = rcpp_get_unique_values(x);
    std::map<int, unsigned> class_index = get_class_index_map(classes);

    unsigned n_classes = class_index.size();
    std::vector<std::vector<unsigned> > cooc_mat(n_classes,
                                                 std::vector<unsigned>(n_classes));

    // create neighbors coordinates
    IntegerMatrix tmp = rcpp_create_neighborhood(directions);
    int neigh_len = tmp.nrow();
    std::vector<std::vector<int> > neig_coords;
    for (int row = 0; row < neigh_len; row++) {
        IntegerVector a = tmp.row(row);
        std::vector<int> b(a.begin(), a.end());
        neig_coords.push_back(b);
    }

    // NAs need an index, otherwise they are counted as neighbors of class[0]
    class_index.insert(std::make_pair(na, n_classes));

    #pragma omp parallel for num_threads(num_cores)
    for (unsigned col = 0; col < ncols; col++) {
        for (unsigned row = 0; row < nrows; row++) {
            const int tmp = x[col * nrows + row];
            if (tmp == na)
                continue;
            unsigned focal_class = class_index[tmp];
            for (int h = 0; h < neigh_len; h++) {
                int neig_col = neig_coords[h][0] + col;
                int neig_row = neig_coords[h][1] + row;
                if (neig_col >= 0 &&
                        neig_row >= 0 &&
                        neig_col < ncols &&
                        neig_row < nrows) {
                    const int tmp = x[neig_col * nrows + neig_row];
                    if (tmp == na)
                        continue;
                    unsigned neig_class = class_index[tmp];
                    cooc_mat[focal_class][neig_class]++;
                }
            }
        }
    }

    IntegerMatrix result(n_classes, n_classes);
    for (unsigned col = 0; col < cooc_mat.size(); col++) {
        for (unsigned row = 0; row < cooc_mat[col].size(); row++) {
            result(col, row) = cooc_mat[col][row];
        }
    }

    // add names
    List u_names = List::create(classes, classes);
    result.attr("dimnames") = u_names;
    return result;
}
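
As a side note: for OpenMP in an R package, src/Makevars typically needs the $(SHLIB_OPENMP_CXXFLAGS) flags in PKG_CXXFLAGS and PKG_LIBS, and because not every toolchain ships OpenMP, the header include is usually guarded so the file still compiles without it (the pragmas are then simply ignored). A minimal sketch, where resolve_num_cores is a hypothetical helper and not part of the code above:

#include <algorithm>

// Guard the OpenMP header so the file also compiles on toolchains without it.
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: clamp the requested thread count to something sane.
static int resolve_num_cores(int num_cores) {
#ifdef _OPENMP
    if (num_cores < 1) return 1;
    return std::min(num_cores, omp_get_max_threads());
#else
    return 1; // no OpenMP available -> run sequentially
#endif
}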

Again, I would argue for having a new function, e.g. rcpp_get_coocurrence_matrix_parallel()

@mhesselbarth This would produce a lot of redundant code that may diverge over time and make the code harder to maintain. I see the benefit of having fewer function arguments for the external functions, but I would argue for having no redundant code in the internal functions.

That might be true as well... I really don't like the idea of built-in/internal, kind of "hidden", parallelization. Definitely something we need to think about carefully.

The question would be how to access the internal argument of the Rcpp functions then 🤔

What I am thinking about is:

I have a very large raster and I just want to calculate lsm_p_perim for that single raster.
Currently, this runs sequentially - however, as most of our C++ is just plain for loops, we could very easily parallelize that part of our code (see the first post).

I discussed this with Max a bit and we came up with two possible solutions (a new parallel function for each metric was not one of them) for how to implement that:

  1. We can introduce another global option for the package. This would be minimally invasive to the package, as we wouldn't need to introduce further function arguments. However, it lacks visibility for users who install the package straight from CRAN (and wouldn't see a vignette on the homepage or the like).

  2. We introduce a new function argument (e.g. num_cores) for every metric that relies on Rcpp code. This would mean changing quite a few functions, but it would be very visible to the user, as it would be prominently placed in the documentation of each function.

Any thoughts on that @Nowosad, @bitbacchus?

We could introduce a wrapper function that calls the Rcpp functions with the argument, like so:

# calculate all metrics on patch level (single threaded)
calculate_lsm(landscape, level = "patch")

# calculate all metrics on patch level (multi threaded)
calculate_parallel_lsm(landscape, level = "patch")

But we would still need one of the two options I mentioned to control the individual metrics, for example lsm_p_perim(landscape, num_cores = 16).

Exactly!

So, there is a parallel version of rcpp_get_coocurrence_matrix now in this branch:

https://github.com/r-spatialecology/landscapemetrics/blob/rcpp_parallel/src/rcpp_get_coocurrence_matrix.cpp

... it works, and is a lot faster.

However, the results if one uses more than one core are ... fluctuating 😆
I have no idea why, but these lines should make it clear.

Any idea why, @bitbacchus? It could be caused by OpenMP being used to iterate over rows and cols instead of just a single for loop ... Probably related to the SIMD section here.
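
For context, parallelizing over rows and cols with a single OpenMP construct would use the collapse clause, which fuses the two perfectly nested loops into one iteration space before scheduling it across threads. Illustration only, reusing the names from the first post; whether this relates to the fluctuating results is the open question here:

#pragma omp parallel for num_threads(num_cores) collapse(2)
for (unsigned col = 0; col < ncols; col++) {
    for (unsigned row = 0; row < nrows; row++) {
        // ... same loop body as in the first post ...
    }
}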

I toyed around a bit on my laptop (i7 dual core with hyperthreading = 4 threads):
[benchmark plot]

ncores = 0 is the original function, without any OMP; ncores 1-4 are the number of CPU cores used.
@marcosci is this what you mean by fluctuating?

My interpretation:

  • the original function is faster than single-core OpenMP, because OpenMP prevents some optimizations by the compiler
  • more than two cores are not much faster, because of the limited CPU caches (I'll check on a machine with more cache)
  • dynamic scheduling scales a bit better for three cores, because the number of columns is not divisible by three (see the scheduling sketch below)

Overall, the differences are negligible. Also, the speed gains are negligible for this raster size.
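
For readers unfamiliar with the scheduling clauses mentioned above: schedule(static) splits the iterations into fixed, roughly equal chunks up front, while schedule(dynamic) hands iterations to threads on demand, which helps when the iteration count does not divide evenly across threads. A small standalone sketch (not the benchmark code) makes the difference visible:

// Standalone sketch: print which thread handles which "column" under the two
// scheduling strategies (compile with -fopenmp).
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

static int thread_id() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

int main() {
    const int ncols = 10;

    // static: iterations are divided into fixed chunks before the loop starts
    #pragma omp parallel for num_threads(3) schedule(static)
    for (int col = 0; col < ncols; col++)
        std::printf("static  col %2d -> thread %d\n", col, thread_id());

    // dynamic: each thread grabs the next iteration when it becomes free
    #pragma omp parallel for num_threads(3) schedule(dynamic)
    for (int col = 0; col < ncols; col++)
        std::printf("dynamic col %2d -> thread %d\n", col, thread_id());

    return 0;
}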

[benchmark plot]
Here I tested a larger raster (Charlotte's tif) on an 8-core virtual machine.

I think the problem is not that the runtime is fluctuating but the results ... which is kind of a problem 😄

Can you show what you did there Sebastian?

On my computer it's substantially faster:

> bench::mark(
+     cores_1 = rcpp_get_coocurrence_matrix(mat, four),
+     cores_4 = rcpp_get_coocurrence_matrix_par(mat, four, 4),
+     cores_8 = rcpp_get_coocurrence_matrix_par(mat, four, 8),
+     iterations = 1000,
+     check = FALSE)
# A tibble: 3 x 14
  expression      min     mean   median      max `itr/sec` mem_alloc  n_gc n_itr total_time result          memory             time     gc                  
  <chr>      <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl> <int>   <bch:tm> <list>          <list>             <list>   <list>              
1 cores_1     180.1ms  198.8ms  201.6ms  263.6ms      5.03    3.42KB     0  1000      3.31m <int [15 × 15]> <Rprofmem [2 × 3]> <bch:tm> <tibble [1,000 × 3]>
2 cores_4      53.8ms   54.8ms   54.4ms     68ms     18.2     3.42KB     0  1000      54.8s <int [15 × 15]> <Rprofmem [2 × 3]> <bch:tm> <tibble [1,000 × 3]>
3 cores_8      38.1ms   40.5ms   40.1ms   52.2ms     24.7     3.42KB     0  1000     40.47s <int [15 × 15]> <Rprofmem [2 × 3]> <bch:tm> <tibble [1,000 × 3]>

The only problem is that the results differ if I use more than 1 core (totally random each run, but the difference from our correct result is quite small).

I think the problem is not that the runtime is fluctuating but the results ... which is kind of a problem

Oh, I didn't realize this... I'll have a look.

Should work now, please check.

The problem was that multiple threads wrote to cooc_mat at the same time (i.e. a race condition). To prevent this from happening in the future, I'd suggest using default(none) in the parallel pragma to avoid any implicitly shared objects and adding all needed objects explicitly with shared() (see code).
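
For anyone hitting the same symptom, here is a minimal standalone illustration of such a race (a sketch, not the branch code, which may guard the counts differently): without synchronization, two threads can read, increment and write the same counter at the same time, so increments get lost and the totals fluctuate between runs; an atomic update is one common fix.

// Standalone sketch of the race and one common fix (compile with -fopenmp).
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<int> racy(4, 0), safe(4, 0);

    // Racy: the read-modify-write on racy[k] is unsynchronised, so counts
    // come out slightly (and randomly) too low with more than one thread.
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < n; i++)
        racy[i % 4]++;

    // Fixed: the atomic directive makes each single increment indivisible.
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        safe[i % 4]++;
    }

    std::printf("racy: %d %d %d %d (expected 250000 each)\n",
                racy[0], racy[1], racy[2], racy[3]);
    std::printf("safe: %d %d %d %d\n", safe[0], safe[1], safe[2], safe[3]);
    return 0;
}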

Didn't even see default(none) ... only static and dynamic. Do you have any good resource for OpenMP?
But nice! Then I can go ahead and implement it for the rest - if the design decision is settled?

Just a quick recap:

  • OpenMP allows us to parallelize our Rcpp code without any further dependency
  • The default is still single-threaded

We just have to decide which of the two options we want:

  1. We can introduce another global option for the package. This would be minimally invasive to the package, as we wouldn't need to introduce further function arguments. However, it lacks visibility for users who install the package straight from CRAN (and wouldn't see a vignette on the homepage or the like).
  2. We introduce a new function argument (e.g. num_cores) for every metric that relies on Rcpp code. This would mean changing quite a few functions, but it would be very visible to the user, as it would be prominently placed in the documentation of each function.

Didn't even see default(none) ... only static and dynamic. Do you have any good resource for OpenMP?

Yes, but I can't find it anymore ;-)
This looks good: https://bisqwit.iki.fi/story/howto/openmp/

Great work! I am very impressed.

We just have to decide which of the two options we want:

  1. We can introduce another global option for the package. This would be minimally invasive to the package, as we wouldn't need to introduce further function arguments. However, it lacks visibility for users who install the package straight from CRAN (and wouldn't see a vignette on the homepage or the like).
  2. We introduce a new function argument (e.g. num_cores) for every metric that relies on Rcpp code. This would mean changing quite a few functions, but it would be very visible to the user, as it would be prominently placed in the documentation of each function.

Is it possible to do both? If not, I would vote for 2.

Perfect, that was our tendency, too!

I'm also in favor of the second option, and adding an additional global option shouldn't be a big problem either.

🥇

Both would be cool!

To-Do:

  • rcpp_get_coocurrence_matrix
  • rcpp_get_coocurrence_vector
  • rcpp_get_coocurrence_matrix_diag
  • rcpp_get_circle
  • rcpp_get_nearest_neighbour
  • rcpp_get_boundaries

... I already finished the first two points; maybe some of you can have a look and correct me before I do the rest:

https://github.com/r-spatialecology/landscapemetrics/tree/rcpp_parallel

I'd suggest having a default 'n_cores' in the C++ code as well. This makes the code more backward compatible. I also tested the performance on a server with 32 cores:
[benchmark plot (32-core server)]

It's interesting to see the trade-off between the overhead of creating more threads and the speed-up due to parallelization. We might want to think about CUDA in the future, as there is no overhead for more threads (afaik).

Somehow I couldn't compile when I used a default argument in the rcpp_get_coocurrence_matrix function 🤔 🤷‍♂️

And CUDA would mean another dependency, right? And changing the code?

Somehow I couldn't compile when I used a default argument in the rcpp_get_coocurrence_matrix function

Where did you set the default? In the header or source file? You can't do it in both; I'd put it in the header.

And CUDA would mean another dependency, right? And changing the code?

Yes. It actually also needs a different compiler (not sure if Rcpp still works) and a complete rewrite of the C++ part. Not to mention the hardware dependency ;-)

Somehow I couldn't compile when I used a default argument in the rcpp_get_coocurrence_matrix function

Where did you set the default? In the header or source file? You can't do it in both; I'd put it in the header.

That worked like a charm and is implemented now.
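
For reference, the pattern is roughly the following (a sketch; file layout and the num_cores name are taken from the first post): the default value appears once in the header declaration and is not repeated in the definition.

// rcpp_get_coocurrence_matrix.h (sketch, includes as in the first post):
// the default value lives here only
IntegerMatrix rcpp_get_coocurrence_matrix(const IntegerMatrix x,
                                          const arma::imat directions,
                                          const int num_cores = 1);

// rcpp_get_coocurrence_matrix.cpp (sketch): no default repeated here
IntegerMatrix rcpp_get_coocurrence_matrix(const IntegerMatrix x,
                                          const arma::imat directions,
                                          const int num_cores) {
    // ... function body as in the first post ...
}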

And CUDA would mean another dependency, right? And changing the code?

Yes. It actually also needs a different compiler (not sure if Rcpp still works) and a complete rewrite of the C++ part. Not to mention the hardware dependency ;-)

So, let's never talk about it again 😄

Some thoughts:

  • parallelization makes a lot of sense for moving window calculations, i.e. a parallel raster::focal
    • like many other raster functions, raster::focal actually already supports parallelization. It would be cool if we could just wrap lsm_calc between a beginCluster() and an endCluster() to utilize this. However, I couldn't get it working...
  • it should be rather straightforward to write our own focal function. This would avoid quite a bit of overhead because we could tailor its parameters to exactly what we need. We could also have (optional) parallelization with OpenMP (see the sketch below).
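
A rough sketch of what such a hand-rolled focal function could look like (pure illustration; the function name, the mean statistic and the column-major layout are assumptions): one OpenMP-parallel loop over cells, where each thread writes only its own output cells, so there is no shared write and thus no race.

// Sketch of a minimal parallel moving-window ("focal") mean over a matrix
// stored column-major, skipping NA cells; compile with -fopenmp.
#include <cmath>
#include <vector>

std::vector<double> focal_mean(const std::vector<double>& x,
                               int nrows, int ncols,
                               int radius, int num_cores) {
    std::vector<double> out(x.size(), NAN);

    #pragma omp parallel for num_threads(num_cores)
    for (int col = 0; col < ncols; col++) {
        for (int row = 0; row < nrows; row++) {
            double sum = 0.0;
            int n = 0;
            // square window of side 2 * radius + 1, clipped at the edges
            for (int dc = -radius; dc <= radius; dc++) {
                for (int dr = -radius; dr <= radius; dr++) {
                    int c = col + dc, r = row + dr;
                    if (c < 0 || r < 0 || c >= ncols || r >= nrows) continue;
                    double v = x[c * nrows + r];
                    if (std::isnan(v)) continue;
                    sum += v;
                    n++;
                }
            }
            // each thread writes only its own output cells -> no data race
            if (n > 0) out[col * nrows + row] = sum / n;
        }
    }
    return out;
}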

If I may hijack this issue on speeding up lsm calculations, I was wondering whether

a) any parallel computing was implemented in the landscapemetrics package in the end?

b) whether, parallel computing aside, it would be faster to calculate different metrics at once per spatial scale than to loop over each combination of metric and scale (using the scale_sample() function)? I am trying to calculate three metrics (ed, tca, cohesion) at three different spatial scales (100, 200, 400) and, in addition, for three different RasterLayers (possibly combined in a raster stack). I run into memory issues if I try to calculate all 27 combinations of the three factors at once, but I am wondering whether the sampling is implemented in such a way that the moving window is only calculated once, followed by the calculation of all three metrics, or whether the function loops over all combinations of stacked RasterLayers, spatial scales and metrics anyway.

I am calculating the metrics for each cell in landscapes consisting of 250,000+ grid cells.

Thanks for your help.
Best

Andreas

@adietzel the parallel idea was dropped due to not having enough manpower. We are currently working on some alternative performance improvements (see #302).