cmuparlay/parlaylib

Unclear documentation on parallel_for granularity argument

wheatman opened this issue · 2 comments

I recently ran into a bug that traced back to a bad use of the granularity argument of parallel_for. I expected that if I passed an argument of k, then at least k consecutive iterations of the loop would run serially, and I was relying on this to ensure that no two parallel tasks would operate on elements within k of each other. However, from looking at the code, that is not what granularity means: the loop recursively breaks the range into smaller regions and only stops splitting once a region is no larger than the granularity, so the argument cannot be used to guarantee correctness.
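To make the misunderstanding concrete, here is roughly the kind of call I was making (a sketch, not my real code; the loop body is just a stand-in for work that touches a window of elements around i):

#include <cstddef>
#include <vector>
#include <parlay/parallel.h>

void example(std::vector<int>& data, std::size_t k) {
  // My assumption: with granularity k, every task runs at least k consecutive
  // iterations serially, so concurrently executing iterations are >= k apart.
  parlay::parallel_for(0, data.size(), [&](std::size_t i) {
    data[i] = static_cast<int>(i);  // stand-in for work touching a window around i
  }, /*granularity=*/k);
  // Actual behavior: the range is split recursively, so a leaf block can be
  // any size up to the granularity and its boundaries fall at arbitrary offsets.
}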

The documentation in the code currently states

// parallel loop from start (inclusive) to end (exclusive) running
// function f.
//    f should map size_t to void.
//    granularity is the number of iterations to run sequentially
//      if 0 (default) then the scheduler will decide
//    conservative uses a safer scheduler

I would either add a warning that this is not guaranteed, or qualify the description with a word like "approximate".

On the website the documentation states

Granularity
The granularity of a parallel-for loop specifies the number of iterations of the loop that should be run sequentially. For example, a loop with 100K iterations and a granularity of 1000 would execute 100 parallel tasks, each of which was responsible for evaluating 1000 iterations sequentially.

This again suggests that I could rely on the granularity, so I believe the actual behavior should be specified there as well.

As a separate note, from reading the code it looks like you can get a serial region as small as about 7/16 of the granularity you specify:

  template <typename F>
  void parfor_(size_t start, size_t end, F f, size_t granularity, bool conservative) {
    if ((end - start) <= granularity)
      for (size_t i = start; i < end; i++) f(i);
    else {
      size_t n = end - start;
      // Not in middle to avoid clashes on set-associative caches on powers of 2.
      size_t mid = (start + (9 * (n + 1)) / 16);
      pardo([&]() { parfor_(start, mid, f, granularity, conservative); },
            [&]() { parfor_(mid, end, f, granularity, conservative); },
            conservative);
    }
  }
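For example, if the granularity is 1000 and a recursive call is handed a subrange of 1001 iterations, it still splits (1001 > 1000): mid = (9 * 1002) / 16 = 563 positions past the start, producing pieces of 563 and 438 iterations. Both are now below the granularity and run serially, and 438 is roughly 7/16 of 1001.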

As an additional point of confusion, I had understood the regions to be exact, evenly spaced blocks. For example, I took the statement that 100K iterations with a granularity of 1000 would execute 100 parallel tasks to mean it would be equivalent to:

// What I assumed was equivalent: 100 parallel tasks, each running 1000
// consecutive iterations sequentially.
parlay::parallel_for(0, 100, [&](size_t block) {
  for (size_t i = block * 1000; i < (block + 1) * 1000; i++) {
    f(i);
  }
});

However, the way the regions are actually broken up means that this is not the case.
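Concretely, for 100K iterations with a granularity of 1000, the very first split point is start + (9 * 100001) / 16 = 56250, dividing the range into [0, 56250) and [56250, 100000); the boundaries generally do not fall on multiples of 1000, and the leaf blocks are not all the same size.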

I think the documentation could be clarified so that other people don't run into similar issues.

Thanks for the suggestion. I've improved the documentation to better describe the granularity. If you want to divide a range into exactly sized blocks in parallel, you might be interested in the blocked_for function instead.
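For the 100K/1000 example above, that would look roughly like the following (a sketch; as I recall the callback receives the block index and the block's [start, end) range, but check parallel.h for the exact signature):

#include <parlay/parallel.h>

// Exactly sized blocks: 100 blocks of 1000 iterations, each run sequentially by one task.
parlay::blocked_for(0, 100000, 1000, [&](size_t block, size_t s, size_t e) {
  for (size_t i = s; i < e; i++) f(i);
});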