ashvardanian/less_slow.cpp

Memory issue with `sorting_with_executors`

alexbarev opened this issue · 5 comments

  1. Run sorting_with_executors benchmark using the std::execution::par_unseq policy.

  2. Memory consumption quickly exceeds my machine's available 60 GB after the second test variant finishes: sorting_with_executors/par_unseq/4194304/

Observations:

  1. Memory does not decrease between tests with inputs of different sizes.

  2. I tried moving the policy inside the loop, but it did not resolve the issue:

    for (auto _ : state) {
        auto local_policy = std::execution::par_unseq; // Create policy here
        std::reverse(local_policy, array.begin(), array.end());
        std::sort(local_policy, array.begin(), array.end());
        bm::DoNotOptimize(array.size());
        bm::DoNotOptimize(local_policy); // even tried this just in case
    }
  3. I noticed that only 16 threads are spawned, which matches the number of available CPUs. The issue is puzzling, and I have yet to understand its root cause.

Turns out, it's a known issue.

Also relates to this issue.

@ashvardanian I reproduced the issue with a manually built oneTBB.
This code, compiled with -DPAR, increases memory consumption, reaching 6 GB at its peak, after which the program finishes. The increase happens slowly (over a minute) with -O0 but very quickly (a few seconds) with -O3.

WSL Ubuntu 22.04, gcc version 13.1.0. Latest oneTBB/master built from source.

NB: the same issue occurs outside of WSL.

#include <vector>
#include <cstdint>
#include <numeric>
#include <execution>
#include <algorithm>
#include <oneapi/tbb.h>

int main() {
    int count = 100000;
    std::vector<std::uint32_t> array(count);
    std::iota(array.begin(), array.end(), 1u);

#ifndef PAR
    auto policy = std::execution::seq;
#else
    auto policy = std::execution::par_unseq;
#endif

    for (int i = 0; i < 100000; ++i) {
        std::sort(policy, array.begin(), array.end());
    }

}

In contrast, using -fopenmp with this code runs without a noticeable increase in memory consumption.

#include <vector>
#include <cstdint>
#include <numeric>
#include <execution>
#include <algorithm>

int main() {
    int count = 100000;
    std::vector<std::uint32_t> array(count);
    std::iota(array.begin(), array.end(), 1u);

#pragma omp parallel for
    for (int i = 0; i < 100000; ++i) {
        std::sort(array.begin(), array.end());
    }

}

@ashvardanian The problem lies with libstdc++.

We are waiting for a fix here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117276

Currently, one can fix it locally by modifying the libstdc++ files at:
/usr/include/c++/13/pstl/parallel_backend_tbb.h

This works for me:

 class __task : public tbb::detail::d1::task
 {
   protected:
@@ -646,10 +646,15 @@
 
         _PSTL_ASSERT(__parent != nullptr);
         _PSTL_ASSERT(__parent->_M_refcount.load(std::memory_order_relaxed) > 0);
-        if (--__parent->_M_refcount == 0)
+        auto __refcount = --__parent->_M_refcount;
+
+        // Placing the deallocation after the refcount decrement allows another thread to proceed with tree
+        // folding concurrently with this task cleanup.
+        __alloc.deallocate(this, *__ed);
+
+        if (__refcount == 0)
         {
             _PSTL_ASSERT(__next == nullptr);
-            __alloc.deallocate(this, *__ed);
             return __parent;
         }

Found the solution here: uxlfoundation/oneTBB#1533

@alexbarev, how about we add a ->Iterations(1) to that benchmark with a link to this issue for now?
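If we go that route, the registration change could look something like this (a sketch using Google Benchmark's `->Iterations(1)` modifier; the actual benchmark name and registration in less_slow.cpp may differ):

```cpp
// Hypothetical registration fragment: cap the benchmark at one iteration
// to bound memory growth until the libstdc++ fix (GCC bug 117276) lands.
BENCHMARK(sorting_with_executors)->Iterations(1);
```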