uxlfoundation/oneTBB

tbb::global_control does not stop spinning threads

Closed · 1 comment

Summary

Reducing the maximum allowed parallelism via tbb::global_control no longer "stops" the surplus worker threads. They keep spinning in thread_dispatcher::process() and consume CPU time, which slows down execution.

The problem is triggered by first executing e.g. tbb::parallel_for_each (to "fire up" the threads), then reducing the number of threads to 1 via tbb::global_control, and then executing a second tbb::parallel_for_each. While the second tbb::parallel_for_each runs, only a single core should be active. In the debugger I can confirm that the business logic is executed by a single thread only. However, the task manager shows that all cores are busy. The unnecessary threads make the code run ~50% slower (either because of oversubscription or because the CPU frequency gets reduced; I am not sure which).

Version

Version 2021.10.0 was ok. Version 2021.11.0 is broken. The current master (55bf2b3) is still broken.

I was able to bisect the problem to commit c456844 (pull request: #758).

Environment

  • Windows 11
  • Intel Core i9-13900
  • Microsoft Compiler version 19.40 (_MSC_FULL_VER=194033811)

Observed Behavior

100% CPU load (all cores are busy) even though tbb::global_control was used to reduce the max. allowed parallelism to 1.

Expected Behavior

After setting the max. allowed parallelism to 1 via tbb::global_control, only a single core should be busy (corresponds to ~3% CPU load in the task manager because of 32 virtual cores of my Intel i9-13900).
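The ~3% figure follows from one fully busy logical processor out of 32. A minimal sketch of that arithmetic, where the helper name expected_load_percent is hypothetical and not part of TBB:

```cpp
#include <cassert>

// Hypothetical helper (not a TBB API): the overall CPU load, in percent,
// that a task manager would show when `busy_threads` logical processors
// are fully busy on a machine with `logical_cores` logical processors.
double expected_load_percent(unsigned busy_threads, unsigned logical_cores)
{
  assert(logical_cores > 0);
  return 100.0 * busy_threads / logical_cores;
}
```

For the machine in this report, expected_load_percent(1, 32) gives 3.125, while the observed behavior corresponds to expected_load_percent(32, 32), i.e. 100.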

Steps To Reproduce

  • Code:
#include <chrono>
#include <cmath> // std::sin
#include <iostream>
#include <numeric>
#include <tbb/global_control.h>
#include <tbb/parallel_for_each.h>
#include <tbb/version.h>
#include <vector>

int main()
{
  std::cout << "Start. TBB: " << TBB_VERSION_STRING << ", TBB_runtime_version=" << TBB_runtime_version()
            << ", TBB_runtime_interface_version=" << TBB_runtime_interface_version() << ", MSVC: " << _MSC_FULL_VER
            << std::endl;

  //------------------------------------------------
  // First call of tbb::parallel_for_each() to 'create' threads
  static constexpr bool TRIGGER_BUG = true;
  if (TRIGGER_BUG) {
    std::cout << "Warmup to trigger bug" << std::endl;
    std::vector<double> args(1024, 42.0);
    tbb::parallel_for_each(args, [](double & arg) {});
    std::cout << "Warmup finished." << std::endl;
  }

  //------------------------------------------------
  // Reduce number of threads
  std::cout << "Running test" << std::endl;
  static constexpr size_t NUM_CORES = 1;
  tbb::global_control tbbControl(tbb::global_control::max_allowed_parallelism, NUM_CORES);

  //------------------------------------------------
  // Second call of tbb::parallel_for_each()
  std::vector<double> args(1024, 42.0);
  auto const startTime = std::chrono::high_resolution_clock::now();
  tbb::parallel_for_each(args, [](double & arg) {
    for (size_t i = 0; i < 1000000; ++i) {
      arg += std::sin(arg);
    }
  });

  double const elapsed
      = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - startTime)
            .count()
        / 1000.0;

  double const result = std::accumulate(args.begin(), args.end(), 0.0);
  std::cout << "Used " << NUM_CORES << " cores. Finished in " << elapsed << "s: " << result << std::endl;
}
  • Build using CMake:
cmake -DCMAKE_INSTALL_PREFIX="path\to\install\dir" -DTBB_TEST=OFF ..
cmake --build .
cmake --install .
  • Execute and observe in the task manager that all cores are busy (this is the problem). Also note the execution time.
  • Then either revert to an older TBB version or set TRIGGER_BUG=false, and run again. Result: only a single core is busy, and the execution time drops by ~50%.
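Instead of watching the task manager, the effect could also be quantified from inside the process by comparing consumed CPU time with elapsed wall-clock time. The following is only a sketch under assumptions: process_cpu_seconds and average_busy_cores are hypothetical helper names, not part of TBB. On Windows it uses the Win32 call GetProcessTimes, because std::clock in the Microsoft runtime reports wall-clock time; elsewhere it assumes std::clock reports process CPU time, as it does on common POSIX systems.

```cpp
#include <cassert>
#include <chrono>
#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#else
#  include <ctime>
#endif

// Hypothetical helper: CPU time (user + kernel) consumed so far by all
// threads of the current process, in seconds.
inline double process_cpu_seconds()
{
#ifdef _WIN32
  FILETIME creationTime, exitTime, kernelTime, userTime;
  if (!GetProcessTimes(GetCurrentProcess(), &creationTime, &exitTime, &kernelTime, &userTime)) {
    return 0.0;
  }
  auto const toSeconds = [](FILETIME const & ft) {
    ULARGE_INTEGER v;
    v.LowPart = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return static_cast<double>(v.QuadPart) * 100e-9; // FILETIME counts 100 ns ticks
  };
  return toSeconds(kernelTime) + toSeconds(userTime);
#else
  // Assumption: std::clock() measures process CPU time on this platform.
  return static_cast<double>(std::clock()) / CLOCKS_PER_SEC;
#endif
}

// Hypothetical helper: runs `work` and returns CPU seconds divided by
// wall-clock seconds, i.e. the average number of busy cores during the call.
template <typename Work>
double average_busy_cores(Work && work)
{
  double const cpuStart = process_cpu_seconds();
  auto const wallStart = std::chrono::steady_clock::now();
  work();
  double const wall
      = std::chrono::duration<double>(std::chrono::steady_clock::now() - wallStart).count();
  double const cpu = process_cpu_seconds() - cpuStart;
  return wall > 0.0 ? cpu / wall : 0.0;
}
```

Wrapping the second tbb::parallel_for_each of the reproducer in average_busy_cores([&] { ... }) should yield a value close to 1 when only one thread is active. If the worker threads spin as described in this report, the value would be expected to be noticeably larger.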

Hi @Sedeniono, thank you for the report. I'm actually surprised this bug was not reported sooner :)