uxlfoundation/oneTBB

Some tests are apparently not meant to be executed on a system with 1 CPU


Hello. While building the Debian package for onetbb (version 2021.13.0) on a system with only one CPU, I noticed that some tests never return. This results in a build log like this when the package is built with sbuild:

[...]
 26/136 Test  #26: test_enumerable_thread_specific ..........   Passed    0.06 sec
        Start  27: test_concurrent_queue
 27/136 Test  #27: test_concurrent_queue ....................   Passed    0.24 sec
        Start  28: test_resumable_tasks
 28/136 Test  #28: test_resumable_tasks .....................   Passed   82.19 sec
        Start  29: test_mutex
 29/136 Test  #29: test_mutex ...............................   Passed    0.04 sec
        Start  30: test_function_node
E: Build killed with signal TERM after 90 minutes of inactivity
--------------------------------------------------------------------------------

If those tests are really not meant to be executed on single-CPU systems, it would be worth running them conditionally, like this:

--- a/test/CMakeLists.txt
+++ b/test/CMakeLists.txt
@@ -417,8 +417,10 @@
     tbb_add_test(SUBDIR tbb NAME test_concurrent_queue DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_resumable_tasks DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_mutex DEPENDENCIES TBB::tbb)
-    tbb_add_test(SUBDIR tbb NAME test_function_node DEPENDENCIES TBB::tbb)
-    tbb_add_test(SUBDIR tbb NAME test_multifunction_node DEPENDENCIES TBB::tbb)
+    if(SYSTEM_CONCURRENCY GREATER 1)
+        tbb_add_test(SUBDIR tbb NAME test_function_node DEPENDENCIES TBB::tbb)
+        tbb_add_test(SUBDIR tbb NAME test_multifunction_node DEPENDENCIES TBB::tbb)
+    endif()
     tbb_add_test(SUBDIR tbb NAME test_broadcast_node DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_buffer_node DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_composite_node DEPENDENCIES TBB::tbb)
@@ -442,7 +444,9 @@
     tbb_add_test(SUBDIR tbb NAME test_tagged_msg DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_overwrite_node DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_write_once_node DEPENDENCIES TBB::tbb)
-    tbb_add_test(SUBDIR tbb NAME test_async_node DEPENDENCIES TBB::tbb)
+    if(SYSTEM_CONCURRENCY GREATER 1)
+        tbb_add_test(SUBDIR tbb NAME test_async_node DEPENDENCIES TBB::tbb)
+    endif()
     tbb_add_test(SUBDIR tbb NAME test_input_node DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_profiling DEPENDENCIES TBB::tbb)
     tbb_add_test(SUBDIR tbb NAME test_concurrent_queue_whitebox DEPENDENCIES TBB::tbb)

On the other hand, it may also be the case that the tests are actually meant to be executed everywhere and they are just buggy. I don't know.

I discovered this by using single-CPU virtual machines in the cloud, but it can also be reproduced easily by setting GRUB_CMDLINE_LINUX="nr_cpus=1" and rebooting.
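For a quick check without rebooting, one possible approximation is to pin the test run to a single CPU through the affinity mask. This is only a sketch: `run_on_one_cpu` is a hypothetical helper name, it assumes `taskset` from util-linux is installed and that CPU 0 is available to the process, and it is not the same thing as booting with nr_cpus=1, so it may or may not trigger the same hang.

```shell
#!/bin/sh
# Hypothetical helper: run a command restricted to CPU 0, so that it sees
# a single CPU through its affinity mask.
run_on_one_cpu() {
    taskset -c 0 "$@"
}

# Sanity check: nproc honours the affinity mask, so this should report
# one available processing unit.
run_on_one_cpu nproc

# Possible usage from the build directory (not run here):
# run_on_one_cpu ctest --timeout 360 --output-on-failure
```

If the tests behave differently under this restriction than on a real single-CPU machine, the nr_cpus=1 boot parameter remains the more faithful reproduction.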

Thanks.

Hi @sanvila, we regularly run our tests in a single-threaded environment, so I would expect them to pass in your environment too.
Could you please run ctest --timeout 360 --output-on-failure so we can have a complete log of the failing tests?

Sure. This is the outcome:

98% tests passed, 3 tests failed out of 137

Total Test time (real) = 1259.61 sec

The following tests FAILED:
         30 - test_function_node (Timeout)
         31 - test_multifunction_node (Timeout)
         55 - test_async_node (Timeout)

Note: I had to do this with version 2021.12.0 because I had some problems with sphinx not related to this issue, but the outcome matches what I figured out without ctest for version 2021.13.0.
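To iterate faster, the three timed-out tests can be re-run on their own with ctest's `-R` regular-expression filter. The following is a sketch only: the variable name `FAILED_TESTS_REGEX` is made up for illustration, and the ctest line is left commented out because it has to be run from the build directory.

```shell
#!/bin/sh
# Anchored regex selecting exactly the three tests that timed out.
FAILED_TESTS_REGEX='^test_(function_node|multifunction_node|async_node)$'

# Sanity check of the regex with grep -E: prints the three failing test
# names and leaves out test_input_node.
printf '%s\n' test_function_node test_multifunction_node \
    test_async_node test_input_node | grep -E "$FAILED_TESTS_REGEX"

# From the build directory (not run here):
# ctest -R "$FAILED_TESTS_REGEX" --timeout 360 --output-on-failure
```

The anchors keep the filter from also selecting any other test whose name merely contains one of these names.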

@dnmokhov or @kboyarinov could you please take a look?

@sanvila, could you please tell us more about the CPU you are running on? It would help to better reproduce and identify any possible issues.
My guess is that the failures in these tests are caused by the huge number of test cases they cover: most of them are massively concurrent, so there may not be enough time to finish everything within the timeout defined by ctest. But we definitely need to double-check this on the same system as yours.

It's an AWS instance of type r7a.medium, which has only one vCPU. This vCPU is a 4th Generation EPYC processor from AMD. It's not particularly slow, but it's a single CPU. These are the specs:

https://aws.amazon.com/es/ec2/instance-types/r7a/

and this is the full contents of /proc/cpuinfo:

processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 25
model		: 17
model name	: AMD EPYC 9R14
stepping	: 1
microcode	: 0xa101148
cpu MHz		: 3700.089
cache size	: 1024 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 16
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core invpcid_single ssbd perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr rdpru wbnoinvd arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid flush_l1d
bugs		: sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass srso ibpb_no_ret
bogomips	: 5200.00
TLB size	: 3584 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

If you really can't reproduce the issue using GRUB_CMDLINE_LINUX="nr_cpus=1" and rebooting, I would be willing to provide a virtual machine for you.