robbert-harms/MDT

Threading when using CPUs

celstark opened this issue · 4 comments

We've been happily using MDT on several workstations with GPUs. We are now moving to a cluster that has no GPUs but does have nodes with 144 CPU cores.

Singularity> mdt-list-devices -l
Device 0:
===========================================================================
<pyopencl.Platform 'Portable Computing Language' at 0x7f43d8000020>
===========================================================================
extensions: cl_khr_icd
name: Portable Computing Language
profile: FULL_PROFILE
vendor: The pocl project
version: OpenCL 1.2 pocl 1.1 None+Asserts, LLVM 6.0.0, SPIR, SLEEF, DISTRO, POCL_DEBUG

---------------------------------------------------------------------------
<pyopencl.Device 'pthread-Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz' on 'Portable Computing Language' at 0x2e20aa0>
---------------------------------------------------------------------------
...
max compute units: 144

Yet when I run mdt-model-fit on this node, only about one CPU core is fully loaded. I've been polling the CPU usage and the highest I've seen is 113%. My fans aren't even getting a workout.

I'd thought that OpenCL would spread the work across all the CPU cores I have, yet it seems to be running single-threaded. Any ideas?

Craig

Hi Craig,

I see you are using the POCL library. So far I haven't been very successful with POCL: some kernels compile, some don't. I think what you are seeing is that POCL is still compiling one of the compute kernels, and this compilation is single-threaded. Only after it is done does the actual computation begin, and that part is always multi-threaded.

It has been a while since I last checked in on the POCL project. Maybe newer versions have improved compilation.

For now, could you try installing the Intel OpenCL CPU drivers? They have worked well for me in the past and should work with MDT.

Best,

Robbert
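
After installing a different OpenCL runtime, it can help to confirm that it is actually visible before starting a long fit. Below is a minimal sketch, assuming pyopencl is importable in the same environment (the mdt-list-devices output above shows pyopencl objects). The names list_cpu_devices and main are illustrative and not part of MDT.

```python
def list_cpu_devices(platforms, cpu_type):
    """Return (platform name, device name, compute units) for each CPU device.

    `platforms` is what pyopencl.get_platforms() returns, and `cpu_type`
    is pyopencl.device_type.CPU. Both are passed in so that this helper
    does not itself depend on an OpenCL runtime being installed.
    """
    found = []
    for platform in platforms:
        for device in platform.get_devices():
            # device.type is a bitfield, so test the CPU bit rather than equality.
            if device.type & cpu_type:
                found.append(
                    (platform.name, device.name, device.max_compute_units)
                )
    return found


def main():
    # Imported lazily so the helper above can be used without pyopencl.
    import pyopencl as cl

    rows = list_cpu_devices(cl.get_platforms(), cl.device_type.CPU)
    for platform_name, device_name, units in rows:
        print(f"{platform_name}: {device_name} ({units} compute units)")
```

Calling main() on the cluster node should print one line per CPU device. If the only platform listed is still 'Portable Computing Language', the new driver is probably not being picked up.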

OK, it took some digging, but I got the Intel OpenCL driver in place and managed to capture it using 14400% of CPU -- so all 144 cores. 15.8s of real time and 11m32s of user time. Dang, that's some nice parallel work!

FWIW, I just benchmarked one run: 1m7s start to finish on a GTX 1080, versus 26s on one of these 144-core boxes.
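
For anyone wanting to put the timings above on a common scale, here is a small sketch of the arithmetic. It assumes the figures are in the usual `time` style ("11m32s", "15.8s"); the function names are made up for illustration. Dividing user time by real time gives the average number of busy cores over the whole run. That average can sit well below the 144-core peak if part of the run is serial, such as startup or kernel compilation.

```python
import re


def to_seconds(text):
    """Convert a duration such as '11m32s', '1m7s' or '15.8s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def average_busy_cores(user, real):
    """User CPU time divided by wall-clock time."""
    return to_seconds(user) / to_seconds(real)


def wall_clock_ratio(slower, faster):
    """How many times longer the slower run took."""
    return to_seconds(slower) / to_seconds(faster)


# Figures quoted in this thread.
print(f"{average_busy_cores('11m32s', '15.8s'):.1f}")  # 43.8
print(f"{wall_clock_ratio('1m7s', '26s'):.1f}")        # 2.6
```

So the captured run kept roughly 44 cores busy on average, and the benchmarked CPU run finished in about 2.6 times less wall-clock time than the GTX 1080 run.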

Hi Craig,

I am happy to hear that you solved it. Could you share your final solution? It might help others in a similar situation.

Best,

Robbert