kenba/opencl3

Support running multiple instances of the same kernel simultaneously

Closed this issue · 7 comments

kenba commented

@vmx has created a couple of PRs to add Clone traits: #4 and #7 and another PR (#6) to add Send and Sync traits.

I'm not happy with these PR's, I don't think that they take the correct approach.

Cloning

The first issue is what should Clone` do?

@vmx has implemented Clone using the appropriate OpenCL clRetain* functions.

According to the OpenCL spec, the clRetain* functions, increment an object's reference count, i.e. they perform a shallow copy. This use of Clone is similar to using shared pointers in C++ and nowhere near as powerful as Rust references.

Most opencl3 objects are immutable after they have been constructed. Normally, only the Drop trait is mutable to implement RAII by calling the relevant clRelease* function. Therefore, the opencl3 objects can (and should) be accessed by immutable Rust references wherever possible. The exception is Context, where: sub-devices, command queues, and programs are added to a Context in the Initialisation phase:

OpenCL Application Lifecycle
Figure 1 OpenCL Application Lifecycle

However, after Initialisation., Contexts can (and should) also be accessed by immutable Rust references.

Multi-threading

The Query and Initialisation phases of an application should be performed before threads are started. Once the relevant context(s) have been created they can be shared with the threads using immutable Rust references, or new command queues and kernels can be created and moved to the new threads.

Each thread should operate independently; i.e. each should have it's own: command queues, memory buffers and especially kernels.

If an OpenCL application wishes to run multiple instances of a Kernel simultaneously on different threads, then it should use different instances of the Kernel on the threads, not references to the same Kernel instance, see: OpenCL multiple host threads.

Therefore, there is definitely a need to provide multiple deep copies of a Kernel to support multi-threading.

I cannot see a need to Send or Sync OpenCL objects with the possible exception of Events.
Events can be used to synchronise OpenCL execution on multiple command queues in the same context. However, it should be possible to use immutable references or pass them from a thread after it has finished execution.

vmx commented

tl;dr: This comment is about devices and platforms. Both don't need reference counting, so I proposing implementing Copy + Clone for both of them (or at least for devices, which is what I need).

@kenba: In the Multi-threading section, you mention that multiple context might be created. Currently if you create context from a device, the context owns the device. You cannot pass on a reference.

This means that if you create two contexts from the same device, you would need a copy of that device. For me, this is what Clone trait is about. In case of devices and platform I would implement derive the Copy trait as no reference counting is needed.

With Clone implemented for devices you can write:

    let mut context = Context::from_device(device.clone()).unwrap()
    let mut context_second = Context::from_device(device).unwrap()

Without Clone it would be:

    let device_clone = Device::new(device.id());

    // Create a Context and a queue on the device
    let mut context = Context::from_device(device).unwrap()
    let mut context_second = Context::from_device(device_clone).unwrap()

which I find less ergonomic.

In my case, we create a list of devices ones at startup. So every call that needs to own a device would need that extra let device_clone = Device::new(device.id()); trick.

The same applies to platforms.

vmx commented

tl;dr: This comment is about implementing Sync and Send for devices and platforms.

As mentioned in the previous comment, in the code I'm working on, we initialize a list of devices and platform once at startup. I think that's a pattern worth supporting. But in order to be able to use things in lazy_static/once_cell, they need to be be Sync and Send.

I could create my own newtype wrappers around devices and platforms, but I'd hope that opencl3 would just implement those, so that no extra workarounds would be needed. Especially since it's safe to have device and plaform Sync and Send as they are indeed thread-safe.

vmx commented

tl;dr: This comment is about implementing Send for most of the types.

Once the relevant context(s) have been created they can be shared with the threads using immutable Rust references

I don't have much experience with multi-threaded Rust code, so I read about things and tried out what it might look like. "immutable Rust references" sounds like an Arc to me. @kenba please let me know if you meant something else. An example I came up with is this:

use cl3::device::CL_DEVICE_TYPE_ALL;

use opencl3::command_queue::CL_QUEUE_PROFILING_ENABLE;
use opencl3::context::Context;
use opencl3::device::Device;
use opencl3::platform::get_platforms;
use std::thread;
use std::sync::Arc;

fn main() {
    let platforms = get_platforms().unwrap();

    let devices = platforms[0]
        .get_devices(CL_DEVICE_TYPE_ALL)
        .expect("Platform::get_devices failed");

    let device = Device::new(devices[0]);

    let mut context = Arc::new(Context::from_device(device).expect("Context::from_device failed"));
    let context_clone = Arc::clone(&context);

    let handle = thread::spawn(move || {
        context_clone
            .create_command_queues(CL_QUEUE_PROFILING_ENABLE);
    });

    handle.join();
}

In this case I would want to do run some context on some other thread. This program fails to compile due to (I shortened the error message):

error[E0277]: `*mut c_void` cannot be sent between threads safely
   --> src/main.rs:22:18
    |
22  |     let handle = thread::spawn(move || {
    |                  ^^^^^^^^^^^^^ `*mut c_void` cannot be sent between threads safely
    | 
   ::: /home/vmx/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:617:8
    |
617 |     F: Send + 'static,
    |        ---- required by this bound in `spawn`
    |
    = help: within `opencl3::context::Context`, the trait `Send` is not implemented for `*mut c_void`
    = note: required because it appears within the type `opencl3::context::Context`
    = note: required because of the requirements on the impl of `Send` for `Arc<opencl3::context::Context>`
    = note: required because it appears within the type `[closure@src/main.rs:22:32: 25:6]`

The problem is that opencl3::context::Context doesn't implement Send.

Implementing Send for most of the types means that it would be possible to move something into another thread. I think that's very useful and something worth supporting. The real world case we have is using Rayon, which moves things between threads as it likes, hence also needs Send implemented.

kenba commented

@vmx you raise a very good point about Context::from_device consuming Device. I shall change it so that it takes a reference to a device instead.

With regards to your views on Clone, Send and Sync they are completely different to mine. Which are that Clone, Send and Sync are only required for mutable data and objects. Most of the types in this library can (and should) be immutable and used by taking references to them, see: Thinking outside the synchronisation quadrant.

In my understanding an OpenCL application should only create one context per platform, whether running multi or single threaded. After a Context has been initialised it can be considered "read-only", i.e. after the Programs, Kernels and CommandQueues have been built, e.g. at line 84 in integration_test.rs or line 113 of opencl2_kernel_test.rs. All of the methods to get Kernels, CommandQueues, Buffers and creating svm memory, take an immutable reference to a Context. References to Contexts could be passed to other threads, instead of having them create their own Contexts.

However, it would be better for other threads to be able to take their own copies of CommandQueues and especially Kernels from Context references, that way they can work independently.

The one object that you may want to copy and send between threads is an Event, as stated above.

I don't have time to change this now, I'll have a look at it at the weekend.

Please let me know your thoughts Volker.

vmx commented

With regards to your views on Clone, Send and Sync they are completely different to mine. Which are that Clone, Send and Sync are only required for mutable data and objects.

I agree that as much immutability should be used as possible. Those those traits are not about mutability. They are about ownership. Let's take Send as an example. It used to signal that an object can safely be moved between threads, which is what I am after.

The Rust API Guidelines mention in the interoperability section, specifically that you should implement Copy and Clone eagerly, which I think makes the case for Device and Platform. The guidelines also say that Send and Sync should be implemented where possible.

References to Contexts could be passed to other threads, instead of having them create their own Contexts.

As I mentioned in my previous comment, I don't understand how you would do that without have Send implemented on Context. Can you please give an example in how you give another thread a reference to a Context?

I don't have time to change this now, I'll have a look at it at the weekend.

Take your time. Thanks for communicating the timeline so clearly, that's helpful. I've the luxury of being able to comment during my work hours.

kenba commented

@vmx I've been able to implement Copy, Clone, Send and Sync traits for Platform and Device.
However, I need to redesign Kernel and Context to implement their lifetimes properly and I didn't have time this weekend, but hopefully I will next weekend.

kenba commented

I've thoroughly considered sharing OpenCL objects across threads and come to the conclusion that there is a fundamental issue with OpenCL object lifetimes which makes sharing OpenCL objects derived from a Context unsafe across threads.

Object lifetimes are one of Rust's great strengths but they are rarely, if ever, considered when using other programming languages. Unfortunately, the OpenCL API does not explicitly state the lifetimes of OpenCL objects. It's fairly clear that Platforms and Devices don't have lifetimes, they can be expected to be around until the end of the program. However, the lifetimes of OpenCL contexts and the objects that are based them are less clear.

I consider that all items created from an OpenCL Context should not outlive the Context that they were created from. That is why Programs, Kernels, Sub Devices and Command Queues are contained in the Context that they were created from and why shared virtual memory has a lifetime determined by a reference to a Context. Rust references help to enforce these lifetimes, but they cannot be used across threads..

@vmx Platform and Device can be safely copied (and sent and synced) between threads now, so you can use them to create independent Contexts in different threads. However, I don't believe that you should be able to clone objects such as Command Queues between threads, so I have reverted RP #4.

If an OpenCL expert can explain to me how and why OpenCL objects such as Programs Kernels and Command Queues outlive their Contexts, then I will be happy to revisit this issue. But in the meantime, I'm not extending support for the Clone, Send or Sync traits.