GPU support in Exo
skeqiqevian opened this issue · 3 comments
Wanted to preface by saying that this design is definitely not done. However, I wanted to describe the proposal based on discussions with William and get some feedback, since I'll be on vacation next week.
ExoIR representation of CUDA abstractions
CUDA has three primary abstractions that we want to support: memories, parallel hierarchy, and synchronization. We first describe how we represent these in ExoIR. Later, we will describe the necessary safety checks to prevent users from generating bad CUDA code. Ideally, we want to prevent both data races and deadlocks.
Parallel Hierarchy
We will represent parallel block/thread loops as parallel loops with special annotations (e.g. `@THREADS` or `@BLOCK`). In CUDA, these loops are always implicit because users are tasked with writing thread programs. In our programming model, we require users to explicitly write parallel loops. Users may write consecutive parallel loops; e.g., the following corresponds to running threads 0-7 all in parallel:
```
for i in par(0, 4) @ THREADS:
    ...
for i in par(4, 8) @ THREADS:
    ...
```
Code generation from this programming model to the CUDA programming model is simple. Each such block/thread loop actually corresponds to an `if` statement which predicates over the specified loop range:
CUDA C++:

```cuda
if (blockIdx.x < 5) {
    if (threadIdx.x < 5) {
        ...
    }
}
```

Exo:

```
for i in par(0, 5) @ BLOCKS:
    for i in par(0, 5) @ THREADS:
        ...
```
Excessive generation of if statements
This approach may generate vacuously true `if` statements (e.g. when iterating over all threads), so we should prune those. In particular, unless there is complex block-level synchronization, all of the block-level loops will likely generate vacuously true `if` statements.
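For illustration, here is a tiny Python sketch of that pruning decision. This is not Exo's actual code generator; `n_threads` stands for the blockDim.x the kernel is assumed to be launched with:

```python
# Sketch only: decide whether a @THREADS loop needs a guard in the generated CUDA.
# lo/hi are the par() bounds; n_threads is the assumed blockDim.x of the launch.
def thread_guard(lo, hi, n_threads):
    if lo == 0 and hi == n_threads:
        return None  # loop covers every thread, so the guard would be vacuously true
    preds = []
    if lo > 0:
        preds.append(f"{lo} <= threadIdx.x")
    if hi < n_threads:
        preds.append(f"threadIdx.x < {hi}")
    return " && ".join(preds)

print(thread_guard(0, 512, 512))  # None -> no if statement emitted
print(thread_guard(0, 32, 512))   # "threadIdx.x < 32"
```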
Memory
We will define specialized memory classes for shared memory (`SHARED_MEMORY`) and thread-local registers (`THREAD_REGISTER`), just as we did for AVX vector registers (a rough sketch follows the list below). These memories require additional checks:
- shared memory should only be accessible by threads within the same block.
- thread-local registers should only be accessible by the thread that owns them.
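As a rough sketch of what `SHARED_MEMORY` could look like, here is a Memory subclass modeled on Exo's AVX2 class in `exo/libs/memories.py`. The import path and the `alloc`/`free` hook signatures are assumptions about Exo's memory interface, not a verified implementation:

```python
from exo.memory import Memory, MemGenError  # assumed import path

class SHARED_MEMORY(Memory):
    @classmethod
    def global_(cls):
        return ""  # no extra includes needed in the generated code

    @classmethod
    def alloc(cls, new_name, prim_type, shape, srcinfo):
        if not shape:
            raise MemGenError(f"{srcinfo}: shared-memory allocations must be arrays")
        dims = "][".join(str(s) for s in shape)
        # Emit a block-level __shared__ declaration in the generated CUDA.
        return f"__shared__ {prim_type} {new_name}[{dims}];"

    @classmethod
    def free(cls, new_name, prim_type, shape, srcinfo):
        return ""  # shared memory is reclaimed when the block exits
```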
Synchronization
We want to give users control over synchronization. Thus, it is the user's responsibility to properly insert synchronization primitives into their code. At compilation time, we will verify that the user inserted syncs properly before generating CUDA code. In CUDA, synchronization can be performed over arbitrary predicates (like below). However, in Exo, we will need to restrict ourselves to predicates over index expressions. As a design choice to avoid reasoning about complicated synchronization patterns, we choose to make synchronizations happen outside of the parallel for loops. Thus, Exo code will place sync statements at the same scope as the parallel loops they separate:
CUDA C++:

```cuda
cuda::barrier bar;
bar.init(512);
if (predicate(threadIdx.x)) {
    bar.arrive();
}
```

Exo:

```
bar: barrier[512]
sync(bar, predicate)
```
To avoid deadlocking, we need to check that the specified number of threads arrives at the barrier for an arbitrary predicate. To start out, perhaps we should restrict the predicates to simple ranges, e.g. `[lo, hi]`.
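As an illustrative (purely hypothetical) lowering of the range-restricted case, `sync(bar, lo, hi)` could become a guarded arrive/wait on a `cuda::barrier` that was initialized with `hi - lo` expected arrivals; the helper below just emits that CUDA text:

```python
# Hypothetical lowering of sync(bar, lo, hi) to CUDA source text.
# Only threads in [lo, hi) participate in the barrier.
def lower_sync(bar, lo, hi):
    return "\n".join([
        f"if ({lo} <= threadIdx.x && threadIdx.x < {hi}) {{",
        f"    auto token = {bar}.arrive();",
        f"    {bar}.wait(std::move(token));",
        "}",
    ])

print(lower_sync("bar", 0, 32))
```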
Safety checks
Memory safety
Our proposed programming model doesn't require an entire thread program to be in a single loop over threads, so situations can arise where thread-local registers persist across multiple thread loops, e.g.:
```
reg: i32[8]
for i in par(0, 8) @ THREADS:
    # do something with reg
sync(bar)
for i in par(0, 8) @ THREADS:
    # do something with reg
```
Therefore, the thread registers may be allocated external to the thread loops. When that happens, the first dimension should be the number of threads. Furthermore, we need to check that each thread only reads from its own registers. We will need to do similar analysis for shared memory and `@BLOCK` loops.
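A minimal sketch of that ownership check follows, where `Access` is a toy stand-in for an ExoIR read/write node (the string comparison on the leading index is a deliberate simplification):

```python
from dataclasses import dataclass

@dataclass
class Access:
    buffer: str      # name of the buffer being accessed
    indices: tuple   # index expressions, as strings

def check_thread_registers(loop_var, accesses, thread_register_bufs):
    """Every access to a THREAD_REGISTER buffer allocated outside the loop
    must use the loop variable as its leading index."""
    for acc in accesses:
        if acc.buffer in thread_register_bufs:
            if not acc.indices or acc.indices[0] != loop_var:
                raise ValueError(
                    f"{acc.buffer}[{', '.join(acc.indices)}]: a thread may only "
                    f"touch its own registers (leading index must be '{loop_var}')")

check_thread_registers("i", [Access("reg", ("i",))], {"reg"})      # OK
# check_thread_registers("i", [Access("reg", ("i+1",))], {"reg"})  # raises
```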
Parallel safety
We consider a pair of threads to be non-interfering if each thread's write set is disjoint from the other thread's read and write sets. Race conditions are not possible between non-interfering threads because they write to disjoint memory locations (they may still read from shared read-only memory). Such "embarrassingly parallel" code does not require any synchronization. Below are some examples of non-interfering parallel threads (a brute-force sketch of this check follows the examples):
```
for i in par(0, N) @ THREADS:
    b[i] = a[i] + a[i + 1]
```

```
for i in par(0, N/2) @ THREADS:
    a[i] = 1.0
for i in par(N/2, N) @ THREADS:
    a[i] = 0.0
```
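To make the non-interference check concrete, here is a brute-force version for a fixed loop extent (Exo's real analysis reasons symbolically about the index expressions rather than enumerating iterations):

```python
# Each loop iteration gets a read set and a write set of (buffer, index) pairs.
# The loop is non-interfering if no iteration's writes touch another
# iteration's reads or writes.
def non_interfering(read_sets, write_sets):
    n = len(write_sets)
    for i in range(n):
        for j in range(n):
            if i != j and write_sets[i] & (read_sets[j] | write_sets[j]):
                return False
    return True

# b[i] = a[i] + a[i+1] with N = 4: writes are disjoint, reads may overlap.
N = 4
reads  = [{("a", i), ("a", i + 1)} for i in range(N)]
writes = [{("b", i)} for i in range(N)]
assert non_interfering(reads, writes)
```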
Exo's existing analysis for OpenMP parallelism performs this exact check. However, it currently assumes that the parallel loops exist in the outermost scope. We need to extend this approach to nested parallel loops and synchronization.
Proposed analysis
Disclaimer: I don't currently know the specifics of implementing such an analysis. I'll need to talk with Yuka and Gilbert to better understand what they are doing with Abstract Interpretation. But I think this describes, at a high level, the kind of checks we need to perform.
We require users to insert synchronization into their code to break the code into sections of non-interference. The analysis needs to verify that, in between synchronizations, threads are non-interfering. To do so, for each thread, we track the memory locations that it can access safely. As we iterate through the program (a rough sketch of this bookkeeping follows the list):
- For each parallel loop, we prune these access lists by the writes of the other loop iterations. In order to enforce non-interference, no two threads can write to the same location. Thus, writes make those memory locations "exclusive" to that thread.
- If a thread ever tries to access a memory location outside of its access list, then the analysis should fail.
- For each synchronization, we union all of the access lists of those threads, and assign each thread that list. After synchronization, each thread can safely access all memory locations that were previously exclusive to one of the threads.
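A very simplified Python sketch of this bookkeeping for a fixed thread count is below; the tuple-based program representation is a toy stand-in for ExoIR, and the rules mirror the bullets above:

```python
def run_analysis(n_threads, all_locs, program):
    # Initially every thread may access every location.
    access = [set(all_locs) for _ in range(n_threads)]
    for step in program:
        if step[0] == "par":   # ("par", fn) where fn(t) -> (reads, writes) of thread t
            fn = step[1]
            per_thread = [fn(t) for t in range(n_threads)]
            all_writes = [w for (_, w) in per_thread]
            for t in range(n_threads):
                reads_t, writes_t = per_thread[t]
                others = set().union(*(all_writes[:t] + all_writes[t + 1:]))
                # A thread may only touch locations it can still access and
                # that no other thread writes in this phase.
                assert (reads_t | writes_t) <= access[t] - others, f"thread {t}: race"
            # Writes become exclusive to the writing thread.
            for t in range(n_threads):
                for u in range(n_threads):
                    if u != t:
                        access[u] -= per_thread[t][1]
        elif step[0] == "sync":
            # Union the access lists and give the result to every thread.
            merged = set().union(*access)
            access = [set(merged) for _ in range(n_threads)]
    return access

# Mirrors the walk-through below (the a[i+1] read is omitted for brevity):
locs = {("a", i) for i in range(4)} | {("b", i) for i in range(4)}
prog = [
    ("par", lambda t: (set(), {("a", t)})),       # first loop: write a[t]
    ("sync",),
    ("par", lambda t: ({("a", t)}, {("b", t)})),  # second loop: read a[t], write b[t]
]
run_analysis(4, locs, prog)
```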
Analysis Example
As an example, consider the following program:
```
bar: Barrier(threads=4)
for i in par(0, 4):
    a[i] = ...
sync(bar, "i < 4")
for i in par(0, 4):
    b[i] = a[i]
    if i+1 < 4:
        b[i] += a[i+1]
```
The analysis progression would update the accessible memory locations as follows:
Initially, all memories are accessible by all threads.
| thread | memories |
|---|---|
| 0 | a[...], b[...] |
| 1 | a[...], b[...] |
| 2 | a[...], b[...] |
| 3 | a[...], b[...] |
After the first loop, the `a[i]`s are exclusive because they are written to.

| thread | memories |
|---|---|
| 0 | a[0], b[...] |
| 1 | a[1], b[...] |
| 2 | a[2], b[...] |
| 3 | a[3], b[...] |
After the sync, the `a[i]`s are no longer exclusive.

| thread | memories |
|---|---|
| 0 | a[...], b[...] |
| 1 | a[...], b[...] |
| 2 | a[...], b[...] |
| 3 | a[...], b[...] |
After the second loop, none of the `a`s are affected because they are only read. However, the `b[i]`s are now exclusive.

| thread | memories |
|---|---|
| 0 | a[...], b[0] |
| 1 | a[...], b[1] |
| 2 | a[...], b[2] |
| 3 | a[...], b[3] |
Implementation - Not sure yet
The above analysis is doable for simple programs, but I'm less sure how to extend it to more complicated programs with more degrees of loop nesting. Below is an example of a fairly complicated program (warp specialization) that we would want our analysis to support.
- It has a loop with non-constant bounds (something that block/thread loops can't have)
- The synchronization fence is actually partway through the loop, so analysis needs to be able to handle non-interfering sections which may span different loop iterations.
CUDA C++:

```cuda
for (int i = 0; i < n_iters; i++) {
    if (threadIdx.x < 32) {
        produce(buf[i%2]);
    }
    __syncthreads();
    if (32 <= threadIdx.x) {
        consume(buf[i%2]);
    }
}
```

Exo:

```
for i in seq(n_iters):
    for j in par(0, 32) @ THREADS:
        produce(buf[i%2])
    sync(0, n_threads)  # producers done, consumers ready to receive
    for j in par(32, n_threads) @ THREADS:
        consume(buf[i%2])
```
Sidenote: Exo currently can't schedule circular-buffer optimizations, which would be necessary for the software pipelining that enables this producer-consumer model.
More examples of ExoIR
Taken from CUDA C++ Programming Guide 7.26.2.
CUDA C++:

```cuda
__global__ void split_arrive_wait(int iteration_count, float *data) {
    using barrier = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier bar;
    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() == 0) {
        init(&bar, block.size()); // Initialize the barrier with expected arrival count
    }
    block.sync();
    for (int curr_iter = 0; curr_iter < iteration_count; ++curr_iter) {
        /* code before arrive */
        barrier::arrival_token token = bar.arrive(); /* this thread arrives; arrival does not block a thread */
        compute(data, curr_iter);
        bar.wait(std::move(token)); /* wait for all threads participating in the barrier to complete bar.arrive() */
        /* code after wait */
    }
}
```

Exo:

```
bar: CUDA_BARRIER @ SHARED_MEMORY
for i in par(0, 1) @ THREADS:
    init(&bar, ...)
sync(0, N)
for i in par(0, N) @ THREADS:
    # code before arrive
sync(0, N)
for i in par(0, N) @ THREADS:
    compute(...)
sync(0, N)
for i in par(0, N) @ THREADS:
    # code after wait
```
Hi Kevin,
Here are the two big comments/questions:
- Is the plan that this will be developed in a fork from Exo for the purposes of a Master's project or is this a proposal for a core Exo feature? Maybe a fork would give you more freedom to quickly start experimenting, and not get you stuck in trying to have a consistent design. For example, you could just create an "Exo-CUDA" language instead of worrying about how to externalize the CUDA programming model.
- My original understanding of this project was that the idea was to explore very conservative (w.r.t. concurrency/synchronization) GPU support for Exo. The sketch here seems to be headed towards supporting synchronization primitives, which will be non-trivial because it could invalidate the soundness of all currently existing analyses/program transforms. If it doesn't, it's unclear to me exactly why it doesn't.
You may want to look at something called the Bulk Synchronous Parallel (BSP) model as a potential grounding for what you are proposing.
Here are some further comments jotted down while reading:
- Your CUDA C++ and Exo labels on the first table are swapped
- Why would we define a `THREAD_REGISTER` memory type; don't you merely want a `GPU_DRAM` memory?
- What are the semantics of barriers, and how many other scheduling primitives are there? Can you do a version of this that doesn't use synchronization?
- "Our proposed programming model doesn't require an entire thread program to be in a single loop over threads" This is inconsistent with the CUDA programming model, right? So then, are you proposing to somehow transform the whole Exo program automatically to be consistent with this idea? If so, why is that an automatic transform instead of a scheduling transform?
- "(they may still read from shared read-only memory)" -- note that shared memory is not persistent across CUDA kernels. Consequently, you will never be allowed to read anything meaningful from shared memory.
- "For each synchronization, we union all of the access lists of those threads, and assign each thread that list." --- why is this the right thing to do?
- re: analysis example; how does this work when the loop bound is a variable rather than a constant?
- why does the barrier have a memory location in the last example?
- It will be helpful to mention the implicit synchronization proposal and talk about its pros and cons (the pros will be that the analysis is simple, and the cons will be the performance compromise due to the inflexibility).
- I think you want to talk about adding loop annotations (`@threads`, `@openmp`, and so on) and how that will affect the backend code generation.
- I guess we want a "procedure annotation for rewrite no-ops" for explicit synchronization calls (`sync(..)`), where users can define their own backend checks externally. It will also be useful for handling prefetch since it's also a no-op.
- The analysis will be more complex if threads and blocks can be non-integer-constant.
Notes from a discussion with Kevin and William:
- Even though consumer and server GPUs have slightly different design points, we probably want to focus on server GPUs for now
- Synchronization predicates could be arbitrarily complex, but expressing just a range might be sufficient for many workloads
- We should keep in mind that there are warp-synchronous operations and non-warp-synchronous operations