Operator Fusion

Question

Opened this issue 9 months ago · 1 comment

Operator fusion is crucial and ties into code generation.

The above graph demonstrates the success of our current inplacing algorithm.

However, we need to take this a step further and go from Inplacing to Inlining.

fn main(...) {
    let x_offset = group_id.x * 64u;
    var dst_offset = (group_id.y * num_groups.x * 64u) + x_offset + local_index;

    // Convert 1D offset into 4D index
    let dst_index = offsetToNdIndex(dst_offset, metadata.dst_stride);

    // Scatter each destination coordinate to its permuted source axis
    var src_index = vec4<u32>(0u);
    src_index[metadata.perm[0]] = dst_index[0];
    src_index[metadata.perm[1]] = dst_index[1];
    src_index[metadata.perm[2]] = dst_index[2];
    src_index[metadata.perm[3]] = dst_index[3];

    // Convert 4D index into 1D offset
    let src_offset = ndIndexToOffset(src_index, metadata.src_offsets, metadata.src_stride);

    Y[dst_offset] = X[src_offset];
}

The above is our current permute shader. Instead of performing subsequent injective operations on the output buffer of permute, we could inline all of the injective operations like so:

fn main(...) {
    // offset and index computation omitted (same as above)
    Y[dst_offset] = cos(exp(gelu(X[src_offset])));
}

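This (contrived) example would collapse the permute and the three injective operations that follow it into a single node, so the intermediate results never have to be written to and read back from a buffer.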
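
As a rough sketch of what generating that inlined store could look like, the snippet below builds the final line of the shader from a list of injective ops. The function name `inline_injective` and its parameters are hypothetical and only for illustration. Note that `cos` and `exp` are WGSL built-ins while `gelu` is not, so a real generator would also have to emit a `gelu` helper into the shader.

```python
def inline_injective(ops, src="X[src_offset]", dst="Y[dst_offset]"):
    """Build the store statement for a chain of unary injective ops.

    `ops` is given in application order, so the first op ends up innermost.
    The names here are illustrative, not part of any existing API.
    """
    expr = src
    for op in ops:
        # Wrap the expression built so far in the next op's call.
        expr = f"{op}({expr})"
    return f"{dst} = {expr};"


# gelu is applied first, then exp, then cos.
print(inline_injective(["gelu", "exp", "cos"]))
```

With an empty op list the function falls back to the plain permute store, so the same template could serve both the fused and the unfused case.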

Answer 1 · 2024-05-09T17:30:16.000Z

Sharing my thoughts from our conversation:

You'll want to introduce an IR that keeps track of the size of each tensor and the "type" of each operation.
You can coalesce operations with the same "type". In the example you've given, cos, exp, and gelu are all elementwise operations, so you can bundle them into a single node.
For this, you will need runtime code generation for each IR node, as you will no longer know ahead of time what your final execution environment will look like.
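
A minimal sketch of the IR and the coalescing step described above, assuming a linear chain of operations. The class names (`Node`, `FusedNode`), the `kind` labels, and the rule "only merge adjacent elementwise nodes with matching shapes" are assumptions made for illustration, not an existing design.

```python
from dataclasses import dataclass


@dataclass
class Node:
    op: str        # operation name, e.g. "cos"
    kind: str      # hypothetical operation "type", e.g. "elementwise"
    shape: tuple   # size of the output tensor


@dataclass
class FusedNode:
    kind: str
    shape: tuple
    ops: list      # op names in application order


# Assumption: only elementwise ops are safe to bundle in this sketch.
FUSABLE = {"elementwise"}


def coalesce(nodes):
    """Merge runs of adjacent fusable nodes that share kind and shape."""
    fused = []
    for n in nodes:
        last = fused[-1] if fused else None
        if (last is not None and n.kind in FUSABLE
                and last.kind == n.kind and last.shape == n.shape):
            last.ops.append(n.op)  # extend the current bundle
        else:
            fused.append(FusedNode(n.kind, n.shape, [n.op]))
    return fused


chain = [
    Node("permute", "permute", (4, 8)),
    Node("gelu", "elementwise", (4, 8)),
    Node("exp", "elementwise", (4, 8)),
    Node("cos", "elementwise", (4, 8)),
    Node("matmul", "matmul", (4, 4)),
]
for node in coalesce(chain):
    print(node.kind, node.shape, node.ops)
```

Each resulting `FusedNode` carries an op list that is only known once the graph has been built, which is why its shader source would have to be generated at runtime rather than picked from a fixed set of prewritten kernels.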