huggingface/ratchet

Operator Fusion

Operator fusion is crucial and ties into code generation.

[Figure: allocations]

The above graph demonstrates the success of our current inplacing algorithm.

However, we need to take this a step further and go from Inplacing to Inlining.

fn main(...) {
    let x_offset = group_id.x * 64u;
    var dst_offset = (group_id.y * num_groups.x * 64u) + x_offset + local_index;

    //Convert 1D offset into 4D index
    let dst_index = offsetToNdIndex(dst_offset, metadata.dst_stride);

    var src_index = vec4<u32>(0u);
    src_index[metadata.perm[0]] = dst_index[0]; 
    src_index[metadata.perm[1]] = dst_index[1];
    src_index[metadata.perm[2]] = dst_index[2];
    src_index[metadata.perm[3]] = dst_index[3];
    
    //Convert 4D index into 1D offset
    let src_offset = ndIndexToOffset(src_index, metadata.src_offsets, metadata.src_stride);

    Y[dst_offset] = X[src_offset];
}

The above is our current permute shader. Instead of performing subsequent injective operations on the output buffer of permute, we could inline all of the injective operations like so:

fn main(...) {
    //omit
    Y[dst_offset] = cos(exp(gelu(X[src_offset])));
}

In this (contrived) example, permute and the three injective operations that follow it collapse into a single node, which is super important.
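
To make the inlining step concrete, here is a minimal Rust sketch of how the fused store line could be generated from a chain of injective ops. Everything in it is hypothetical (the `UnaryOp` enum, `inline_ops`, `fused_store`) and is not taken from the ratchet codebase. It also assumes that `gelu` is available as a helper function in the generated shader, since it is not a WGSL builtin.

```rust
/// Hypothetical set of injective (elementwise) ops; names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum UnaryOp {
    Gelu,
    Exp,
    Cos,
}

impl UnaryOp {
    /// Function name to emit in the shader. `gelu` is assumed to be a helper
    /// defined elsewhere in the generated WGSL; `exp` and `cos` are builtins.
    fn wgsl_name(self) -> &'static str {
        match self {
            UnaryOp::Gelu => "gelu",
            UnaryOp::Exp => "exp",
            UnaryOp::Cos => "cos",
        }
    }
}

/// Wraps the `load` expression in each op, in application order
/// (the first op in the slice ends up innermost).
fn inline_ops(load: &str, ops: &[UnaryOp]) -> String {
    ops.iter().fold(load.to_string(), |expr, op| {
        format!("{}({})", op.wgsl_name(), expr)
    })
}

/// Builds the final store line of the permute shader with the ops inlined.
fn fused_store(ops: &[UnaryOp]) -> String {
    format!("Y[dst_offset] = {};", inline_ops("X[src_offset]", ops))
}

fn main() {
    // gelu is applied first, then exp, then cos.
    println!("{}", fused_store(&[UnaryOp::Gelu, UnaryOp::Exp, UnaryOp::Cos]));
}
```

With an empty op list this falls back to the plain permute store, so the same generator covers both the fused and unfused cases.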

Sharing my thoughts from our conversation:

  • you'll want to introduce an IR that keeps track of the size of each tensor and the "type" of each operation
  • you can coalesce operations with the same "type" - for the example you've given, you have elementwise operations of cos / exp / gelu - you can bundle these into a single node
  • for this, runtime code generation will be needed for each IR node, as you will no longer know ahead of time what your final execution environment will look like
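
A rough sketch of the first two points, in Rust: each IR node records its output shape and an operation "type", and a pass bundles adjacent elementwise nodes of the same shape. The `OpKind`, `Node`, `FusedNode` and `coalesce` names are made up for illustration, and the pass only handles a linear chain rather than a general graph.

```rust
/// Hypothetical operation "type" tag; a real IR would likely have more kinds.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OpKind {
    Elementwise,
    Permute,
}

/// One IR node: op name, op type, and the shape of its output tensor.
#[derive(Clone, Debug, PartialEq)]
struct Node {
    name: &'static str,
    kind: OpKind,
    shape: Vec<usize>,
}

/// A bundle of one or more ops that would be emitted as a single kernel.
#[derive(Debug, PartialEq)]
struct FusedNode {
    ops: Vec<&'static str>,
    kind: OpKind,
    shape: Vec<usize>,
}

/// Merges runs of adjacent elementwise nodes with identical shapes.
/// Assumes `chain` is a linear sequence where each node feeds the next.
fn coalesce(chain: &[Node]) -> Vec<FusedNode> {
    let mut out: Vec<FusedNode> = Vec::new();
    for node in chain {
        let can_merge = out.last().map_or(false, |last| {
            last.kind == OpKind::Elementwise
                && node.kind == OpKind::Elementwise
                && last.shape == node.shape
        });
        if can_merge {
            out.last_mut().unwrap().ops.push(node.name);
        } else {
            out.push(FusedNode {
                ops: vec![node.name],
                kind: node.kind,
                shape: node.shape.clone(),
            });
        }
    }
    out
}

/// The chain from the issue: permute followed by gelu, exp, cos.
fn demo_chain() -> Vec<Node> {
    let shape = vec![4, 8];
    let node = |name, kind| Node { name, kind, shape: shape.clone() };
    vec![
        node("permute", OpKind::Permute),
        node("gelu", OpKind::Elementwise),
        node("exp", OpKind::Elementwise),
        node("cos", OpKind::Elementwise),
    ]
}

fn main() {
    for fused in coalesce(&demo_chain()) {
        println!("{:?} {:?} {:?}", fused.kind, fused.ops, fused.shape);
    }
}
```

This version keeps permute as its own node and only bundles the same-type elementwise run; folding that bundle into the permute kernel, as in the issue's example, would be a further step on top of it.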