Operator Fusion
FL33TW00D commented
Crucial and ties into Code Generation.
The above graph demonstrates the success of our current inplacing algorithm.
However, we need to take this a step further and go from Inplacing to Inlining.
fn main(...) {
    let x_offset = group_id.x * 64u;
    var dst_offset = (group_id.y * num_groups.x * 64u) + x_offset + local_index;

    // Convert 1D offset into 4D index
    let dst_index = offsetToNdIndex(dst_offset, metadata.dst_stride);

    var src_index = vec4<u32>(0u);
    src_index[metadata.perm[0]] = dst_index[0];
    src_index[metadata.perm[1]] = dst_index[1];
    src_index[metadata.perm[2]] = dst_index[2];
    src_index[metadata.perm[3]] = dst_index[3];

    // Convert 4D index into 1D offset
    let src_offset = ndIndexToOffset(src_index, metadata.src_offsets, metadata.src_stride);
    Y[dst_offset] = X[src_offset];
}
The above is our current permute shader. Instead of performing subsequent injective operations on the output buffer of permute, we could inline all of the injective operations like so:
fn main(...) {
    // omit
    Y[dst_offset] = cos(exp(gelu(X[src_offset])));
}
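This (contrived) example would collapse the permute and all three follow-on operations into a single node, so no intermediate buffers or extra dispatches are needed between them. Getting this right is super important.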
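One way to produce a store line like that is to build the nested expression as a string at runtime. This is only a minimal sketch in Python, not the project's actual generator: `fused_store` is a hypothetical helper, and it assumes every op name maps to a WGSL function taking and returning a single value (`gelu` is not a WGSL built-in, so it would have to be defined elsewhere in the shader).

```python
def fused_store(ops, src="X[src_offset]", dst="Y[dst_offset]"):
    """Build the WGSL store statement for a chain of unary ops.

    `ops` is given in application order: the first op is applied first,
    so it ends up innermost in the generated expression.
    (Hypothetical helper, for illustration only.)
    """
    expr = src
    for op in ops:
        expr = f"{op}({expr})"
    return f"{dst} = {expr};"


print(fused_store(["gelu", "exp", "cos"]))
# prints: Y[dst_offset] = cos(exp(gelu(X[src_offset])));
```

With an empty op list the helper falls back to the plain copy, `Y[dst_offset] = X[src_offset];`, which matches the last line of the unfused permute shader.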
philpax commented
Sharing my thoughts from our conversation:
- you'll want to introduce an IR that keeps track of the size of each tensor and the "type" of each operation
- you can coalesce operations with the same "type" - for the example you've given, you have elementwise operations of cos / exp / gelu - you can bundle these into a single node
- for this, runtime code generation will be needed for each IR node, as you will no longer know ahead of time what your final execution environment will look like
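The three points above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Node` class, the `kind` labels ("injective", "elementwise", "reduce"), and the restriction to a linear chain of ops are not taken from the project. The sketch tracks each node's output shape and kind, folds elementwise nodes into the node before them when the shapes match, and then emits the fused WGSL expression at runtime.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str    # op name, e.g. "permute" or "cos"
    kind: str    # hypothetical op "type": "injective", "elementwise", "reduce"
    shape: tuple  # shape of the tensor this node produces
    fused: list = field(default_factory=list)  # elementwise ops folded into this node


def coalesce(nodes):
    """Fold elementwise nodes into the preceding node when shapes match.

    Assumes `nodes` is a linear chain in execution order (no branching).
    Returns new nodes; the input list is left untouched.
    """
    out = []
    for node in nodes:
        prev = out[-1] if out else None
        if (prev is not None
                and node.kind == "elementwise"
                and prev.kind in ("injective", "elementwise")
                and prev.shape == node.shape):
            prev.fused.append(node.name)
        else:
            out.append(Node(node.name, node.kind, node.shape, list(node.fused)))
    return out


def emit_expr(node, load="X[src_offset]"):
    """Generate the WGSL expression applied to the loaded value for one fused node."""
    ops = list(node.fused)
    if node.kind == "elementwise":
        ops.insert(0, node.name)  # the node's own op is applied first
    expr = load
    for op in ops:
        expr = f"{op}({expr})"
    return expr


shape = (1, 8, 64, 64)
graph = [
    Node("permute", "injective", shape),
    Node("gelu", "elementwise", shape),
    Node("exp", "elementwise", shape),
    Node("cos", "elementwise", shape),
    Node("sum", "reduce", (1, 8, 64)),
]

fused_graph = coalesce(graph)
print([n.name for n in fused_graph])  # ['permute', 'sum']
print(emit_expr(fused_graph[0]))      # cos(exp(gelu(X[src_offset])))
```

Because the fused op list is only known after coalescing, the shader text for each node has to be generated at that point rather than written ahead of time, which is the reason for runtime code generation given in the last bullet.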