Examples of manually fused kernels with nvfuser, currently only fused multihead attention forward pass. Build python setup.py install Test python test.py See perf python perf_measurement.py See generated kernel PYTORCH_NVFUSER_DUMP=cuda_kernel python test.py