xrsrke/pipegoose

Port CUDA Kernels

xrsrke opened this issue · 6 comments

xrsrke commented

Port training CUDA kernels from these libraries, and automatically replace modules in an existing 🤗 transformers model with their corresponding CUDA kernel versions.
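A minimal sketch of what the automatic module replacement could look like. The function name `replace_modules` and the `replacements` mapping are hypothetical, not an existing pipegoose API. The code only relies on `named_children()` and attribute assignment, both of which `torch.nn.Module` provides, so it does not import torch itself.

```python
from typing import Any, Callable, Dict


def replace_modules(model: Any, replacements: Dict[type, Callable[[Any], Any]]) -> Any:
    """Recursively swap child modules whose exact type has a fused replacement.

    `replacements` maps an original module class to a factory that receives
    the old module and returns its fused counterpart.
    """
    # list(...) so the children are snapshotted before any of them is replaced
    for name, child in list(model.named_children()):
        factory = replacements.get(type(child))
        if factory is not None:
            # Assigning the attribute replaces the registered child in place
            setattr(model, name, factory(child))
        else:
            # No direct replacement: look for replaceable modules further down
            replace_modules(child, replacements)
    return model
```

With a real model this might be called as `replace_modules(model, {torch.nn.Softmax: lambda m: FusedSoftmax(dim=m.dim)})`, where `FusedSoftmax` stands in for whatever fused module ends up being ported.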

Check out the following open source projects, and propose which CUDA kernels we should port. Then write a kernel builder which takes a kernel name, and loads it.

Implementation

class KernelBuilder:
    def load(self):
        pass


class FusedOp(KernelBuilder):
    def absolute_name(self):
        # NOTE: the absolute path to the kernel
        pass

fused_op = FusedOp().load()
outputs = fused_op(inputs)
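One way the skeleton above could be fleshed out, as a sketch only. It assumes kernels are compiled just-in-time with `torch.utils.cpp_extension.load`; the `BUILDERS` registry, the `register` decorator, and `load_kernel` are hypothetical names added so that a kernel can be looked up by name, as the issue asks.

```python
from typing import Any, Dict, Type


class KernelBuilder:
    """Compile a kernel on first use and reuse the result afterwards."""

    _loaded: Dict[str, Any] = {}  # process-wide cache, keyed by kernel name
    name: str = ""

    def absolute_name(self) -> str:
        """Absolute path to the kernel source file."""
        raise NotImplementedError

    def _build(self) -> Any:
        # Imported lazily so this file can be imported without torch installed
        from torch.utils.cpp_extension import load

        return load(name=self.name, sources=[self.absolute_name()])

    def load(self) -> Any:
        # Build only once per kernel name, then serve the cached object
        if self.name not in self._loaded:
            self._loaded[self.name] = self._build()
        return self._loaded[self.name]


BUILDERS: Dict[str, Type[KernelBuilder]] = {}


def register(cls: Type[KernelBuilder]) -> Type[KernelBuilder]:
    """Class decorator that makes a builder discoverable by its kernel name."""
    BUILDERS[cls.name] = cls
    return cls


def load_kernel(name: str) -> Any:
    """Take a kernel name and return the loaded kernel."""
    return BUILDERS[name]().load()
```

Note that `torch.utils.cpp_extension.load` returns a Python extension module, so with this approach the call site would look more like `load_kernel("fused_softmax").forward(inputs)`, assuming the C++ source binds a function named `forward`.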

APIs

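import torch
import torch.nn.functional as F
from pipegoose.nn.fusion import softmax

# `==` on tensors is element-wise and cannot be asserted directly,
# so compare numerically within a tolerance instead
assert torch.allclose(softmax(x, dim=-1), F.softmax(x, dim=-1))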

TODOs

Hi! I was thinking of porting substation after reading the paper, but I have a question about how you want the CUDA kernels integrated. Substation optimizes by "generating" CUDA files specific to the dimensions of each kernel (from what I understand from looking at https://github.com/spcl/substation/blob/master/pytorch_module/test_softmax.py). So basically it produces CUDA code specialized for each function, which is faster but may be messier.

Bitsandbytes does it by just loading the .so files from a given location, and DeepSpeed does it by, I think, building the csrc when doing pip install. So these are a bit slower but more general.
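For comparison, the "load a prebuilt .so from a given location" approach can be sketched with the standard `ctypes` module. This is only an illustration of the idea; the helper name `load_prebuilt` is hypothetical and is not taken from any of the libraries mentioned.

```python
import ctypes
from pathlib import Path
from typing import Optional


def load_prebuilt(path: str) -> Optional[ctypes.CDLL]:
    """Return a handle to a precompiled shared library, or None if it is absent."""
    lib_path = Path(path)
    if not lib_path.is_file():
        # Nothing was shipped for this platform; the caller can fall back
        # to the plain PyTorch implementation of the op
        return None
    return ctypes.CDLL(str(lib_path))
```

The trade-off is that nothing is compiled on the user's machine, but a matching binary has to be shipped for every supported platform and CUDA version.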

There might be more ways to do it but which way do you think will work the best for you?

Just checked colossalai; it seems they have functions called op_builders that they use to build certain CUDA libraries.

xrsrke commented

@isamu-isozaki This is a good idea. What are the pros and cons of substation? What do you think we should use? If there are some operations that substation is really good at, we could do both substation and manually port kernels.

Also, for manually porting kernels, I think we should do something like this:

class KernelBuilder:
    def load(self):
        pass


class FusedOp(KernelBuilder):
    def absolute_name(self):
        # NOTE: the absolute path to the kernel
        pass

fused_op = FusedOp().load()
outputs = fused_op(inputs)

@xrsrke I think substation's method is faster in general, but it needs you to generate a new CUDA file for each possible tensor shape. So the main disadvantage is that it's not clean, I think (my guess is that just changing the batch size would need a new CUDA script if we were to just copy it).
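To make the shape-specialization trade-off concrete, here is a toy sketch of per-shape source generation. It is not substation's actual generator; the template, the naive one-thread-per-row kernel, and the function name `generate_softmax_source` are all assumptions made for illustration.

```python
from functools import lru_cache
from string import Template

# Toy CUDA template: the tensor shape is baked into the source as constants
_SOFTMAX_TEMPLATE = Template(r"""
extern "C" __global__ void softmax_${rows}x${cols}(const float* x, float* y) {
    const int ROWS = $rows;
    const int COLS = $cols;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= ROWS) return;
    const float* in = x + row * COLS;
    float* out = y + row * COLS;
    float m = in[0];
    for (int i = 1; i < COLS; ++i) m = fmaxf(m, in[i]);
    float s = 0.f;
    for (int i = 0; i < COLS; ++i) { out[i] = expf(in[i] - m); s += out[i]; }
    for (int i = 0; i < COLS; ++i) out[i] /= s;
}
""")


@lru_cache(maxsize=None)
def generate_softmax_source(rows: int, cols: int) -> str:
    """Return CUDA source specialized for one (rows, cols) shape, cached per shape."""
    return _SOFTMAX_TEMPLATE.substitute(rows=rows, cols=cols)
```

Calling `generate_softmax_source(8, 128)` and `generate_softmax_source(16, 128)` yields two different sources, so changing only the batch size already means generating and compiling a second kernel.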

I think the way you are doing it is similar to colossalai's and deepspeed's versions, which we can definitely do. I do remember that setting up colossalai is pretty troublesome compared to, say, deepspeed. I'm not sure why, but we can probably cross that bridge when we get there. I think this approach is more general but might be slightly slower than, say, substation. I think we can start with this approach and, if we want, extend to substation and build kernels specific to a certain input dimension.

Do you think this makes sense? I can check out megatron-lm's way etc. if you want.

xrsrke commented

@isamu-isozaki Could you try to benchmark the two approaches? Try fusing a softmax using substation, then compare the two approaches. (Also, it could be that some operations are performed better by substation, while others are more efficient when written manually. We should take this into account while benchmarking.) Or maybe we should set this aside as an experimental feature for later and, for now, just port these kernels.
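A small timing harness that such a benchmark could start from, as a sketch. The function name `benchmark` and its parameters are hypothetical. The `sync` hook is there because CUDA kernels launch asynchronously, so on GPU something like `torch.cuda.synchronize` would need to be passed in for the timings to be meaningful.

```python
import statistics
import time
from typing import Any, Callable, Optional


def benchmark(
    fn: Callable[..., Any],
    *args: Any,
    warmup: int = 10,
    iters: int = 100,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the median wall-clock seconds per call of fn(*args)."""

    def _sync() -> None:
        if sync is not None:
            sync()

    # Warm-up runs absorb one-off costs such as compilation and caching
    for _ in range(warmup):
        fn(*args)
    _sync()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        _sync()  # wait for the device before stopping the clock
        times.append(time.perf_counter() - start)
    return statistics.median(times)
```

Running it once per candidate on the same input, for example `benchmark(F.softmax, x, -1, sync=torch.cuda.synchronize)` against the fused version, and repeating over several shapes, would show whether the winner changes from one operation or shape to another.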

Also, I've just added GPT-NeoX's kernel to the issue above.