Xilinx/mlir-aie

Multiple kernels with MLIR Python SDK

Joel-De opened this issue · 8 comments

Not an issue but a question.

Is exposing multiple kernels to the C++ API supported by the MLIR Python SDK? I'm working with the NPU on a Phoenix processor and using some of the reference designs. In these examples, only a single kernel is ever called from the C++ API. What I'd like to know is whether it's possible to have two distinct kernels (say, one for MatMul and another for image edge detection) that can be called independently from the C++ API without needing to load a new .xclbin. If this is possible, can we configure the AIE tiles so that each kernel uses its own set of tiles, with the two configurations able to coexist?

Thanks

This is somewhat of a complicated question. The short answer is: we've never tried doing something like that, and we'd love to see what you come up with! The longer answer is that there are probably multiple ways to approach this problem. If you're loading a single xclbin, you could pass a parameter that affects the functionality and get the appearance of two distinct kernels that way. There are also some assumptions in the way we generate the XRT interface of the xclbin that could be generalized to expose two different kernels at the XRT level (along with functions with more arguments, etc.). Another way would be to have two separate .xclbins, each using a portion of the XDNA accelerator, so that both can be resident at the same time and, since they are independent, can be loaded and run simultaneously.
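A hedged sketch of the first (single-xclbin, mode-parameter) idea at the XRT level. The xclbin filename, kernel name, argument slots, and sizes are assumptions, not the actual mlir-aie interface, and the runtime instruction-sequence buffer that real mlir-aie designs pass is elided:

```cpp
// Hypothetical sketch: one xclbin, one XRT kernel, and a scalar "mode"
// argument selecting between two behaviors compiled into the same design.
// Kernel name, argument slots, and sizes are assumptions; the runtime
// instruction-sequence buffer used by real mlir-aie designs is elided.
#include <cstdint>
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

int main() {
  xrt::device device(0);                    // open NPU device 0
  xrt::xclbin xclbin("design.xclbin");      // assumed filename
  device.register_xclbin(xclbin);
  xrt::hw_context context(device, xclbin.get_uuid());
  xrt::kernel kernel(context, "MLIR_AIE");  // assumed kernel name

  constexpr size_t kBytes = 1024 * sizeof(float);
  xrt::bo in(device, kBytes, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(1));
  xrt::bo out(device, kBytes, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(2));

  // ... fill in.map<float*>() with input data ...
  in.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  // mode 0 -> e.g. the matmul path, mode 1 -> e.g. the edge-detect path;
  // the design itself must branch on this value.
  uint32_t mode = 0;
  auto run = kernel(mode, in, out);
  run.wait();

  out.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
  return 0;
}
```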

Thanks! I suppose I could narrow the scope of the question and give you an idea of what we're trying to accomplish with the AIEs. Essentially, we're trying to run a DL model whose weights exceed what we can theoretically fit in the AIE-ML memory tiles (on the order of several hundred MB). What we wanted to do was break this into smaller kernel execution calls and execute several kernels multiple times with different data (at the XRT level: allocate several buffers and call the kernels in a loop; note that we would need more than one kernel accessible to the XRT API). Is there a way to feed several hundred MB of data to the AIEs via a single kernel call? Or will the buffer overflow when we make a kernel call with a buffer size greater than what the AIE tiles can fit?

I haven't seen any examples of loading two .xclbin files simultaneously; is there a guide / reference design where that is done? From the looks of things, the XRT API was designed to support multiple kernels (primarily because it loads a kernel by name), but from the MLIR and MLIR Python SDK code it looks like only one can ever be produced (the name of the kernel looks to be set in stone to "mlir-aie" across all examples, and configuring it isn't clearly exposed in the user code).

Also, if we are locked to a single kernel per .xclbin, is this a limit on the mlir-aie side (this repo) or on the hardware/firmware end of things (the Linux driver for the NPU)?

Thanks!

Maybe a fundamental misconception: when you pass data to a kernel, you're passing a pointer to data in external DDR memory. The kernel is responsible for loading that data (maybe into memtiles, maybe not). So, yes, you can feed several hundred MB of data to the AIE in a single XRT call; there's no need for multiple .xclbins for that. The single kernel per .xclbin and the number of arguments to that kernel are largely limits of the current implementation in mlir-aie, not fundamental limits.
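A minimal sketch of this point, under assumed names: the host hands the kernel one large DDR allocation in a single XRT call, and the AIE design's DMAs are responsible for streaming slices of it through the memtiles/core tiles. (Real mlir-aie reference designs also pass a runtime instruction-sequence buffer, elided here for brevity.)

```cpp
// Sketch: one XRT call with a buffer far larger than on-chip AIE memory.
// Buffer size and group_id slot are illustrative assumptions.
#include <cstdint>
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

void run_once_with_large_input(xrt::device& device, xrt::kernel& kernel) {
  // ~512 MB: far larger than any AIE tile memory, but fine in DDR.
  constexpr size_t kBytes = 512ull * 1024 * 1024;
  xrt::bo weights(device, kBytes, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(1));

  // ... fill weights.map<uint8_t*>() with model data ...
  weights.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  // One call; the design streams the buffer in tiles via its DMAs.
  kernel(weights).wait();
}
```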

Ah I see, that clears up a lot, thanks! As an aside, is there any plan to increase the current argument limit in the near future?

@makslevental has been leading the charge there, but I'm not sure how far things are and how much they can be generalized.

Thanks for the info 😃, I'll mark this issue as closed. If there are any updates pertaining to the original issue, feel free to re-open or add comments below!


There are examples in the repo that do this already; here is one that writes out 10 args (writes only, not loads, but that's easy to flip). It goes through a compile/load/execute path that is not yet documented and should be considered alpha quality. Having given that disclaimer, I use this flow "every day" and it works for me; but of course I built it, so I'm familiar with most of the sharp edges 😄. If you're interested, take a look and give it a spin; I'm happy to extend/adapt/help you get your thing going. You can post questions/concerns here (but maybe not in follow-up issues yet, since it's alpha), or you can DM me on our Discord under the same handle.

Another simple approach is to pack all of the weights into a single buffer and use the DMA instruction sequence to feed those weights into your design at runtime.
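A hedged sketch of that packing step on the host side. The helper names, 64-byte alignment, and group_id slot below are assumptions for illustration, not an mlir-aie API; the DMA instruction sequence that consumes the offsets is design-specific and not shown:

```cpp
// Concatenate all weight tensors into one host-visible DDR buffer and
// record per-tensor byte offsets so the runtime DMA sequence can address
// each tensor by offset. Alignment and group_id slot are assumptions.
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"

struct PackedWeights {
  xrt::bo bo;
  std::vector<size_t> offsets;  // byte offset of each tensor in the buffer
};

PackedWeights pack_weights(xrt::device& device, xrt::kernel& kernel,
                           const std::vector<std::vector<uint8_t>>& tensors) {
  // Compute aligned offsets and the total size of the packed buffer.
  size_t total = 0;
  std::vector<size_t> offsets;
  for (const auto& t : tensors) {
    offsets.push_back(total);
    total += (t.size() + 63) & ~size_t{63};  // assumed 64-byte alignment
  }

  // One DDR allocation holding every tensor back to back.
  xrt::bo bo(device, total, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(1));
  auto* dst = bo.map<uint8_t*>();
  for (size_t i = 0; i < tensors.size(); ++i)
    std::memcpy(dst + offsets[i], tensors[i].data(), tensors[i].size());
  bo.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  return {std::move(bo), std::move(offsets)};
}
```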