Rust-GPU/rust-cuda

Add ahead-of-time compilation support to cuda builder

Opened this issue · 8 comments

From my knowledge/understanding, cuda_builder only supports JIT compilation. It would be beneficial for user adoption and performance for larger kernel sizes if we provided support for AOT compilation. Not sure if anything other than changes to cuda_builder would be necessary. Thanks @jorge-ortega for helping me flesh out this idea a little more.

This post from NVIDIA provides more context. Today, CudaBuilder uses the nvvm backend to compile crates to PTX. The host then loads the PTX and JIT-compiles it through the driver API. Either the backend or CudaBuilder could pass the generated PTX to ptxas to create AOT-compiled cubins that the driver could load directly.
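As a rough sketch of what the ptxas step could look like from a build script: this assumes `ptxas` from the CUDA toolkit is on `PATH`. The helper names `ptxas_args` and `compile_ptx_to_cubin` are hypothetical, as are the file names; none of this is an existing cuda_builder API.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build the argument list for `ptxas` (hypothetical helper).
/// `arch` is a real-GPU architecture name such as "sm_80".
fn ptxas_args(ptx: &Path, cubin: &Path, arch: &str) -> Vec<String> {
    vec![
        format!("-arch={arch}"),
        ptx.display().to_string(),
        "-o".to_string(),
        cubin.display().to_string(),
    ]
}

/// Run `ptxas` on an already generated PTX file and write a cubin.
/// Returns an error if `ptxas` cannot be started or exits unsuccessfully.
fn compile_ptx_to_cubin(ptx: &Path, cubin: &Path, arch: &str) -> io::Result<()> {
    let status = Command::new("ptxas")
        .args(ptxas_args(ptx, cubin, arch))
        .status()?;
    if status.success() {
        Ok(())
    } else {
        Err(io::Error::new(
            io::ErrorKind::Other,
            format!("ptxas exited with {status}"),
        ))
    }
}
```

A build script could call `compile_ptx_to_cubin` once per target architecture after the PTX has been written. Where this step should live (backend or CudaBuilder) is exactly the open question above.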

I think it would be more idiomatic to treat this as a different target or a feature of the target, like target-cpu=native. Thoughts?

I personally think it's just one part of the pipeline: Rust -> PTX -> fatbin. Maybe we should support a more complete pipeline and let the user decide how far through it the builder should build?
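To make the "build until" idea concrete, here is a minimal sketch. The `Stage` enum and `stages_until` function are hypothetical names for illustration only; cuda_builder does not expose anything like this today.

```rust
/// Pipeline stages in build order (hypothetical; not a cuda_builder type).
/// Deriving `Ord` makes the declaration order the comparison order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Stage {
    Ptx,
    Cubin,
    Fatbin,
}

/// Return every stage the builder would run, up to and including `last`.
fn stages_until(last: Stage) -> Vec<Stage> {
    [Stage::Ptx, Stage::Cubin, Stage::Fatbin]
        .iter()
        .copied()
        .filter(|stage| *stage <= last)
        .collect()
}
```

With this shape, today's behavior would correspond to `stages_until(Stage::Ptx)`, and the AOT request in this issue to `stages_until(Stage::Cubin)`.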

I agree that this is more of a build pipeline option. Our current pipeline is disjointed, and users have to glue the PTX into their host binaries themselves. We should aim to get fatbins embedded into the final host binary, to match what nvcc does. That's different from what's being asked here, but this change does get us a step closer to it.

Sure, but I'm thinking for future integration in rustc...I actually think this maps pretty close to crate_type!

Possibly useful techniques: https://github.com/calebzulawski/multiversion

For large language model optimizations, many kernels are specialized for a specific NVIDIA card, and CPU-side code selects among them based on the user input and the card in use. multiversion is probably useful but not flexible enough. In any case, people can fall back to the default match case to take back control and flexibility.
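A minimal sketch of that kind of host-side selection, assuming the device's compute capability has already been queried through the driver API. `KernelVariant` and `select_variant` are hypothetical names. The sketch relies on cubins being compatible within a major compute capability at the same or a higher minor version, with PTX as the JIT fallback.

```rust
/// Kernel builds shipped with the host binary (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelVariant {
    /// AOT-compiled cubin for sm_80; usable on any 8.x device.
    Sm80,
    /// AOT-compiled cubin for sm_90; usable on any 9.x device.
    Sm90,
    /// PTX left for the driver to JIT-compile.
    GenericPtx,
}

/// Pick a kernel build from the device's compute capability (major, minor).
fn select_variant(major: u32, _minor: u32) -> KernelVariant {
    match major {
        9 => KernelVariant::Sm90,
        8 => KernelVariant::Sm80,
        // Default arm: no specialized cubin, so fall back to PTX.
        _ => KernelVariant::GenericPtx,
    }
}
```

The default arm is the "take back control" escape hatch: anything without a specialized build still runs through the existing JIT path.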

Right, that is why I said "techniques" rather than saying it is useful on its own 😁.

I think it is most idiomatic to use crate_type for JIT vs. AOT and target features for device-specific features (including target-gpu=native), similar to what rustc uses for CPUs (target-cpu) and what rust-gpu uses for capabilities and extensions (https://github.com/Rust-GPU/rust-gpu/blob/698f10ac14b7c952394ac5620004e4e973308902/crates/spirv-std/src/arch.rs#L151).