C++ interop
mratsim opened this issue · 3 comments
Not sure what Calypso is, but I have no problem doing C++/CUDA integration in Nim.
Example: applying an element-wise operation on two tensors, dest and src.
# Assignment op
# Does element-wise A[i] `op=` B[i]
template cuda_assign_op(op_name, op_symbol: string) =
  {.emit: ["""
  template<typename T>
  struct """, op_name, """{
    __device__ __forceinline__ void operator()(
        T * __restrict__ dst,
        const T * __restrict__ src){
      *dst """, op_symbol, """ __ldg(src);
    }
  };
  """].}
cuda_assign_op("CopyOp", "=")
cuda_assign_op("mAddOp", "+=")
cuda_assign_op("mSubOp", "-=")
cuda_assign_op("mMulOp", "*=")
cuda_assign_op("mDivOp", "/=")
Applying the operation to a whole tensor, via a CUDA C++ higher-order function:
{.emit: ["""
template<typename T, typename Op>
__global__ void cuda_apply2(const int rank,
                            const int len,
                            const int * __restrict__ dst_shape,
                            const int * __restrict__ dst_strides,
                            const int dst_offset,
                            T * __restrict__ dst_data,
                            Op f,
                            const int * __restrict__ src_shape,
                            const int * __restrict__ src_strides,
                            const int src_offset,
                            const T * __restrict__ src_data){
  for (int elemID = blockIdx.x * blockDim.x + threadIdx.x;
       elemID < len;
       elemID += blockDim.x * gridDim.x) {
    // ## we can't instantiate the variable outside the loop
    // ## each thread will store its own copy in parallel
    const int dst_real_idx = cuda_getIndexOfElementID(
      rank,
      dst_shape,
      dst_strides,
      dst_offset,
      elemID);
    const int src_real_idx = cuda_getIndexOfElementID(
      rank,
      src_shape,
      src_strides,
      src_offset,
      elemID);
    f(&dst_data[dst_real_idx], &src_data[src_real_idx]);
  }
}
"""].}
Generate the Nim <-> C++ bindings
template cuda_assign_binding(kernel_name: string, binding_name: untyped) =
  # Generate a Nim proc that wraps the C++/CUDA kernel proc
  const import_string: string = kernel_name & "<'*8>(@)"
  # We pass the type of the 8th parameter to the C++ template.
  # The "*" in '*8 is needed to strip the pointer from that type.
  # We create a new identifier on the fly with backticks.
  proc `binding_name`[T: SomeReal](
    blocksPerGrid, threadsPerBlock: cint,
    rank, len: cint,
    dst_shape, dst_strides: ptr cint, dst_offset: cint, dst_data: ptr T,
    src_shape, src_strides: ptr cint, src_offset: cint, src_data: ptr T
  ) {.importcpp: import_string, noSideEffect.}
template cuda_assign_glue*(
    kernel_name, op_name: string, binding_name: untyped): untyped =
  # Input:
  #   - kernel_name and the CUDA function object op_name
  # Result:
  #   - An auto-generated CUDA kernel based on the function object
  #   - A binding `binding_name` that can be called directly
  #     or through the convenience template ``cuda_assign_call``
  {.emit: ["""
  template<typename T>
  inline void """, kernel_name, """(
      const int blocksPerGrid, const int threadsPerBlock,
      const int rank,
      const int len,
      const int * __restrict__ dst_shape,
      const int * __restrict__ dst_strides,
      const int dst_offset,
      T * __restrict__ dst_data,
      const int * __restrict__ src_shape,
      const int * __restrict__ src_strides,
      const int src_offset,
      const T * __restrict__ src_data){
    cuda_apply2<<<blocksPerGrid, threadsPerBlock>>>(
      rank, len,
      dst_shape, dst_strides, dst_offset, dst_data,
      """, op_name, """<T>(),
      src_shape, src_strides, src_offset, src_data
    );
  }
  """].}
  cuda_assign_binding(kernel_name, binding_name)
template cuda_assign_call*[T: SomeReal](
    kernel_name: untyped, destination: var CudaTensor[T], source: CudaTensor[T]): untyped =
  ## Does the heavy lifting to format the tensors for the CUDA call.
  #
  # TODO: why doesn't this template work with "cudaLaunchKernel" instead
  # of the triple-chevron notation kernel<<<blocksPerGrid, threadsPerBlock>>>(params)?
  # That would avoid an intermediate function call.
  let dst = layoutOnDevice destination
  let src = layoutOnDevice source

  kernel_name[T](
    CUDA_HOF_TPB, CUDA_HOF_BPG,
    src.rank, dst.len, # Note: small shortcut, in this case len and size are the same
    dst.shape[], dst.strides[],
    dst.offset, dst.data,
    src.shape[], src.strides[],
    src.offset, src.data
  )
Usage:
cuda_assign_glue("cuda_mAdd", "mAddOp", cuda_mAdd)
proc `+=`*[T: SomeReal](a: var CudaTensor[T], b: CudaTensor[T]) =
  ## CudaTensor in-place addition
  when compileOption("boundChecks"):
    check_elementwise(a,b)
  cuda_assign_call(cuda_mAdd, a, b)

cuda_assign_glue("cuda_mSub", "mSubOp", cuda_mSub)
proc `-=`*[T: SomeReal](a: var CudaTensor[T], b: CudaTensor[T]) =
  ## CudaTensor in-place subtraction
  when compileOption("boundChecks"):
    check_elementwise(a,b)
  cuda_assign_call(cuda_mSub, a, b)
...
@mratsim the author probably meant that both D (with this special compiler) and Nim can use C++ directly
Comment from @Laeeth which I'm moving here:
I guess you should mention extern(C++). You can use dpp to #include C++ headers if you blacklist STL and Boost types, and it will work for quite a lot. It's still not complete, but it keeps getting better, and in a year or two the STL should be fine. It's used in production, but it's not something I would say is easy for just anyone to use and expect to work immediately. Unlike C++, C mostly just works.
Can you do that now using Nim? Just include C++ headers and use them?
How about generating C++ headers from Nim code? We do that in D, but the available options aren't yet polished.
PRs are welcome to update this section (especially with the latest D work on extern(C++)), but it deserves a whole entry on its own; this is something I've been very interested in myself, and for which I contributed both in D (see https://github.com/Syniurge/Calypso) and in Nim (see https://github.com/nimterop/nimterop: Nimterop is a Nim package that aims to make C/C++ interop seamless).
The answer is a lot more complicated and more nuanced than with C (#30).
For example, you can do the same as for #30 but with C++ (importcpp and emit still apply to C++):
# search for other such examples in Nim repo
proc make_unique_str(a: cstring): stdUniquePtr[stdString] {.importcpp: "std::make_unique<std::string>(#)", header: "<string>".}
proc foobar(a: cint) {.importcpp.} # untested, but something like that works; at the very least by providing the mangled name
{.emit:"""
#include <stdio.h>
template<typename T> void foobar(T a){
printf("in foobar\n");
}
template void foobar<int>(int a); // explicit instantiation required otherwise link error
""".}
However, this won't work with templates (unless they are explicitly instantiated), and it still requires writing the declarations manually; this is what nimterop (and, to a lesser extent, the "official" c2nim) aim to fix: wrapper-free interop. Templates (that aren't instantiated) aren't yet supported in nimterop. My understanding is this also isn't supported in D (unless you use Calypso).