C++ interop
mratsim opened this issue · 3 comments
Not sure what Calypso is, but I have no problem doing C++/CUDA integration in Nim.
Example: applying an element-wise operation on two tensors, dest and src.
# Assignment op
# Does element-wise A[i] `op=` B[i]
template cuda_assign_op(op_name, op_symbol: string) =
  {.emit: ["""
  template<typename T>
  struct """, op_name, """{
    __device__ __forceinline__ void operator()(
        T * __restrict__ dst,
        const T * __restrict__ src){
      *dst """, op_symbol, """ __ldg(src);
    }
  };
  """].}
cuda_assign_op("CopyOp", "=")
cuda_assign_op("mAddOp", "+=")
cuda_assign_op("mSubOp", "-=")
cuda_assign_op("mMulOp", "*=")
cuda_assign_op("mDivOp", "/=")
Applying the operation to a whole tensor, via a CUDA C++ higher-order function:
{.emit: ["""
template<typename T, typename Op>
__global__ void cuda_apply2(const int rank,
                            const int len,
                            const int * __restrict__ dst_shape,
                            const int * __restrict__ dst_strides,
                            const int dst_offset,
                            T * __restrict__ dst_data,
                            Op f,
                            const int * __restrict__ src_shape,
                            const int * __restrict__ src_strides,
                            const int src_offset,
                            const T * __restrict__ src_data){
  for (int elemID = blockIdx.x * blockDim.x + threadIdx.x;
       elemID < len;
       elemID += blockDim.x * gridDim.x) {
    // ## we can't instantiate the variable outside the loop
    // ## each thread will store its own copy in parallel
    const int dst_real_idx = cuda_getIndexOfElementID(
      rank,
      dst_shape,
      dst_strides,
      dst_offset,
      elemID);
    const int src_real_idx = cuda_getIndexOfElementID(
      rank,
      src_shape,
      src_strides,
      src_offset,
      elemID);
    f(&dst_data[dst_real_idx], &src_data[src_real_idx]);
  }
}
"""].}
Generate the Nim <-> C++ bindings
template cuda_assign_binding(kernel_name: string, binding_name: untyped) =
  # Generate a Nim proc that wraps the C++/CUDA kernel proc
  const import_string: string = kernel_name & "<'*8>(@)"
  # We pass the type of the 8th parameter to the C++ template.
  # The "*" in '*8 is needed to strip the pointer from that type.
  # We create a new identifier on the fly with backticks.
  proc `binding_name`[T: SomeReal](
    blocksPerGrid, threadsPerBlock: cint,
    rank, len: cint,
    dst_shape, dst_strides: ptr cint, dst_offset: cint, dst_data: ptr T,
    src_shape, src_strides: ptr cint, src_offset: cint, src_data: ptr T
  ) {.importcpp: import_string, noSideEffect.}
template cuda_assign_glue*(
    kernel_name, op_name: string, binding_name: untyped): untyped =
  # Input:
  #   - kernel_name and the CUDA function object op_name
  # Result:
  #   - An auto-generated CUDA kernel based on the function object
  #   - A binding `binding_name` that can be called directly
  #     or through the convenience template ``cuda_assign_call``
  {.emit: ["""
  template<typename T>
  inline void """, kernel_name, """(
      const int blocksPerGrid, const int threadsPerBlock,
      const int rank,
      const int len,
      const int * __restrict__ dst_shape,
      const int * __restrict__ dst_strides,
      const int dst_offset,
      T * __restrict__ dst_data,
      const int * __restrict__ src_shape,
      const int * __restrict__ src_strides,
      const int src_offset,
      const T * __restrict__ src_data){
    cuda_apply2<<<blocksPerGrid, threadsPerBlock>>>(
      rank, len,
      dst_shape, dst_strides, dst_offset, dst_data,
      """, op_name, """<T>(),
      src_shape, src_strides, src_offset, src_data
    );
  }
  """].}
  cuda_assign_binding(kernel_name, binding_name)
template cuda_assign_call*[T: SomeReal](
    kernel_name: untyped, destination: var CudaTensor[T], source: CudaTensor[T]): untyped =
  ## Does the heavy lifting to format the tensors for the CUDA call.
  #
  # TODO: why doesn't this template work with "cudaLaunchKernel" instead
  # of the triple-chevron notation kernel<<<blocksPerGrid, threadsPerBlock>>>(params)?
  # That would avoid an intermediate function call.
  let dst = layoutOnDevice destination
  let src = layoutOnDevice source

  kernel_name[T](
    CUDA_HOF_TPB, CUDA_HOF_BPG,
    src.rank, dst.len, # Note: small shortcut, in this case len and size are the same
    dst.shape[], dst.strides[],
    dst.offset, dst.data,
    src.shape[], src.strides[],
    src.offset, src.data
  )
Usage:
cuda_assign_glue("cuda_mAdd", "mAddOp", cuda_mAdd)
proc `+=`*[T: SomeReal](a: var CudaTensor[T], b: CudaTensor[T]) =
  ## CudaTensor in-place addition
  when compileOption("boundChecks"):
    check_elementwise(a,b)
  cuda_assign_call(cuda_mAdd, a, b)

cuda_assign_glue("cuda_mSub", "mSubOp", cuda_mSub)
proc `-=`*[T: SomeReal](a: var CudaTensor[T], b: CudaTensor[T]) =
  ## CudaTensor in-place subtraction
  when compileOption("boundChecks"):
    check_elementwise(a,b)
  cuda_assign_call(cuda_mSub, a, b)
...
@mratsim the author probably meant that both D (with this special compiler) and Nim can use C++ directly
Comment from @Laeeth which I'm moving here:
I guess you should mention extern(C++). You can use dpp to #include C++ headers if you blacklist STL and Boost types, and it will work for quite a lot. It's still not complete, but it keeps getting better, and in a year or two the STL should be fine. It's used in production, but it's not something I would say is easy for just anyone to use and expect to work immediately. Unlike C++, C mostly just works.
Can you do that now using Nim? Just include C++ headers and use them?
How about generating C++ headers from Nim code? We do that in D, but the available options aren't yet polished.
PRs are welcome to update this section (especially with the latest D work on extern(C++)), but it deserves a whole entry on its own; this is something I've been very interested in myself, and for which I contributed both in D (see https://github.com/Syniurge/Calypso) and in Nim (see https://github.com/nimterop/nimterop: Nimterop is a Nim package that aims to make C/C++ interop seamless).
The answer is a lot more complicated and more nuanced than with C (#30).
For example, you can do the same as for #30 but with C++ (importcpp and emit still apply to C++):
# search for other such examples in Nim repo
proc make_unique_str(a: cstring): stdUniquePtr[stdString] {.importcpp: "std::make_unique<std::string>(#)", header: "<string>".}
proc foobar(a: cint) {.importcpp.} # untested, but something like that works; at the very least by providing the mangled name
{.emit:"""
#include <stdio.h>
template<typename T> void foobar(T a){
printf("in foobar\n");
}
template void foobar<int>(int a); // explicit instantiation required otherwise link error
""".}
However, this won't work with templates (unless they are explicitly instantiated), and it still requires writing the declarations manually; this is what nimterop (and, to a lesser extent, the "official" c2nim) aim to fix: wrapper-free interop. Templates (that aren't instantiated) aren't yet supported in nimterop. My understanding is this also isn't supported in D (unless you use Calypso).