NVIDIA/cuda-quantum

Misleading error (JIT compilation issue) for remote-mqpu backend when MPI plugin is not activated

bettinaheim opened this issue · 3 comments

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

In some cases, the execution on the remote-mqpu backend fails with a JIT error along the lines of
JIT session error: Symbols not found: [ _Unwind_Resume, _ZNSaIcED2Ev, ...]

The error is caused by the invokeWrappedKernel logic in /runtime/common/JIT.cpp. Specifically, I think we are running into something like this: https://stackoverflow.com/questions/57612173/llvm-jit-symbols-not-found
The _Unwind_Resume symbol comes from the GCC exception-handling runtime (libgcc_s / libgcc_eh.a), and mangled symbols such as _ZNSaIcED2Ev from the GNU C++ standard library. I double-checked that the produced executable itself (a.out) contains the _Unwind_Resume symbol, so I suspect these lines in particular are not working as expected:

  // Resolve symbols that are statically linked in the current process.
  llvm::orc::JITDylib &mainJD = jit->getMainJITDylib();
  mainJD.addGenerator(llvm::cantFail(
      llvm::orc::DynamicLibrarySearchGenerator::GetForCurrentProcess(
          dataLayout.getGlobalPrefix())));
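For reference, one possible (untested) way to make those runtime symbols visible to the JIT would be to also add search generators for the C++ and unwinder runtime shared libraries. This is only a sketch, not the existing CUDA Quantum code; the library names are assumptions and would need adjusting per platform, and it reuses the mainJD/dataLayout variables from the snippet above:

  // Sketch only, not existing CUDA Quantum code: in addition to the
  // in-process generator above, also search the C++ and unwinder runtime
  // shared libraries so that symbols such as _Unwind_Resume and
  // _ZNSaIcED2Ev can be resolved. Library names are assumptions.
  const char globalPrefix = dataLayout.getGlobalPrefix();
  for (const char *lib : {"libstdc++.so.6", "libgcc_s.so.1"}) {
    auto gen = llvm::orc::DynamicLibrarySearchGenerator::Load(lib, globalPrefix);
    if (gen)
      mainJD.addGenerator(std::move(*gen));
    else
      llvm::consumeError(gen.takeError()); // ignore if the library is absent
  }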

Steps to reproduce the bug

Minimal repro:
Download the latest version of the CUDA Quantum installer for C++, or build it from source. Then run

container=`docker run -itd --rm ubuntu:22.04`
docker cp install_cuda_quantum.$(uname -m) $container:/tmp/
docker cp docs/sphinx/examples/cpp/algorithms/amplitude_estimation.cpp $container:/tmp
docker attach $container
apt-get update && apt-get install -y --no-install-recommends \
            wget ca-certificates libstdc++-11-dev libopenmpi-dev
chmod +x /tmp/install_cuda_quantum.$(uname -m)
/tmp/install_cuda_quantum.$(uname -m) --accept && . /etc/profile

# Fails with the reported JIT exception during execution:
nvq++ --target remote-mqpu /tmp/amplitude_estimation.cpp && ./a.out
objdump -T a.out | grep -i unwind # shows _Unwind_Resume exists

# Works fine:
nvq++ --target remote-mqpu /tmp/amplitude_estimation.cpp --enable-mlir && ./a.out

The installer can be built from source by first building the cuda-quantum-assets image:
docker build -t cuda-quantum-assets:latest -f docker/build/assets.Dockerfile .
and then building the installer:
DOCKER_BUILDKIT=1 docker build -f docker/release/installer.Dockerfile --build-arg base_image=cuda-quantum-assets:latest . --output out

Expected behavior

The example should compile and run without error.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: 0.6.0
  • Python version: N/A
  • C++ compiler: nvq++, with libstdc++-11-dev installed separately
  • Operating system: Ubuntu 22.04, possibly other systems as well.

Suggestions

No response

Of course, the moment I actually write down the exact repro, it occurs to me that what is ultimately causing the issue is the missing MPI plugin:

Proceed as above, but then

 export MPI_PATH=/usr/lib/x86_64-linux-gnu/openmpi
 bash $CUDA_QUANTUM_PATH/distributed_interfaces/activate_custom_mpi.sh
 nvq++ --target remote-mqpu /tmp/amplitude_estimation.cpp && ./a.out # now works (though why is it printing the llvm::dbgs() messages? - that's not nice...)

Edit nr 2: I quickly checked whether I at least get a decent, comprehensible error when MPI is not installed at all. Unfortunately, the compilation succeeds and I get pretty much the same JIT error as above, which is not very comprehensible.

Options for resolution:

  1. Require MPI to be installed to use the remote-mqpu backend. In that case, we need to document this requirement and add a compilation check that gives a clear, comprehensible error along the lines of "This target requires MPI. Please install MPI and try again." when MPI is missing (see the sketch after this list).
  2. Not require MPI and do the same as we do for the nvidia-mqpu target. I think this is in principle what we already do, and I think it is the better option.
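If we went with option 1, a minimal sketch of such a check is below. It is purely illustrative: the helper name is hypothetical, and it assumes the plugin path is communicated via an environment variable as set up by activate_custom_mpi.sh; the actual convention would need to be confirmed against the code base.

// Hypothetical sketch of a startup check for option 1; this helper does not
// exist in the CUDA Quantum code base. The idea is simply to probe for the
// activated MPI communication plugin and fail with a clear, actionable
// message instead of a JIT symbol-resolution error.
#include <cstdlib>
#include <dlfcn.h>
#include <stdexcept>
#include <string>

static void checkMpiPluginAvailable() {
  // Assumed environment variable set by activate_custom_mpi.sh; adjust to
  // whatever convention the script actually uses.
  const char *pluginPath = std::getenv("CUDAQ_MPI_COMM_LIB");
  if (!pluginPath)
    throw std::runtime_error(
        "This target requires MPI. Please install MPI, activate the MPI "
        "plugin (see distributed_interfaces/activate_custom_mpi.sh), and "
        "try again.");
  void *handle = dlopen(pluginPath, RTLD_NOW | RTLD_GLOBAL);
  if (!handle)
    throw std::runtime_error(std::string("Failed to load the MPI plugin '") +
                             pluginPath + "': " + dlerror());
}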