NVIDIA/cuda-quantum

Unable to use all GPUs with Python and `nvidia-mgpu` target

bebora opened this issue · 2 comments

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

The documentation states that the nvidia-mgpu target can be used to distribute the state vector across the available GPUs. The state vector does seem to be distributed correctly when using CUDA-Q for C++; however, I'm unable to achieve the same distribution with CUDA-Q for Python.

Steps to reproduce the bug

I'm using the Singularity image on a machine with two Tesla V100S GPUs.
Create a cuda-quantum.def file with the following content:

Bootstrap: docker
From: nvcr.io/nvidia/quantum/cuda-quantum:0.7.0

%runscript
    mount devpts /dev/pts -t devpts
    cp -r /home/cudaq/* .
    bash

Build the image with singularity build --fakeroot cuda-quantum.sif cuda-quantum.def
Run the image with singularity shell --nv --no-mount hostfs ./cuda-quantum.sif
Create a file cuquantum_backends.py by copying the code from the Python examples (a sketch of it follows these steps).
Create a file cuquantum_backends.cpp by copying the code from the C++ examples.
Open another terminal to observe the output of watch -n 1 nvidia-smi.
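For reference, the Python example essentially builds and samples a GHZ-state kernel (the C++ example is the analogous ghz struct sampled with cudaq::sample). Below is a minimal sketch of cuquantum_backends.py using the builder API; the actual example in the repository may differ in its details.

import cudaq

# Build a GHZ-state kernel with the CUDA-Q builder API.
def ghz_state(qubit_count):
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(qubit_count)
    kernel.h(qubits[0])
    for i in range(qubit_count - 1):
        kernel.cx(qubits[i], qubits[i + 1])
    kernel.mz(qubits)
    return kernel

qubit_count = 30
kernel = ghz_state(qubit_count)
counts = cudaq.sample(kernel, shots_count=100)
print(counts)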

30 qubits

Edit the two files to use 30 qubits (qubit_count = 30 and auto counts = cudaq::sample(/*shots=*/100, ghz{}, 30);).

Compile and run the C++ version with the following commands:

nvq++ cuquantum_backends.cpp -o nvidia-mgpu.x --target nvidia-mgpu
mpirun -np 2 ./nvidia-mgpu.x

I can see one process per GPU, each peaking at about 8660 MiB of VRAM.

Run the Python version with the following command:

mpirun -np 2 python3 cuquantum_backends.py --target=nvidia-mgpu

I can see two processes, both running on GPU 0, each peaking at 8528 MiB of VRAM.

31 qubits

Edit the two files to use 31 qubits (qubit_count = 31 and auto counts = cudaq::sample(/*shots=*/100, ghz{}, 31);).

Compile and run the C++ version with the following commands:

nvq++ cuquantum_backends.cpp -o nvidia-mgpu.x --target nvidia-mgpu
mpirun -np 2 ./nvidia-mgpu.x

I can see one process per GPU, each peaking at about 16854 MiB of VRAM.

Run the Python version with the following command:

mpirun -np 2 python3 cuquantum_backends.py --target=nvidia-mgpu

One process runs on GPU 0, peaking at about 16720 MiB of VRAM. The other process is terminated with the following error: RuntimeError: [custatevec] %out of memory in addQubitsToState (line 210).

Expected behavior

I would expect the Python version to distribute the state vector among multiple GPUs, like the C++ version, rather than being limited to the memory of one GPU.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: 0.7.0 (also 0.7.1, with a slightly different memory consumption)
  • Python version: 3.10.12
  • Operating system: RHEL 9.3
  • nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:01:00.0 Off |                    0 |
| N/A   33C    P0             35W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100S-PCIE-32GB          Off |   00000000:C1:00.0 Off |                    0 |
| N/A   33C    P0             37W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Suggestions

No response

Hi @bebora - per the docs, the Python interface does not expect an = between --target and nvidia-mgpu. That is, I think you need to run it like this:

$ mpirun -np 2 python3 examples/python/cuquantum_backends.py --target nvidia-mgpu

Alternatively, you could try putting cudaq.set_target('nvidia-mgpu') directly in your Python code, too.
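For example, a minimal sketch (assuming the same GHZ script as above) that selects the target in code rather than on the command line:

import cudaq

# Select the multi-GPU state-vector backend programmatically; the script can
# then be launched as "mpirun -np 2 python3 cuquantum_backends.py" with no
# --target flag at all.
cudaq.set_target('nvidia-mgpu')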

That being said - we should probably make the Python command-line interface accept the same syntax as C++ for the target option.
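For what it's worth, argparse already accepts both spellings, so handling --target=nvidia-mgpu on the Python side could look roughly like the sketch below (purely an illustration, not the current CUDA-Q implementation):

import argparse
import cudaq

parser = argparse.ArgumentParser()
# argparse accepts both "--target nvidia-mgpu" and "--target=nvidia-mgpu".
parser.add_argument('--target', default=None)
args, _ = parser.parse_known_args()

if args.target is not None:
    cudaq.set_target(args.target)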

Hi @bmhowe23, that was indeed the problem. Thank you for the assistance.