NVIDIA/cuda-quantum

Unable to use all GPUs with Python and `nvidia-mgpu` target

bebora opened this issue · 2 comments

Required prerequisites

  • Consult the security policy. If reporting a security vulnerability, do not report the bug using this form. Use the process described in the policy to report the issue.
  • Make sure you've read the documentation. Your issue may be addressed there.
  • Search the issue tracker to verify that this hasn't already been reported. +1 or comment there if it has.
  • If possible, make a PR with a failing test to give us a starting point to work on!

Describe the bug

The documentation states that the nvidia-mgpu target can be used to distribute the state vector across the available GPUs. The state vector does seem to be distributed correctly when using CUDA-Q for C++; however, I'm unable to achieve the same distribution with CUDA-Q for Python.

Steps to reproduce the bug

I'm using the Singularity image on a machine with two Tesla V100S GPUs.
Create a cuda-quantum.def file with the following content:

Bootstrap: docker
From: nvcr.io/nvidia/quantum/cuda-quantum:0.7.0

%runscript
    mount devpts /dev/pts -t devpts
    cp -r /home/cudaq/* .
    bash

Build the image with singularity build --fakeroot cuda-quantum.sif cuda-quantum.def
Run the image with singularity shell --nv --no-mount hostfs ./cuda-quantum.sif
Create a file cuquantum_backends.py by copying the code from the Python examples (a sketch of it follows these steps).
Create a file cuquantum_backends.cpp by copying the code from the C++ examples.
Open another terminal to observe the output of watch -n 1 nvidia-smi.
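For reference, the Python example essentially builds and samples a GHZ-state kernel (the C++ example is the analogous ghz struct sampled with cudaq::sample). Below is a minimal sketch of cuquantum_backends.py using the builder API; the actual example in the repository may differ in its details.

import cudaq

# Build a GHZ-state kernel with the CUDA-Q builder API.
def ghz_state(qubit_count):
    kernel = cudaq.make_kernel()
    qubits = kernel.qalloc(qubit_count)
    kernel.h(qubits[0])
    for i in range(qubit_count - 1):
        kernel.cx(qubits[i], qubits[i + 1])
    kernel.mz(qubits)
    return kernel

qubit_count = 30
kernel = ghz_state(qubit_count)
counts = cudaq.sample(kernel, shots_count=100)
print(counts)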

30 qubits

Edit the two files to use 30 qubits (qubit_count = 30 and auto counts = cudaq::sample(/*shots=*/100, ghz{}, 30);).

Compile and run the C++ version with the following commands:

nvq++ cuquantum_backends.cpp -o nvidia-mgpu.x --target nvidia-mgpu
mpirun -np 2 ./nvidia-mgpu.x

I can see one process per GPU, each peaking at about 8660 MiB of VRAM.

Run the Python version with the following command:

mpirun -np 2 python3 cuquantum_backends.py --target=nvidia-mgpu

I can see two processes, both running on GPU 0, each peaking at 8528 MiB of VRAM.

31 qubits

Edit the two files to use 31 qubits (qubit_count = 31 and auto counts = cudaq::sample(/*shots=*/100, ghz{}, 31);).

Compile and run the C++ version with the following commands:

nvq++ cuquantum_backends.cpp -o nvidia-mgpu.x --target nvidia-mgpu
mpirun -np 2 ./nvidia-mgpu.x

I can see one process per GPU, each peaking at about 16854 MiB of VRAM.

Run the Python version with the following command:

mpirun -np 2 python3 cuquantum_backends.py --target=nvidia-mgpu

One process runs on GPU 0, peaking at about 16720 MiB of VRAM. The other process is terminated with the following error: RuntimeError: [custatevec] %out of memory in addQubitsToState (line 210).

Expected behavior

I would expect the Python version to distribute the state vector among multiple GPUs, like the C++ version, rather than being limited to the memory of one GPU.

Is this a regression? If it is, put the last known working version (or commit) here.

Not a regression

Environment

  • CUDA Quantum version: 0.7.0 (also 0.7.1, with a slightly different memory consumption)
  • Python version: 3.10.12
  • Operating system: RHEL 9.3
  • nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:01:00.0 Off |                    0 |
| N/A   33C    P0             35W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100S-PCIE-32GB          Off |   00000000:C1:00.0 Off |                    0 |
| N/A   33C    P0             37W /  250W |       0MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Suggestions

No response

Hi @bebora - per the docs, the Python interface does not expect an = between --target and nvidia-mgpu. That is, I think you need to run it like this:

$ mpirun -np 2 python3 examples/python/cuquantum_backends.py --target nvidia-mgpu

Alternatively, you could try putting cudaq.set_target('nvidia-mgpu') directly in your Python code, too.
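For example, a minimal sketch (assuming the same GHZ script as above) that selects the target in code rather than on the command line:

import cudaq

# Select the multi-GPU state-vector backend programmatically; the script can
# then be launched as "mpirun -np 2 python3 cuquantum_backends.py" with no
# --target flag at all.
cudaq.set_target('nvidia-mgpu')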

That being said - we should probably make the Python command-line interface accept the same syntax as C++ for the target option.
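For what it's worth, argparse already accepts both spellings, so handling --target=nvidia-mgpu on the Python side could look roughly like the sketch below (purely an illustration, not the current CUDA-Q implementation):

import argparse
import cudaq

parser = argparse.ArgumentParser()
# argparse accepts both "--target nvidia-mgpu" and "--target=nvidia-mgpu".
parser.add_argument('--target', default=None)
args, _ = parser.parse_known_args()

if args.target is not None:
    cudaq.set_target(args.target)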

Hi @bmhowe23, that was indeed the problem. Thank you for the assistance.