Process hangs on 'Setting up PyTorch plugin "bias_act_plugin"...' when using multiple GPUs
markemus opened this issue · 6 comments
I added these lines to train.py as lines 13 and 14 (right under import os):
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3,4"
I tested the process with --gpus 1 and it spent a few minutes on Setting up PyTorch plugin "bias_act_plugin"... but then proceeded to train. However, with --gpus 4 it has been hanging on this line for an hour and a half.
Creating output directory...
Launching processes...
Loading training set...
Num images: 505487
Image shape: [3, 256, 256]
Label shape: [0]
Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"...
Here's the nvidia-smi printout as well. As you can see, three of the cores (2,3,4) have 100% GPU utilization while the first core (0) has 0%. The memory usage does not seem to be changing.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04 Driver Version: 450.102.04 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:1A:00.0 Off | 0 |
| N/A 33C P0 57W / 300W | 2088MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:1B:00.0 Off | 0 |
| N/A 34C P0 59W / 300W | 31147MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:3D:00.0 Off | 0 |
| N/A 35C P0 68W / 300W | 4261MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:3E:00.0 Off | 0 |
| N/A 31C P0 68W / 300W | 4345MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:88:00.0 Off | 0 |
| N/A 33C P0 71W / 300W | 4201MiB / 32510MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Do I just need to be more patient? On one core it really only took a couple of minutes to begin training.
EDIT: note that the cores (0,2,3,4) are not consecutive.
No, it definitely shouldn't take long. About the same as on single core.
I'd try a couple of things if the problem persists:
- Set CUDA_VISIBLE_DEVICES in the shell before starting the Python process, just so that there isn't anything funky going on with multiprocessing.
- This could be a case of a stale multiprocess lock in ~/.cache/torch_extensions (default on Linux). Try rm -rf'ing the torch_extensions directory and rerun.
If you're running docker, you should NOT need CUDA_VISIBLE_DEVICES separately. I think it's enough to configure available devices using the --gpus parameter. Also I think within Docker, CUDA_VISIBLE_DEVICES might in fact need to be in consecutive order (or probably does not need to be specified within the container), and that you should specify the real device mapping when you start docker. I'm a bit on thin ice here with this as I haven't ever run using such a configuration.
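As a gentler alternative to deleting the whole torch_extensions directory, the stale-lock suggestion can be sketched in Python. This is a minimal sketch, not part of the repository: the helper names (extensions_root, find_lock_files, remove_lock_files) are hypothetical, and it assumes the lock is a file literally named lock inside each plugin's build directory, with the root taken from the TORCH_EXTENSIONS_DIR environment variable or, failing that, the Linux default mentioned above.

```python
from __future__ import annotations

import os
from pathlib import Path


def extensions_root() -> Path:
    """Return the directory assumed to hold PyTorch's compiled extensions.

    TORCH_EXTENSIONS_DIR takes precedence when set; otherwise this falls
    back to the Linux default mentioned above, ~/.cache/torch_extensions.
    """
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def find_lock_files(root: Path) -> list[Path]:
    """List every file named 'lock' below root (empty if root is missing)."""
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("lock") if p.is_file())


def remove_lock_files(root: Path) -> list[Path]:
    """Delete the lock files found below root and return the removed paths."""
    removed = []
    for lock in find_lock_files(root):
        lock.unlink()
        removed.append(lock)
    return removed


if __name__ == "__main__":
    # Dry run: only report what would be removed.
    for path in find_lock_files(extensions_root()):
        print(f"stale lock candidate: {path}")
```

Running the script only lists candidates. Calling remove_lock_files(extensions_root()) deletes them, which should only be done while no training or build process is running, since a live build holds the same lock.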
@nurpax Thank you! rm -rf ~/.cache/torch_extensions solved the issue and it's training now on 4 GPUs.
For posterity: I left the CUDA_VISIBLE_DEVICES definition in the code. This is not running in docker, it's running in Anaconda.
How do you do it on Windows? It used to work perfectly, but when I ran the project a few days later it got stuck.
I'm not sure what's the exact location and don't have Windows access right now. But here's how you should be able to figure it out:
Change torch_utils/custom_ops.py as follows:
diff --git a/torch_utils/custom_ops.py b/torch_utils/custom_ops.py
index 4cc4e43..4dfcef7 100755
--- a/torch_utils/custom_ops.py
+++ b/torch_utils/custom_ops.py
@@ -20,7 +20,7 @@ from torch.utils.file_baton import FileBaton
#----------------------------------------------------------------------------
# Global options.
-verbosity = 'brief' # Verbosity level: 'none', 'brief', 'full'
+verbosity = 'full' # Verbosity level: 'none', 'brief', 'full'
#----------------------------------------------------------------------------
# Internal helper funcs.
Then run for example generate.py with default options, and check the logs. On my computer, it prints something like this:
Using /scratch/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /scratch/.cache/torch_extensions/bias_act_plugin/build.ninja...
This should reveal the Windows location for you.
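If editing custom_ops.py is inconvenient, the extensions root can probably also be queried directly. The sketch below is an assumption-laden shortcut, not something the repository provides: describe_extensions_root is a hypothetical helper, and it relies on torch.utils.cpp_extension.get_default_build_root being available in the installed PyTorch version. Newer PyTorch versions may also place builds in a version-specific subdirectory below this root.

```python
import os


def describe_extensions_root() -> str:
    """Return the extensions root PyTorch is expected to use.

    TORCH_EXTENSIONS_DIR wins when set. Otherwise ask PyTorch for its
    default build root, if PyTorch (and that function) can be imported.
    """
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return override
    try:
        # Assumed to exist in the installed PyTorch version.
        from torch.utils.cpp_extension import get_default_build_root
    except ImportError:
        return "unknown (could not import PyTorch's build-root helper)"
    return get_default_build_root()


if __name__ == "__main__":
    print(describe_extensions_root())
```

Run it with the same Python interpreter and environment used for training, since the result depends on the environment variables and the PyTorch installation seen by that process.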
Thank you! The cache for Windows can be found in 'C:\Users\<user_name>\AppData\Local\torch_extensions\torch_extensions\Cache'. I was able to delete it but also had to reinstall ninja to build bias_act_plugin again. In the end, it worked.
Removing the stale lock file ~/.cache/torch_extensions/py310_cu121/bias_act_plugin/3cb576a0039689487cfba59279dd6d46-nvidia-geforce-rtx-2060/lock worked for me.