catboost/catboost

If multiple GPUs are present on the server and the devices parameter is set to a specific GPU, CatBoost allocates GPU memory on other GPUs

dremovd opened this issue · 1 comment

Problem:
If multiple GPUs are present on the server and the devices parameter is set to a specific GPU, CatBoost allocates GPU memory on other GPUs

catboost version:
1.2.5
Operating System:
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy

CPU:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-254
Off-line CPU(s) list: 255
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7713 64-Core Processor
GPU:
nvidia-smi
Wed Apr 24 11:25:21 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
| 90% 26C P8 27W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:23:00.0 Off | Off |
| 90% 25C P8 26W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 On | 00000000:41:00.0 Off | Off |
| 90% 26C P8 19W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 On | 00000000:61:00.0 Off | Off |
| 90% 25C P8 24W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce RTX 4090 On | 00000000:81:00.0 Off | Off |
| 90% 27C P8 28W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 On | 00000000:A1:00.0 Off | Off |
| 90% 27C P8 38W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 On | 00000000:C1:00.0 Off | Off |
| 90% 26C P8 33W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA GeForce RTX 4090 On | 00000000:E1:00.0 Off | Off |
| 90% 28C P8 32W / 400W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

catboost_params = {
    'iterations': 5000,
    'learning_rate': 0.02,
    'max_depth': 7,
    'random_state': 0,
    'task_type': 'GPU',
    'devices': '0',
    'gpu_ram_part': 0.85,
    'border_count': 64,
}

ek-ak commented

Hello!
Please try limiting the list of available devices with the CUDA_VISIBLE_DEVICES environment variable.
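
For illustration, here is a minimal sketch of that workaround from inside a Python script. The helper name `restrict_visible_gpus` is hypothetical (it is not part of catboost), and the sketch assumes the variable is set before any GPU work starts in the process, since CUDA reads it when the runtime initializes. The parameters are the ones from the report above. Note that once CUDA_VISIBLE_DEVICES is set, the visible GPUs are renumbered from 0, so `devices` refers to an index within the visible set.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Hypothetical helper: expose only the listed GPU indices to CUDA."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Set this before any GPU training starts in the process.
visible = restrict_visible_gpus([0])

catboost_params = {
    'iterations': 5000,
    'learning_rate': 0.02,
    'max_depth': 7,
    'random_state': 0,
    'task_type': 'GPU',
    'devices': '0',  # index within the visible set, not the physical index
    'gpu_ram_part': 0.85,
    'border_count': 64,
}

try:
    from catboost import CatBoostClassifier
    model = CatBoostClassifier(**catboost_params)
    # model.fit(X_train, y_train)  # X_train / y_train are placeholders
except ImportError:
    model = None  # catboost is not installed in this environment
```

The same effect can be had without touching the script by setting the variable in the shell that launches it. Whether this fully prevents memory from being allocated on the other GPUs has not been confirmed in this thread, so it is worth checking with nvidia-smi while training runs.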