mlco2/codecarbon

GPU not found error. pynvml.nvml.NVMLError_NotSupported: Not Supported

pankhuriverma opened this issue · 6 comments

  • CodeCarbon version: 2.3.4
  • Python version: 3.10.13
  • Operating System: Linux Ubuntu 22.04

I want to measure the CPU energy consumption of my Python code using CodeCarbon. My GPU is an NVIDIA GeForce MX250, which does not support energy monitoring. When I run the code, I get the error below because codecarbon is trying to query the GPU.
Screenshot from 2024-03-21 02-16-31

But when I run the same code in a Kaggle notebook with the same codecarbon and Python versions, it monitors only the CPU energy consumption when the GPU is disabled. Why is this the case?

This is the output from Kaggle.
Screenshot from 2024-03-21 02-20-47

This is the code that I am running.

import tensorflow as tf

from codecarbon import EmissionsTracker

# Load and normalize the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Simple dense classifier
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10),
    ]
)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

# Track energy and emissions around training only
tracker = EmissionsTracker(gpu_ids=[])
tracker.start()
model.fit(x_train, y_train, epochs=10)
emissions: float = tracker.stop()
print(emissions)

Could you please help me identify what the issue is?

Hello, thanks for using codecarbon!
Indeed, we should check this further; there should not be any difference.
Can you provide the full log of the error when you run it on your machine?

Hello @inimaz,

Thanks for your response. Below, you will find the complete error log.

2024-03-22 00:56:17.933449: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-22 00:56:19.940489: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/keras/src/layers/reshaping/flatten.py:37: UserWarning: Do not pass an input_shape/input_dim argument to a layer. When using Sequential models, prefer using an Input(shape) object as the first layer in the model instead.
super().__init__(**kwargs)
2024-03-22 00:56:23.694224: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.825903: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.826207: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.827145: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.827399: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.827656: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.928632: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.928913: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.929133: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-22 00:56:23.929549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1928] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1739 MB memory: -> device: 0, name: NVIDIA GeForce MX250, pci bus id: 0000:3c:00.0, compute capability: 6.1
[codecarbon INFO @ 00:56:24] [setup] RAM Tracking...
[codecarbon INFO @ 00:56:24] [setup] GPU Tracking...
[codecarbon INFO @ 00:56:24] Tracking Nvidia GPU via pynvml
Traceback (most recent call last):
  File "/home/pankhuri/PycharmProjects/ThesisProject/models/codecarbon_emission_test_nn_model.py", line 25, in <module>
    tracker = EmissionsTracker(gpu_ids=[])
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/emissions_tracker.py", line 284, in __init__
    gpu_devices = GPU.from_utils(self._gpu_ids)
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/external/hardware.py", line 120, in from_utils
    return cls(gpu_ids=gpu_ids)
  File "<string>", line 4, in __init__
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/external/hardware.py", line 62, in __post_init__
    self.devices = AllGPUDevices()
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 186, in __init__
    gpu_device = GPUDevice(handle=handle, gpu_index=i)
  File "<string>", line 8, in __init__
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 24, in __post_init__
    self.last_energy = self._get_energy_kwh()
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 28, in _get_energy_kwh
    return Energy.from_millijoules(self._get_total_energy_consumption())
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/codecarbon/core/gpu.py", line 95, in _get_total_energy_consumption
    return pynvml.nvmlDeviceGetTotalEnergyConsumption(self.handle)
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/pynvml/nvml.py", line 2411, in nvmlDeviceGetTotalEnergyConsumption
    _nvmlCheckReturn(ret)
  File "/home/pankhuri/thesis/env_thesis/venv/lib/python3.10/site-packages/pynvml/nvml.py", line 833, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported

Process finished with exit code 1
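
For reference, the failing call at the bottom of the traceback can be reproduced directly with pynvml, independent of codecarbon. A minimal sketch, assuming pynvml is installed and device 0 is the MX250:

import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    print(pynvml.nvmlDeviceGetName(handle))
    # The MX250 does not expose an energy counter, so this raises
    # pynvml.nvml.NVMLError_NotSupported, exactly as in the traceback above.
    print(pynvml.nvmlDeviceGetTotalEnergyConsumption(handle))
finally:
    pynvml.nvmlShutdown()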

Thanks! I am checking this further. Apparently gpu_ids needs to be a string with all the IDs comma-separated:

gpu_ids="" will mean no gpus
gpu_ids="1,3,5" will mean gpus 1, 3 and 5.

For now, as a workaround for your case, you can give an empty string:

tracker = EmissionsTracker(gpu_ids="")
tracker.start()

When time permits, we will allow passing an array of ints.
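
In the meantime, if you want your script to keep running even when the GPU does not support energy readings, you could also guard the tracker construction yourself. A rough sketch, just ordinary exception handling rather than an official codecarbon feature:

import pynvml
from codecarbon import EmissionsTracker

try:
    # The default constructor probes the GPUs via pynvml
    tracker = EmissionsTracker()
except pynvml.NVMLError:
    # Fall back to CPU/RAM-only tracking when the GPU probe fails
    tracker = EmissionsTracker(gpu_ids="")
tracker.start()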

Hello @inimaz,

I tried it on my system, but it gives the same error again. I also think it would not work, because gpu_ids only accepts a list or None. However, None is not working in my case either.
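
For completeness, both of these variants fail with the same NVMLError_NotSupported on my machine:

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(gpu_ids="")  # suggested workaround: same NVMLError
# tracker = EmissionsTracker(gpu_ids=None)  # default behaviour: same NVMLError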

Screenshot from 2024-03-22 20-44-44

I also tried it on Kaggle, but a little differently this time.
Screenshot from 2024-03-22 20-15-51
As you can see in the screenshot, I selected GPU T4 x2 as the accelerator and passed gpu_ids = "" as an input parameter, but codecarbon is still detecting the GPUs. Previously, when I ran the code on Kaggle, I had not selected any accelerator from the Kaggle menu, and that is why it was not detecting any. If I am correct, the gpu_ids input parameter is not working as expected.

Below you will find the logs of running the code with the GPU T4 x2 accelerator and the gpu_ids = "" parameter.

[codecarbon INFO @ 19:15:11] [setup] RAM Tracking...
[codecarbon INFO @ 19:15:11] [setup] GPU Tracking...
[codecarbon INFO @ 19:15:11] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 19:15:11] [setup] CPU Tracking...
[codecarbon WARNING @ 19:15:11] No CPU tracking mode found. Falling back on CPU constant mode.
[codecarbon WARNING @ 19:15:12] We saw that you have a Intel(R) Xeon(R) CPU @ 2.00GHz but we don't know it. Please contact us.
[codecarbon INFO @ 19:15:12] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.00GHz
[codecarbon INFO @ 19:15:12] >>> Tracker's metadata:
[codecarbon INFO @ 19:15:12] Platform system: Linux-5.15.133+-x86_64-with-glibc2.31
[codecarbon INFO @ 19:15:12] Python version: 3.10.13
[codecarbon INFO @ 19:15:12] CodeCarbon version: 2.3.4
[codecarbon INFO @ 19:15:12] Available RAM : 31.358 GB
[codecarbon INFO @ 19:15:12] CPU count: 4
[codecarbon INFO @ 19:15:12] CPU model: Intel(R) Xeon(R) CPU @ 2.00GHz
[codecarbon INFO @ 19:15:12] GPU count: 2
[codecarbon INFO @ 19:15:12] GPU model: 2 x Tesla T4
[codecarbon INFO @ 19:15:15] Energy consumed for RAM : 0.000000 kWh. RAM Power : 11.759084701538086 W
[codecarbon INFO @ 19:15:15] Energy consumed for all GPUs : 0.000000 kWh. Total GPU Power : 0 W
[codecarbon INFO @ 19:15:15] Energy consumed for all CPUs : 0.000001 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 19:15:15] 0.000001 kWh of electricity used since the beginning.

Logs of running the code with no accelerator and the gpu_ids = "" parameter:

Screenshot from 2024-03-22 20-28-04

Logs of running the code with no accelerator and without the gpu_ids = "" parameter:

Screenshot from 2024-03-22 20-28-04

In the last two cases, you can see that Kaggle gives the same output.

I see the confusion, let me explain:

  • In the case of your local machine, codecarbon detects that there are GPUs, but since you pass gpu_ids="", it does not count them. So it reports 0 W of GPU power and 0 kWh of GPU energy. See this line in the logs you provided:

[codecarbon INFO @ 19:15:15] Energy consumed for all GPUs : 0.000000 kWh. Total GPU Power : 0 W

  • In the case of the Kaggle notebook, codecarbon does not detect any GPU, so any IDs you pass in gpu_ids are meaningless and will not affect the output. You can verify what each environment detects with the sketch below.
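
If you want to double-check what each environment exposes, you can query NVML directly, since pynvml is what codecarbon uses under the hood. A quick sketch:

import pynvml

try:
    pynvml.nvmlInit()
    print("GPUs detected:", pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError:
    print("No NVML-capable GPU detected")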

Hope that clarifies things!

Hi Inimaz,

The error screenshot that I provided with 0 kWh energy consumed is from Kaggle. All the screenshots with a red background are from Kaggle.
kaggleError

All the screenshots below the statement "I also tried it on Kaggle, but a little differently this time." in my previous message are from Kaggle, in different scenarios.
Screenshot from 2024-04-10 02-37-49
This means that passing gpu_ids="" behaves differently on my local machine and on Kaggle.

I hope my issue is clearer this time.

Thanks!