onedr0p/intel-gpu-exporter

CrashLoopBackoff for NUC11i7 nodes

Closed this issue · 10 comments

I'm seeing the following on NUC11i7 nodes:

Traceback (most recent call last):
  File "/app/intel-gpu-exporter.py", line 47, in <module>
    REGISTRY.register(DataCollector(f"http://{host}:{port}/metrics"))
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/registry.py", line 40, in register
    names = self._get_names(collector)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/registry.py", line 80, in _get_names
    for metric in desc_func():
  File "/app/intel-gpu-exporter.py", line 21, in collect
    power_watts = data[1]["power"]["value"]
                  ~~~~~~~^^^^^^^^^
KeyError: 'power'

For NFD annotations, the nodes have:

                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: intel.feature.node.kubernetes.io/gpu,pci-0300_8086.present,pci-0300_8086.sriov.capable
                    nfd.node.kubernetes.io/worker.version: v0.12.0

And the Intel labels are here:

                    feature.node.kubernetes.io/pci-0300_8086.present=true
                    feature.node.kubernetes.io/pci-0300_8086.sriov.capable=true

If it shouldn't work with it that's cool too, or maybe I've done something wrong. I just thought I'd give a heads up.

FYI, it works fine on NUC10i7 nodes.

Any help fixing this would be great, because I haven't noticed any issues on my NUC8 and I don't have access to a NUC11. The script is short enough that it shouldn't be too hard to grok what is happening. My guess is that either the intel_gpu_top included in bullseye doesn't work with 11th-gen hardware, or the code needs a conditional that handles each generation separately.

I believe @mitchross also had this issue on his nodes.

I was trying to figure out which command the exporter uses, but that command doesn't work on any of my nodes. That's weird, because if the tool were required on the host I'd expect the exporter to fail on all of them. I see the command:

/usr/bin/timeout -k 2 2 /usr/bin/intel_gpu_top -J

Is that the correct command? It looks like I don't have intel-gpu-tools installed on any nodes, but the GPUs are all detected correctly via intel-device-plugin, etc.

Are the tools a requirement for this?

The CLI tool is installed in the container. To run it on your machines, you will need to install the intel-gpu-tools package. If you could then run intel_gpu_top -J and paste the output here, that would be great.

Yeah, I see no power value in the JSON from the command. Just a warning, this'll be big:

{
	"period": {
		"duration": 0.013862,
		"unit": "ms"
	},
	"frequency": {
		"requested": 0.000000,
		"actual": 0.000000,
		"unit": "MHz"
	},
	"interrupts": {
		"count": 0.000000,
		"unit": "irq/s"
	},
	"rc6": {
		"value": 100.000000,
		"unit": "%"
	},
	"engines": {
		"Render/3D/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Blitter/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/1": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"VideoEnhance/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		}
	}
},
{
	"period": {
		"duration": 1000.496070,
		"unit": "ms"
	},
	"frequency": {
		"requested": 0.000000,
		"actual": 0.000000,
		"unit": "MHz"
	},
	"interrupts": {
		"count": 0.000000,
		"unit": "irq/s"
	},
	"rc6": {
		"value": 100.000000,
		"unit": "%"
	},
	"engines": {
		"Render/3D/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Blitter/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/1": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"VideoEnhance/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		}
	}
}
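To check a paste like the one above for missing sections without reading it by eye, a minimal Python sketch follows. It assumes the output arrives as comma-separated JSON objects with no enclosing brackets, as in this paste. The helper names `parse_samples` and `missing_sections` are hypothetical and not part of the exporter, and the sample is abbreviated from the output above.

```python
import json


def parse_samples(raw: str) -> list[dict]:
    """Parse comma-separated JSON objects (as pasted above) into a list."""
    text = raw.strip()
    # Assumption: the objects are not wrapped in brackets, so add them
    # to form a valid JSON array before decoding.
    if not text.startswith("["):
        text = "[" + text.rstrip(",") + "]"
    return json.loads(text)


def missing_sections(sample: dict, expected: tuple[str, ...]) -> list[str]:
    """Return the expected top-level sections that the sample lacks."""
    return [name for name in expected if name not in sample]


# Abbreviated version of the NUC11i7 output pasted above.
RAW = """
{
    "period": {"duration": 0.013862, "unit": "ms"},
    "rc6": {"value": 100.0, "unit": "%"},
    "engines": {"Render/3D/0": {"busy": 0.0, "unit": "%"}}
},
{
    "period": {"duration": 1000.496070, "unit": "ms"},
    "rc6": {"value": 100.0, "unit": "%"},
    "engines": {"Render/3D/0": {"busy": 0.0, "unit": "%"}}
}
"""

samples = parse_samples(RAW)
# The traceback indexes data[1], so inspect the second sample.
print(missing_sections(samples[1], ("power", "rc6", "engines")))  # ['power']
```

Running this against the full paste should give the same result, since neither sample contains a "power" section.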

Cool, thanks for that. I'll try to get a fix in later.

Can you test the latest image and see if it works?

ghcr.io/onedr0p/intel-gpu-exporter:rolling@sha256:4e4cf21ce97b20081503ae6b02a27f18881dfdefa80315e5422f279dd11ab002

It's no longer crashing. I was only starting to look into how to view the data, so I don't have a good way to verify that it's pulling anything for the NUC11i7, but it's a lot better not having it blow up :).

Yeah, it seems you will be missing power at least. Hope everything else works though.
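For anyone curious how an exporter can tolerate a missing "power" section, here is a rough Python sketch. It is not necessarily how the image above does it. The names `read_value` and `collect_gauges` and the metric names are made up for illustration, and plain dicts stand in for Prometheus metric objects.

```python
from typing import Optional


def read_value(sample: dict, section: str, field: str = "value") -> Optional[float]:
    """Return sample[section][field] as a float, or None if either is absent."""
    entry = sample.get(section)
    if not isinstance(entry, dict) or field not in entry:
        return None
    return float(entry[field])


def collect_gauges(sample: dict) -> dict[str, float]:
    """Build a metric-name -> value mapping, skipping sections the GPU lacks."""
    gauges = {}
    # Hypothetical metric names; the real exporter may use different ones.
    for metric_name, section in (
        ("igpu_power_watts", "power"),
        ("igpu_rc6_percent", "rc6"),
    ):
        value = read_value(sample, section)
        if value is not None:
            gauges[metric_name] = value
    return gauges


# A sample shaped like the NUC11i7 output: "rc6" present, "power" absent.
nuc11_sample = {"rc6": {"value": 100.0, "unit": "%"}}
print(collect_gauges(nuc11_sample))  # {'igpu_rc6_percent': 100.0}
```

Skipping the metric rather than reporting 0 means dashboards show a gap for power on those nodes instead of a misleading zero reading.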

Tested as well. Not crashing.