CrashLoopBackoff for NUC11i7 nodes

Question

I'm seeing the following on NUC11i7 nodes:

Traceback (most recent call last):
  File "/app/intel-gpu-exporter.py", line 47, in <module>
    REGISTRY.register(DataCollector(f"http://{host}:{port}/metrics"))
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/registry.py", line 40, in register
    names = self._get_names(collector)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prometheus_client/registry.py", line 80, in _get_names
    for metric in desc_func():
  File "/app/intel-gpu-exporter.py", line 21, in collect
    power_watts = data[1]["power"]["value"]
                  ~~~~~~~^^^^^^^^^
KeyError: 'power'

For NFD annotations, the nodes have:

                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels: intel.feature.node.kubernetes.io/gpu,pci-0300_8086.present,pci-0300_8086.sriov.capable
                    nfd.node.kubernetes.io/worker.version: v0.12.0

And the Intel labels are:

                    feature.node.kubernetes.io/pci-0300_8086.present=true
                    feature.node.kubernetes.io/pci-0300_8086.sriov.capable=true

If it isn't expected to work on this hardware, that's cool too, or maybe I've done something wrong. I just thought I'd give a heads-up.

Answer 1 · 2023-01-27T16:49:34.000Z

FYI, it works fine on NUC10i7 nodes.

Answer 2 · 2023-01-27T16:55:33.000Z

Any help fixing this would be great, because I haven't noticed any issues on my NUC8 and I don't have access to a NUC11. It isn't too difficult to grok what the script is doing. My guess is that the intel_gpu_top included in bullseye doesn't work with 11th-gen hardware, or that the code needs a conditional to handle the differences between generations.

I believe @mitchross also had this issue on his nodes.

Answer 3 · 2023-01-27T17:00:57.000Z

I was trying to figure out the command used in it, but the command doesn't work on any of my nodes, which is weird: if it were required, I'd expect the exporter to fail on all of them. I see the command:

/usr/bin/timeout -k 2 2 /usr/bin/intel_gpu_top -J

Is that the correct command? It looks like I don't have intel-gpu-tools installed on any nodes but it is all detected correctly via intel-device-plugin, etc.

Are the tools a requirement for this?

Answer 4 · 2023-01-27T17:03:16.000Z

The CLI tool is installed in the container. To run it on your machines you will need to install the intel-gpu-tools package. If you could then run intel_gpu_top -J and paste the output here, that would be great.
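For anyone reproducing this outside the container, here is a minimal Python sketch that runs the command quoted above and parses what it prints. It assumes the output is a run of comma-separated JSON objects that may or may not be wrapped in `[` and `]`, depending on the intel-gpu-tools version and on whether `timeout` cuts the process off. The names `parse_samples` and `run_intel_gpu_top` are hypothetical and not taken from the exporter.

```python
import json
import subprocess

# Command quoted earlier in this thread.
CMD = ["/usr/bin/timeout", "-k", "2", "2", "/usr/bin/intel_gpu_top", "-J"]


def parse_samples(raw: str) -> list:
    """Parse intel_gpu_top -J output into a list of sample dicts.

    Assumption: the output is a run of comma-separated JSON objects that may
    or may not be wrapped in '[' ... ']' (the closing bracket can be missing
    when the process is killed by timeout).
    """
    text = raw.strip()
    if text.startswith("["):
        text = text[1:]
    if text.endswith("]"):
        text = text[:-1]
    # Drop a dangling separator left behind when the stream is cut short.
    text = text.strip().rstrip(",")
    if not text:
        return []
    return json.loads("[" + text + "]")


def run_intel_gpu_top() -> list:
    """Run the command and return the parsed samples (needs intel-gpu-tools)."""
    # check=False because timeout exits non-zero when it stops the process.
    result = subprocess.run(CMD, capture_output=True, text=True, check=False)
    return parse_samples(result.stdout)
```

If the process is killed in the middle of an object, `json.loads` will raise `json.JSONDecodeError`; this sketch does not try to recover from that.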

Answer 5 · 2023-01-27T17:05:31.000Z

Yeah, I see no power value in the JSON from the command. Just a warning, this'll be big:

{
	"period": {
		"duration": 0.013862,
		"unit": "ms"
	},
	"frequency": {
		"requested": 0.000000,
		"actual": 0.000000,
		"unit": "MHz"
	},
	"interrupts": {
		"count": 0.000000,
		"unit": "irq/s"
	},
	"rc6": {
		"value": 100.000000,
		"unit": "%"
	},
	"engines": {
		"Render/3D/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Blitter/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/1": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"VideoEnhance/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		}
	}
},
{
	"period": {
		"duration": 1000.496070,
		"unit": "ms"
	},
	"frequency": {
		"requested": 0.000000,
		"actual": 0.000000,
		"unit": "MHz"
	},
	"interrupts": {
		"count": 0.000000,
		"unit": "irq/s"
	},
	"rc6": {
		"value": 100.000000,
		"unit": "%"
	},
	"engines": {
		"Render/3D/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Blitter/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"Video/1": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		},
		"VideoEnhance/0": {
			"busy": 0.000000,
			"sema": 0.000000,
			"wait": 0.000000,
			"unit": "%"
		}
	}
}
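Both samples above have `frequency`, `rc6`, and `engines` but no `power` key, which is what `data[1]["power"]["value"]` in the traceback trips over. One way to tolerate that is to treat `power` as optional, as in the sketch below. This is an illustration only, not necessarily how the exporter was fixed; `extract_metrics` and the output key names are hypothetical.

```python
def extract_metrics(sample: dict) -> dict:
    """Flatten one intel_gpu_top sample, skipping sections that are absent."""
    metrics = {}

    # "power" is missing in the NUC11i7 output shown above, so look it up
    # with .get() instead of indexing, and only emit it when present.
    power = sample.get("power")
    if isinstance(power, dict) and "value" in power:
        metrics["power_watts"] = power["value"]

    rc6 = sample.get("rc6")
    if isinstance(rc6, dict) and "value" in rc6:
        metrics["rc6_percent"] = rc6["value"]

    # Engine names ("Render/3D/0", "Video/0", ...) are taken from the sample
    # itself rather than hard-coded, since they may differ between GPUs.
    for name, stats in sample.get("engines", {}).items():
        if "busy" in stats:
            metrics[f"engine_busy_percent:{name}"] = stats["busy"]

    return metrics
```

With the second sample from the paste, the result would contain `rc6_percent` and one `engine_busy_percent:*` entry per engine, and no `power_watts` entry.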

Answer 6 · 2023-01-27T17:07:01.000Z

Cool, thanks for that. I'll try to get a fix in later.

Answer 7 · 2023-01-27T19:37:02.000Z

Can you test the latest image and see if it works?

ghcr.io/onedr0p/intel-gpu-exporter:rolling@sha256:4e4cf21ce97b20081503ae6b02a27f18881dfdefa80315e5422f279dd11ab002

Answer 8 · 2023-01-27T19:51:34.000Z

It's no longer crashing. I was only starting to look into how to view the data, so I don't have a good way to verify that it's pulling anything for the NUC11i7, but it's a lot better not having it blow up :).

Answer 9 · 2023-01-27T19:56:15.000Z

Yeah, it seems you will be missing power at least. Hope everything else works, though.

Answer 10 · 2023-01-27T21:38:25.000Z

Tested as well. Not crashing.