[Bug]: Windows node has gone stale and no longer showing in the parent Netdata instance.
Closed this issue · 9 comments
Bug description
A Windows node running the Prometheus endpoint, which was added as a vnode, now shows as 'stale' in Netdata Cloud (NC). It used to show, but for some reason it no longer does.
It also does not show at the end of the list in the Linux node which is scraping its data. I can connect
Expected behavior
I should actually be seeing 2 windows nodes at the bottom of the screenshot on the right side. As you can see in the screenshot neither are showing.
Steps to reproduce
Try to view the windows node in the list under the linux host that is configured locally to run netdata.
This is the same output when viewing via the cloud login or going direct to the local IP of netdata.
I can also curl to the prometheus endpoint running on the windows host.
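The curl check can also be scripted from the Linux host. This is a minimal sketch, assuming the Windows exporter serves the standard Prometheus text format; `count_samples` and `fetch_metrics` are hypothetical helper names (not part of Netdata), and the URL in the usage note is a placeholder.

```python
from urllib.request import urlopen


def count_samples(text: str) -> int:
    """Count metric sample lines in Prometheus text exposition output."""
    count = 0
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if line and not line.startswith("#"):
            count += 1
    return count


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Fetch the raw metrics page; raises on connection errors or timeouts."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `count_samples(fetch_metrics("http://<windows-host>:<port>/metrics"))` with the real host and port should return a non-zero number if the endpoint is reachable. That only confirms the endpoint responds; it does not show that the collector on the Linux host is actually scraping it.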
Screenshots
![here there should be 2 windows nodes showing](https://private-user-images.githubusercontent.com/8769856/311286107-57287037-3f52-42c6-8a85-66fbdd4e9be8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk5OTYzMzQsIm5iZiI6MTcxOTk5NjAzNCwicGF0aCI6Ii84NzY5ODU2LzMxMTI4NjEwNy01NzI4NzAzNy0zZjUyLTQyYzYtOGE4NS02NmZiZGQ0ZTliZTgucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDNUMDg0MDM0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9Y2E1NTc0YmQ0YWExNjM5MTRlMTQ2YTk0MDAzYTdlOWIxNTRjZDczZDQwM2I5MWQyYWY2ZTI1ZWI1ZmJiMTJmZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ._Oi8tGYOS-77FHoasaqpcFs-Oa1hkGJWW3HwfTYzn30)
![windows and vnode config files](https://private-user-images.githubusercontent.com/8769856/311286116-c227a77d-4cba-4c1e-881f-532550d62a9c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk5OTYzMzQsIm5iZiI6MTcxOTk5NjAzNCwicGF0aCI6Ii84NzY5ODU2LzMxMTI4NjExNi1jMjI3YTc3ZC00Y2JhLTRjMWUtODgxZi01MzI1NTBkNjJhOWMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDNUMDg0MDM0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NWRjMjE4OTYzZmJhYzljZDYwZDlkY2M0YjA5M2U5ZmM0ZmRiNmRhNWY0MDcwNGNjOGQyYzdiZTYxNWUyOTI1NyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.JQszOD5vgp30g2g3PkmvtOpQawcK3BGGjTIulLwEhRE)
![metrics access ok](https://private-user-images.githubusercontent.com/8769856/311286122-f35cb63c-feec-48a3-ac22-95a6ba1038e2.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk5OTYzMzQsIm5iZiI6MTcxOTk5NjAzNCwicGF0aCI6Ii84NzY5ODU2LzMxMTI4NjEyMi1mMzVjYjYzYy1mZWVjLTQ4YTMtYWMyMi05NWE2YmExMDM4ZTIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDNUMDg0MDM0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9ZTE4MjdiOTM1NWNmNTMxNWJkZWQzZDVhZjlhNTk2MzI0NjEwNDI1ZjI4YTQwYmIxMGRlYTgxZjAwZjQ3MzZkMiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.hDbGuJBu_-4O4xwR0MTyXyHYFB2DgL18xs5xOEb0Ivo)
![vnode shows as stale](https://private-user-images.githubusercontent.com/8769856/311286124-8562d84d-7a50-4c5d-aaf2-dd085aec9f95.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk5OTYzMzQsIm5iZiI6MTcxOTk5NjAzNCwicGF0aCI6Ii84NzY5ODU2LzMxMTI4NjEyNC04NTYyZDg0ZC03YTUwLTRjNWQtYWFmMi1kZDA4NWFlYzlmOTUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDNUMDg0MDM0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NWE5N2Q4NmMwNTEyYTA2NjI1MjRmMGFjMmQ3YmNhNTc2MDQzYmIyNGI5NDA5MWQ0OWE3MGE1NDY2OGY5YjY2OSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.bCPDDrJYOq2u9dOLhhHQzk2QfxQCEBKxhzjXXIClsYI)
![curl to the windows endpoint](https://private-user-images.githubusercontent.com/8769856/311303899-bb5a9235-e4a6-476e-8d34-e9fe963ed4e2.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk5OTYzMzQsIm5iZiI6MTcxOTk5NjAzNCwicGF0aCI6Ii84NzY5ODU2LzMxMTMwMzg5OS1iYjVhOTIzNS1lNGE2LTQ3NmUtOGQzNC1lOWZlOTYzZWQ0ZTIucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcwMyUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MDNUMDg0MDM0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NmJmNDA3ODFlNWY2MmQxOGM3NDk5NTIyOTQwYmI1NmEzMjcyNDZjZjBlZWNmMmQ1OTU5NTRhZjMzZjhjZGQwZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.NDImPvSZkAFzsATxaGpScbpfe7gMZg5dQYZi7ZJtlnc)
Error Logs
No response
Desktop
OS: macOS
Browser: Firefox
Browser Version: 123.0 (64-bit)
Additional context
No response
@mcdent do you have the same issue if you open your Linux machine local agent dashboard?
Have you also tried to run the go.d.plugin in debug mode from the Linux machine? https://learn.netdata.cloud/docs/collecting-metrics/windows-systems/windows#debug-mode
Thanks for the tip @hugovalente-pm. I recreated the windows.conf and vnode.conf files with fresh UIDs, and now there seem to be no errors in debug mode.
My remaining problem is that the node marked as 'stale' is not collecting/sending any data, even though I can curl the Prometheus metrics endpoint from the Linux host. It still shows as stale in the Linux host dashboard.
Is the fact that it is marked as stale preventing this?
One other thing, no hardware spec of the windows machine is showing, I assume this is a limitation of prometheus?
Thanks
Mike
> My remaining problem is that the node marked as 'stale' is not collecting/sending any data, even though I can curl the Prometheus metrics endpoint from the Linux host. It still shows as stale in the Linux host dashboard.

Can you try having only that node configured on your Linux machine and then run debug mode? Also, try accessing your Linux machine's local agent dashboard.
> Is the fact that it is marked as stale preventing this?

Stale means your Linux node has past data for that node but it doesn't "find" any current data.
A gut feeling: could there be a clock sync issue on that machine?
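To make the two ideas above concrete, here is a small illustrative sketch. It is not Netdata's actual staleness logic: the `max_age` threshold and both function names are assumptions chosen for the example. It shows how a node with only old samples would be classed as stale, and how a clock offset between the two machines could be measured.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_sample: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the newest stored sample is older than max_age."""
    return now - last_sample > max_age


def clock_skew(local_now: datetime, remote_now: datetime) -> timedelta:
    """Absolute difference between two clocks read at about the same moment."""
    return abs(local_now - remote_now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical values: last sample 10 minutes ago, 1 minute tolerance.
    print(is_stale(now - timedelta(minutes=10), now, timedelta(minutes=1)))
```

If the Windows clock were far enough off, timestamps derived from it could make fresh data look old, which is why checking the skew is a cheap first step.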
> One other thing, no hardware spec of the windows machine is showing, I assume this is a limitation of prometheus?

Yes and no. At the moment we only get those specs for nodes where the agent is running locally. We know we need to find a way to get this for remote Windows machines, but it is something currently on our backlog.
> My remaining problem is that the node marked as 'stale' is not collecting/sending any data, even though I can curl the Prometheus metrics endpoint from the Linux host. It still shows as stale in the Linux host dashboard.
>
> Can you try having only that node configured on your Linux machine and then run debug mode? Also, try accessing your Linux machine's local agent dashboard.
OK, doing this has now brought the errant asdf123 Windows machine back. I have not yet tried adding the config file again for the Windows box wiresx1; I will do that next.
Notice that although asdf123 is now reporting, it still shows as stale.
> Is the fact that it is marked as stale preventing this?
>
> Stale means your Linux node has past data for that node but it doesn't "find" any current data. A gut feeling: could there be a clock sync issue on that machine?
I checked the time on both Windows boxes. Both are set to automatic, and the time seems to align with that of the Linux box collecting the stats.
> One other thing, no hardware spec of the windows machine is showing, I assume this is a limitation of prometheus?
>
> Yes and no. At the moment we only get those specs for nodes where the agent is running locally. We know we need to find a way to get this for remote Windows machines, but it is something currently on our backlog.
Is this why I am missing memory stats, for one thing?
I also note that the average CPU utilisation does not seem to align with that of the Windows 11 PC.
Task Manager on Windows shows a fairly steady 20% CPU (whilst running some programs); however, looking back over the same period, Netdata shows it as much lower.
Am I expecting too much from the Windows clients? I don't really want to run the full agent on these boxes if possible.
Thanks
Mike
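On the CPU mismatch described above, one way to cross-check is to compute utilisation directly from two scrapes of the endpoint and compare it with both Task Manager and Netdata. This is a minimal sketch, assuming the exporter exposes cumulative CPU seconds per mode (including an `idle` mode), as Prometheus-style exporters commonly do; the function name and the mode labels in the usage note are hypothetical.

```python
from typing import Mapping


def cpu_busy_percent(prev: Mapping[str, float], curr: Mapping[str, float]) -> float:
    """Busy CPU % between two samples of cumulative per-mode CPU seconds."""
    # Time spent in each mode between the two scrapes.
    deltas = {mode: curr[mode] - prev.get(mode, 0.0) for mode in curr}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    # Everything that is not idle time counts as busy.
    return 100.0 * (1.0 - deltas.get("idle", 0.0) / total)
```

For example, with `prev = {"idle": 100.0, "user": 20.0, "privileged": 5.0}` and `curr = {"idle": 108.0, "user": 21.5, "privileged": 5.5}`, 8 of the 10 elapsed CPU-seconds were idle, so the result is about 20%. Summing the counters across all cores before calling the function gives a whole-machine figure; a per-core or differently averaged view can legitimately show other numbers.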
OK, let's update this here. I think I have 3 days left on my trial and have been unable to add more than 1 Windows node at a time.
I tried adding a completely new Windows 10 Pro machine and installed the Prometheus endpoint via the instructions; I can curl to the endpoint just fine. I added the new Windows node to windows.conf and vnode.conf and restarted Netdata.
Now I see the last working Windows node as stale and my new one shows up. What am I doing wrong?
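The thread does not establish a cause for this. One thing that may be worth ruling out when only one node works at a time is two entries sharing the same identifier. The sketch below assumes each vnode entry carries a `hostname` and a `guid` field (an assumption about the config layout, not confirmed in this thread) and represents entries as plain dictionaries; both helper names are hypothetical.

```python
import uuid
from collections import Counter


def duplicate_values(entries, key):
    """Return values of `key` that appear in more than one entry."""
    counts = Counter(e[key] for e in entries if key in e)
    return sorted(v for v, n in counts.items() if n > 1)


def new_guid() -> str:
    """Generate a fresh random GUID string for a new entry."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Placeholder entries; the guid values here are made up for illustration.
    entries = [
        {"hostname": "asdf123", "guid": new_guid()},
        {"hostname": "wiresx1", "guid": new_guid()},
    ]
    print(duplicate_values(entries, "guid"))
    print(duplicate_values(entries, "hostname"))
```

Two empty lists mean no identifier is reused. If a GUID or hostname were copied from one entry to another, it would show up here.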