Various Device Timeouts
iTIcnIgno opened this issue · 0 comments
Been banging my head against a wall for about 2 months with no answers in sight, so today I set out to at least determine a root cause.
Two devices are affected: a UAP-AC-Pro and a USG-Pro4. The AP was monitored by a Zabbix Proxy, and the USG was monitored directly by the Zabbix server I have deployed. Both devices were reporting an MCA timeout. I first noticed this after running into general trouble upgrading my Zabbix server from v6.0 to v6.4. Every attempt at running the upgrade failed miserably, which prompted me to purge all Zabbix packages from the server and start over. I exported only my hosts and templates, figuring that if there was an issue with the database, restoring a bad database would leave me right where I started. I then did a clean install of the beta version of Zabbix 7 and worked on getting everything re-deployed. Enter the issue at hand.
The AC-Pro, which only communicated directly with the Proxy, stopped reporting data. Aside from the Proxy being upgraded to 7.0, nothing changed there; none of the script files were touched. On the Zabbix Server, meanwhile, I copied the latest version of the script from GitHub when I re-deployed.
For the first time in weeks I've had time to sit down and resolve this, but explaining the why in any greater detail is outside of my wheelhouse. I manage multiple Zabbix servers in actual production environments that are also restricted by various security platforms, so the hashes cannot change or I have to go in and revise security policies, and the goal is to do that as little as possible. The bad part is that I don't have a specific date for when I reinstalled on my server and started to experience this issue, but it would have been at some point in March of this year.
To get this issue resolved, I reverted to the copy that I use in my production deployment. My file is dated April 19th 2023, though that date was lost when I transferred the file from my work repository to my home computer. I've attached the older file for comparison purposes, since beyond doing a side-by-side comparison, it's outside of my wheelhouse to explain why a newer line would cause this kind of timeout failure.
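For anyone who wants to do that side-by-side comparison, here is a minimal sketch in Python using only the standard library. It prints a SHA-256 hash for each version (handy when a security platform pins file hashes) and a unified diff between them. The file names and the placeholder script contents below are hypothetical, not the real script; in practice you would read the two actual copies of mca-dump-short from disk.

```python
import difflib
import hashlib


def sha256_of(text):
    """Return the SHA-256 hex digest of a script's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def script_diff(old_text, new_text,
                old_name="mca-dump-short.old",   # hypothetical label
                new_name="mca-dump-short.new"):  # hypothetical label
    """Return a unified diff (list of lines) between two script versions."""
    return list(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    ))


if __name__ == "__main__":
    # Placeholder contents for illustration only; replace with the text of
    # the two real files, e.g. open(path).read().
    old = "#!/bin/bash\necho old\n"
    new = "#!/bin/bash\necho new\n"
    print("old sha256:", sha256_of(old))
    print("new sha256:", sha256_of(new))
    print("".join(script_diff(old, new)), end="")
```

Lines starting with `-` exist only in the older copy and lines starting with `+` only in the newer one, which should narrow down which change to look at first.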
Furthermore, the newer script was working without fail for several other devices. A US-24-250W, a separate UAP-AC-Pro in a different location, and a UAP-HD were all reporting correctly. After rolling the script back to this older version, all of those devices are still reporting everything properly, and the two affected devices have now started reporting properly as well. Firmware updates, firmware rollbacks, Zabbix updates, etc.: none of it made any difference for the two affected devices until I switched to this older mca-dump-short version. From what I can tell, ssh-run had no bearing on this.
Posting this here in the hope that it helps track down other timeout issues or moves things towards a resolution. My own issue might be resolved, but there seems to be a larger issue still to be fixed, and I wanted to bring it to attention in the hope that it sheds more light on what is probably affecting more than just me at home.