[Failed-Request-Alert-Tuning] Investigate and Mitigate Excessive failed-request-too-high Alerts in the Alerting System

quiet-node opened this issue 2 months ago · 0 comments

Problem

The alerting system on Grafana is firing an excessive number of failed-request-too-high alerts. To keep alerts reliable and actionable, we need to investigate the root cause of the elevated failure rates and assess whether the alert thresholds or mechanisms need refinement.

Initial analysis identifies the eth_getBlockByHash and eth_getBlockByNumber endpoints as the primary drivers of the issue; they account for a significant share of the recurring errors.

Solution

Analyze the logs to determine the root cause of the failed requests and identify the specific response codes being returned. Based on these findings, locate and fix the bugs causing the failures. This should reduce the volume of noisy alerts so that the alerting system stays focused on critical issues.
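As a rough illustration of the log-analysis step, the sketch below tallies failed requests per method and response code. The log line format (`[method] status=NNN`) and the sample lines are assumptions made for the example; the real log format is not specified in this issue and the pattern would need adjusting to match it.

```python
import re
from collections import Counter

# Hypothetical log line format (an assumption, not the actual relay format):
#   "<timestamp> [<method>] status=<code> ..."
LINE_PATTERN = re.compile(r"\[(?P<method>eth_\w+)\].*?status=(?P<status>\d{3})")


def tally_failures(lines):
    """Count failed requests (status >= 400) per (method, status) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not describe a request
        status = int(match.group("status"))
        if status >= 400:
            counts[(match.group("method"), status)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_lines = [
    "2024-01-01T00:00:00Z [eth_getBlockByHash] status=500 duration=120ms",
    "2024-01-01T00:00:01Z [eth_getBlockByNumber] status=200 duration=35ms",
    "2024-01-01T00:00:02Z [eth_getBlockByHash] status=500 duration=98ms",
    "2024-01-01T00:00:03Z [eth_getTransactionReceipt] status=500 duration=40ms",
]

# Print the noisiest (method, status) pairs first.
for (method, status), count in tally_failures(sample_lines).most_common():
    print(method, status, count)
```

Sorting by count puts the methods that contribute most to the alert at the top, which is where the investigation would likely start.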

Alternatives

No response

Tasks

[Failed-Request-Alert-Tuning] Traverse through logs to identify and document all issues contributing to the "failed-request-too-high" alerts.
[Failed-Request-Alert-Tuning] eth_getBlockByHash && eth_getBlockByNumber method returns too many 500 errors #3345 (bug)
[Failed-Request-Alert-Tuning] eth_getTransactionReceipt method returns too many 500 errors #3351 (bug)
[Failed-Request-Alert-Tuning] Enhanced retry mechanism for MN contract results to poll until records are fully mature #3366 (enhancement)
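
The retry task (#3366) describes polling until records are fully mature. A minimal sketch of that general idea follows; it is not the actual implementation. The function name, the `fetch` and `is_mature` callables, the `transaction_index` field, and the attempt and delay values are all hypothetical and chosen only for illustration.

```python
import time


def poll_until_mature(fetch, is_mature, max_attempts=5, delay_seconds=0.5,
                      sleep=time.sleep):
    """Call fetch() until is_mature(result) holds or attempts run out.

    Returns the first mature result, or None if none arrived in time.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None and is_mature(result):
            return result
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before the next poll
    return None


# Simulated responses: missing record, immature record, mature record.
# The "transaction_index" field is a hypothetical maturity marker.
responses = iter([None, {"transaction_index": None}, {"transaction_index": 3}])

record = poll_until_mature(
    fetch=lambda: next(responses),
    is_mature=lambda r: r["transaction_index"] is not None,
    delay_seconds=0,  # no real waiting in this demo
)
print(record)
```

Returning None after the last attempt, rather than raising, leaves it to the caller to decide whether an immature record should surface as an error response or as an empty result.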