hipages/php-fpm_exporter

How to handle timeout errors

szokeptr opened this issue · 8 comments

First of all, thank you for this project, it really helped us during our migration to Kubernetes.

We are running our API in a container based on php:7.3-fpm-alpine and having Prometheus pull FPM metrics through this container (hipages/php-fpm_exporter:1). We have an HPA for this deployment that uses the phpfpm_active_processes metric to scale (pm is set to static).
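For context, the setup looks roughly like the manifest below. This is a hedged example, not the actual manifest: names, replica counts, and the target value are placeholders, and it assumes a custom-metrics adapter (e.g. prometheus-adapter) is exposing `phpfpm_active_processes` to the HPA.

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: phpfpm_active_processes
        target:
          type: AverageValue
          averageValue: "40"
```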

Yesterday we had a huge traffic bump, a DDoS of sorts, and our stack failed spectacularly: the ReplicaSet failed to scale because Prometheus could not pull metrics. Investigating the logs, we found a bunch of `dial tcp 127.0.0.1:9000: i/o timeout` errors.

My question is: is there a way to guarantee that metrics reach Prometheus even if the FPM container is under heavy load? Maybe a way to increase the timeout?

Let me know if more details are needed.

@szokeptr We had similar issues and ended up having enough headroom to scale before reaching 100%.

The problem is that PHP-FPM doesn't have a dedicated thread for serving metrics (`/status`); status requests sit in the same queue as regular requests. With a large backlog, a status request can therefore take longer than the timeout currently configured in php-fpm_exporter.

We could make the timeout configurable, so it can be aligned with the Prometheus scrape interval. The relevant call is currently hard-coded to 3 seconds:

```go
fcgi, err := fcgiclient.DialTimeout(scheme, address, time.Duration(3)*time.Second)
```
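For illustration, a minimal sketch of what that could look like, assuming a hypothetical `PHP_FPM_FCGI_TIMEOUT` environment variable (not an existing exporter option):

```go
package main

import (
	"log"
	"os"
	"time"

	fcgiclient "github.com/tomasen/fcgi_client"
)

// fcgiTimeout reads a duration such as "10s" from the (hypothetical)
// PHP_FPM_FCGI_TIMEOUT variable, falling back to the current 3s default.
func fcgiTimeout() time.Duration {
	if v := os.Getenv("PHP_FPM_FCGI_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 3 * time.Second
}

func main() {
	// Dial php-fpm with the configurable timeout instead of the fixed 3s.
	fcgi, err := fcgiclient.DialTimeout("tcp", "127.0.0.1:9000", fcgiTimeout())
	if err != nil {
		log.Fatalf("dial php-fpm: %v", err)
	}
	defer fcgi.Close()
	// ... issue the /status request as the exporter normally would.
}
```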

Thoughts?

@estahn I think making the timeout configurable would help, since the issue goes away if the deployment can scale up in time.
For now we'll probably use some other metric for scaling to avoid this issue.

Anyway, thanks for the quick reply!

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

flylan commented

This is an interesting question that I have also thought about before. I once considered using a dedicated pool to serve the status page, but unfortunately PHP-FPM's status page can only report on the pool it belongs to, so it cannot be isolated in a separate pool. PHP-FPM also doesn't prioritize status requests. Compare MySQL, which reserves a connection for the root user even under high load, so an administrator can log in and rescue the situation rather than watch the server die; PHP-FPM has no such mechanism, which is silly.

flylan commented

Actually, I feel php-fpm_exporter could add a fallback mechanism: if the status request times out, and still times out after a configured number of retries, return the pool's pm.max_children value as the number of active processes and 0 as idle. This simulated response would at least let the HPA keep working. Returning no data at all leaves Prometheus unable to collect anything, and Kubernetes then computes an abnormal active-process ratio over that window, triggering unpredictable behaviour.
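A minimal sketch of that fallback idea, not existing exporter code; `scrapeStatus` and the `maxChildren` parameter are hypothetical stand-ins:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// PoolStatus holds the two fields the HPA cares about.
type PoolStatus struct {
	ActiveProcesses int
	IdleProcesses   int
}

// scrapeStatus stands in for a real FastCGI call to php-fpm's /status page.
func scrapeStatus(timeout time.Duration) (PoolStatus, error) {
	return PoolStatus{}, errors.New("i/o timeout") // simulate an overloaded pool
}

// statusWithFallback retries the scrape and, if every attempt times out,
// reports the pool as fully busy so autoscaling keeps working.
func statusWithFallback(attempts int, timeout time.Duration, maxChildren int) PoolStatus {
	for i := 0; i < attempts; i++ {
		if s, err := scrapeStatus(timeout); err == nil {
			return s
		}
	}
	// All attempts failed: report the worst case instead of nothing.
	return PoolStatus{ActiveProcesses: maxChildren, IdleProcesses: 0}
}

func main() {
	s := statusWithFallback(3, 2*time.Second, 50) // 50 = pm.max_children
	fmt.Printf("active=%d idle=%d\n", s.ActiveProcesses, s.IdleProcesses)
}
```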

flylan commented

I think providing snapshot data would solve this problem. The exporter runs as a sidecar and serves snapshot data externally: even if the status endpoint responds slowly while the business container is under high load, the exporter still exposes the last snapshot to the outside world, while regularly refreshing that snapshot from PHP-FPM in the background. That guarantees a millisecond-level response.

This way, the situation shown in the screenshot below would not occur.

[screenshot attached in the original comment]
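A minimal sketch of the snapshot approach, with a hypothetical `fetchStatus` helper standing in for the real FastCGI scrape: a background goroutine polls php-fpm on its own schedule, and the `/metrics` handler always answers from the last good snapshot, so a slow php-fpm never blocks a Prometheus scrape.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

type Snapshot struct {
	Active, Idle int
	Taken        time.Time
}

var (
	mu   sync.RWMutex
	last Snapshot
)

// fetchStatus stands in for a real FastCGI scrape of /status.
func fetchStatus() (Snapshot, error) {
	return Snapshot{Active: 7, Idle: 3, Taken: time.Now()}, nil
}

// poll refreshes the snapshot in the background; a slow or failed scrape
// only delays the next update, it never blocks a Prometheus request.
func poll(interval time.Duration) {
	for {
		if s, err := fetchStatus(); err == nil {
			mu.Lock()
			last = s
			mu.Unlock()
		}
		time.Sleep(interval)
	}
}

func main() {
	go poll(5 * time.Second)
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		s := last
		mu.RUnlock()
		fmt.Fprintf(w, "phpfpm_active_processes %d\nphpfpm_idle_processes %d\n", s.Active, s.Idle)
	})
	log.Fatal(http.ListenAndServe(":9253", nil))
}
```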

flylan commented

Right now the exporter fetches data from php-fpm in real time. Under high load, php-fpm responds very slowly, so the exporter cannot answer Prometheus's scrape within the expected time, which ultimately makes the HPA unreliable. This is why I gave up on hipages/php-fpm_exporter (other PHP-FPM exporters have the same issue), and I plan to develop one myself.

flylan commented

Fortunately, PHP officially added a configuration option in PHP 8.0 to address this issue (pm.status_listen, which serves the status page on a dedicated socket):

https://www.php.net/manual/en/migration80.new-features.php#migration80.new-features.fpm

Therefore, on versions below PHP 8, this can only be solved through snapshots.
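For reference, a sketch of a pool configuration using that option; addresses and values are examples, not recommendations:

```ini
; PHP >= 8.0: pm.status_listen serves the status page on its own socket,
; so it stays responsive even when all regular workers are busy.
[www]
pm = static
pm.max_children = 50
pm.status_path = /status
; Dedicated endpoint for status requests (PHP 8.0+):
pm.status_listen = 127.0.0.1:9001
```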