Sometimes the exporter may hang on /metrics indefinitely
sereinity opened this issue · 6 comments
Hello,
First, thank you for this great software that helps us monitor our stacks.
Summary
Sometimes fpm-exporter stops working properly and hangs when querying /metrics,
appearing as an unhealthy target in Prometheus.
Context
We use the exporter as a sidecar on a Kubernetes engine.
We have 16 fpm-exporters per environment with the exact same configuration, but some (one or two) of them don't behave correctly:
Symptoms
When we query the exporter on the root (curl localhost:9253/) it responds that the metrics live at /metrics,
but querying /metrics never returns; over time those instances take more and more RAM (linearly — I haven't set memory limits yet).
Killing fpm or fpm-exporter (running kill 1
in the containers) may fix it (I say "may" because memory usage remains higher than on clean instances, and those instances look more prone to hang again)
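As a side note, one way I distinguish a hanging endpoint from a refused connection is to wrap the request in a client-side timeout: GNU `timeout` exits with code 124 when the wrapped command never finishes. A minimal sketch (the `sleep` stands in for the never-returning /metrics request; the real check against the pod is shown in the comment):

```shell
#!/bin/sh
# Detect a hanging endpoint: `timeout` returns exit code 124 when the
# wrapped command does not finish in time. Against a real instance you
# would run something like (host/port from my setup, adjust to yours):
#   timeout 5 curl -sS http://localhost:9253/metrics
# Here `sleep 5` simulates the /metrics request that never answers.
timeout 1 sleep 5
rc=$?
if [ "$rc" -eq 124 ]; then
  echo "endpoint hang detected (timed out)"
else
  echo "endpoint answered, exit code $rc"
fi
```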
Here are CPU and memory graphs of a clean instance
Here are CPU and memory graphs of an unhealthy instance
Timeline:
- We can see the killing of php-fpm at 9:52
- 40 minutes later, at 10:29, this exact instance stopped working again, while all 15 others had no problems
- then I killed fpm at 10:55, which also restored it
- … until 11:34 (again ~40 minutes, and still no problem on the other fpm-exporters)
- then I killed both fpm and fpm-exporter at 12:06 (this time it did re-create the container inside the pod, and it took <10 MB)
- it broke again at 12:08
- I restarted fpm at 12:25 (I planned to stop here, but since the memory usage was low this time I decided to try one last time 🤷♀️) - it worked until 13:23
Note: there are absolutely no relevant logs, only connection refused
when restarting fpm. (I plan to test with more verbosity.)
Conf sample
---
kind: Deployment
# …
spec:
  template:
    spec:
      containers:
        - name: fpm
          # …
        - name: fpm-exporter
          env:
            - name: PHP_FPM_SCRAPE_URI
              value: tcp://127.0.0.1:9001/status # 9001 is only used to expose the status
            - name: PHP_FPM_FIX_PROCESS_COUNT
              value: "true"
          image: hipages/php-fpm_exporter:2.2.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /
              port: fpm-metrics
              scheme: HTTP
          ports:
            - containerPort: 9253
              name: fpm-metrics
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: fpm-metrics
              scheme: HTTP
What can I do?
We can see that healthy instances take less than 10 MB of memory, but even after a restart the previously unhealthy instance still takes 60 MB; I would love to set a memory limit, but I'm unsure about setting it to something like 32 MB.
I could also point the liveness probe at /metrics,
but I'm unsure whether that's a good idea.
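For what it's worth, if probing /metrics turns out to be safe, the probe plus a memory limit could look roughly like this (the values `64Mi`, `timeoutSeconds: 5`, and `failureThreshold: 3` are untested guesses on my part, not recommendations):

```yaml
- name: fpm-exporter
  # …
  resources:
    limits:
      memory: 64Mi        # guess: above the healthy ~10 MB, below the leaky 60+ MB
  livenessProbe:
    httpGet:
      path: /metrics      # probe the endpoint that actually hangs
      port: fpm-metrics
      scheme: HTTP
    timeoutSeconds: 5     # fail fast instead of waiting forever
    failureThreshold: 3   # restart after 3 consecutive timeouts
```

That way Kubernetes would restart the container on its own instead of me running kill 1 by hand.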
I'm ready to read your suggestions, folks!
I enabled debugging while reproducing the issue. But the logs aren't very helpful:
time="2022-09-12T15:00:00Z" level=debug msg="Pool[tcp://127.0.0.1:9001/status]: {\"pool\":\"www\",\"process manager\":\"dynamic\",\"start time\":1662994578,\"start since\":222,\"accepted conn\":23,\"listen queue\":0,\"max listen queue\":0,\"listen queue len\":511,\"idle processes\":2,\"active processes\":0,\"total processes\":2,\"max active processes\":1,\"max children reached\":0,\"slow requests\":0,\"processes\":[{\"pid\":15,\"state\":\"Idle\",\"start time\":1662994578,\"start since\":222,\"requests\":12,\"request duration\":20914,\"request method\":\"GET\",\"request uri\":\"/index.php\",\"content length\":0,\"user\":\"-\",\"script\":\"/app/public/index.php\",\"last request cpu\":95.63,\"last request memory\":4194304},{\"pid\":16,\"state\":\"Idle\",\"start time\":1662994578,\"start since\":222,\"requests\":11,\"request duration\":20035,\"request method\":\"GET\",\"request uri\":\"/index.php\",\"content length\":0,\"user\":\"-\",\"script\":\"/app/public/index.php\",\"last request cpu\":99.83,\"last request memory\":4194304}]}"
time="2022-09-12T15:00:00Z" level=debug msg="Updated 1 pool(s) in 2.813443ms"
I got only 12 of these log entries on the failing instance, while on a normal fpm-exporter I got one of them per scrape.