hipages/php-fpm_exporter

Sometimes the exporter may hang on /metrics indefinitely

sereinity opened this issue · 6 comments

Hello,

First, thank you for this great software that helps us monitor our stacks.

Summary

Sometimes fpm-exporter misbehaves and hangs when querying /metrics, appearing as an unhealthy target in Prometheus.

Context

We use the exporter as a sidecar on a Kubernetes engine.
We have 16 fpm-exporter instances per environment with the exact same configuration, but some of them (one or two) don't behave correctly:

Symptoms

When we query an exporter on the root path (curl localhost:9253/), it responds that the metrics live at /metrics, but querying that path never returns; and over time those instances take more and more RAM (linearly; I haven't set memory limits yet).

Killing fpm or fpm-exporter (running kill 1 in the containers) may fix it (I say may because memory usage remains higher than on clean instances, and those instances seem more prone to hang again).
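To tell a genuine hang apart from a merely slow response, a bounded probe helps; here is a small sketch, assuming the exporter listens on port 9253 as in the config below:

```shell
# check_metrics: probe the exporter's /metrics endpoint with a hard
# timeout, so a hung endpoint fails fast instead of blocking the shell.
check_metrics() {
  local port="${1:-9253}"
  if curl --silent --max-time 5 "http://localhost:${port}/metrics" > /dev/null 2>&1; then
    echo "metrics OK"
  else
    echo "metrics hung or unreachable"
  fi
}

check_metrics 9253
```

A healthy sidecar answers in milliseconds, so any probe that runs into the --max-time limit is a strong hang signal worth correlating with the memory graphs.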

Here are the CPU and memory graphs of a clean instance:
[CPU graph screenshot]
[memory graph screenshot]

Here are the CPU and memory graphs of an unhealthy instance:
[CPU graph screenshot]
[memory graph screenshot]

Timeline:

  • We can see the killing of php-fpm at 9:52
  • 40 minutes later, at 10:29, this exact instance stops working again; all 15 others have no problems
  • then I killed fpm at 10:55, which also restored it
  • … until 11:34 (again ~40 minutes, and still no problem on the other fpm-exporters)
  • then I killed both fpm and fpm-exporter at 12:06 (this time it did re-create the container inside the pod, and it took <10 MB)
  • it broke again at 12:08
  • I restarted fpm at 12:25 (I planned to stop here, but since the memory usage was low this time, I decided to try one last time 🤷‍♀️)
  • it worked until 13:23

Note: there are absolutely no relevant logs, only "connection refused" when restarting fpm. (I plan to test with more verbosity.)

Conf sample

---
kind: Deployment
#
spec:
  template:
    spec:
      containers:
        - name: fpm
          #
        - name: fpm-exporter
          env:
            - name: PHP_FPM_SCRAPE_URI
              value: tcp://127.0.0.1:9001/status  # 9001 is only used to expose the status
            - name: PHP_FPM_FIX_PROCESS_COUNT
              value: "true"
          image: hipages/php-fpm_exporter:2.2.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /
              port: fpm-metrics
              scheme: HTTP
          ports:
            - containerPort: 9253
              name: fpm-metrics
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /
              port: fpm-metrics
              scheme: HTTP
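On the Prometheus side, an explicit scrape_timeout at least keeps a hung /metrics from holding a scrape open indefinitely; a minimal sketch (the job name, interval, and timeout values are assumptions, not taken from our setup):

```yaml
scrape_configs:
  - job_name: fpm-exporter          # assumption: illustrative job name
    scrape_interval: 30s
    scrape_timeout: 10s             # give up on a hung /metrics instead of waiting
    static_configs:
      - targets: ["localhost:9253"]
```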

What can I do?

We can see that healthy instances take less than 10 MB of memory, but even after a restart the previously unhealthy instance still takes 60 MB; I would love to set a memory limit, but I'm unsure about setting it to something like 32 MB.
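If we do add a limit, it would go on the sidecar's resources block, something like the sketch below (the 64Mi/16Mi values are guesses leaving headroom above the ~10 MB healthy baseline, not recommendations):

```yaml
        - name: fpm-exporter
          # ...
          resources:
            requests:
              memory: "16Mi"   # assumption: roughly the healthy baseline
            limits:
              memory: "64Mi"   # assumption: headroom so only a true leak is OOM-killed
```

The upside is that a leaking instance gets OOM-killed and restarted automatically instead of growing unbounded.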

I could also point the liveness probe at /metrics, but I'm unsure whether that's a good idea.
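If we did move the liveness probe to /metrics, an explicit timeout would at least turn a hang into a container restart; a sketch (the timing numbers are assumptions):

```yaml
          livenessProbe:
            httpGet:
              path: /metrics
              port: fpm-metrics
              scheme: HTTP
            timeoutSeconds: 5     # a hung scrape fails the probe instead of blocking
            periodSeconds: 30
            failureThreshold: 3   # restart only after several consecutive failures
```

The trade-off is that a temporarily slow php-fpm status page could now restart an otherwise healthy exporter, which is why I'm hesitant.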

Ready to read your propositions folks 🙂

I enabled debug logging while reproducing the issue, but the logs aren't very helpful:

time="2022-09-12T15:00:00Z" level=debug msg="Pool[tcp://127.0.0.1:9001/status]: {\"pool\":\"www\",\"process manager\":\"dynamic\",\"start time\":1662994578,\"start since\":222,\"accepted conn\":23,\"listen queue\":0,\"max listen queue\":0,\"listen queue len\":511,\"idle processes\":2,\"active processes\":0,\"total processes\":2,\"max active processes\":1,\"max children reached\":0,\"slow requests\":0, \"processes\":[{\"pid\":15,\"state\":\"Idle\",\"start time\":1662994578,\"start since\":222,\"requests\":12,\"request duration\":20914,\"request method\":\"GET\",\"request uri\":\"/index.php\",\"content length\":0,\"user\":\"-\",\"script\":\"/app/public/index.php\",\"last request cpu\":95.63,\"last request memory\":4194304},{\"pid\":16,\"state\":\"Idle\",\"start time\":1662994578,\"start since\":222,\"requests\":11,\"request duration\":20035,\"request method\":\"GET\",\"request uri\":\"/index.php\",\"content length\":0,\"user\":\"-\",\"script\":\"/app/public/index.php\",\"last request cpu\":99.83,\"last request memory\":4194304}]}"
time="2022-09-12T15:00:00Z" level=debug msg="Updated 1 pool(s) in 2.813443ms"

I got only 12 of these logs on the failing instance, while on a normal fpm-exporter I got one of these logs on every scrape.