Monitoring watchdog for gearman

Question

Opened this issue 3 years ago · 8 comments

Hello!

My gearmand (1.1.15) sometimes freezes, and I want to monitor such cases and restart it. I have made a systemd service with a watchdog: it starts gearmand and a watchdog bash script that calls gearadmin --status every N seconds and sends a signal to systemd. That all works, but my gearmand uses persistent storage that sometimes weighs 20-30 GB, so gearmand needs time to read the data and initialize, and gearadmin --status does not work during that time. Is it possible to tell from outside gearmand that everything is OK with it at that moment?

Thanks in advance!

Answer 1 · 2021-06-16T19:52:52.000Z

Well, 1.1.15 is a very old version. The latest release is version 1.1.19.1. I suggest looking into upgrading.

That said, I doubt there have been any significant improvements to the persistent storage options. While not officially deprecated, active development on that feature has tailed off, and it is in maintenance mode. The consensus best practice instead is to implement persistent storage as tasks that your workers employ. There are two frameworks for implementing such a system: one is called Gearstore and the other is called Garavini. You might want to look into them. Just Google those words with Gearman to find more information. It's also easy to implement your own persistent storage tasks once you understand the design pattern.

Executing gearadmin --status often is not recommended. It locks certain data structures internal to gearmand and that locking can degrade performance. If you do it maybe once every hour, that's probably fine. Personally, I have implemented a "ping" task (all it does is return the string "pong") in my gearmand systems, and I use that as a health check for monitoring my gearmand Docker containers. I've been doing that for years now, and I've been happy with how well it has worked.
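A minimal sketch of such a ping worker, for illustration only. It uses just the Python standard library and speaks the Gearman binary protocol directly over a socket, whereas a real deployment would more likely use an existing Gearman client library. The helper names (request, read_packet, serve_ping) are assumptions made for this example and are not part of gearmand.

```python
import socket
import struct

# Packet type numbers from the Gearman binary protocol.
CAN_DO, PRE_SLEEP, NOOP = 1, 4, 6
GRAB_JOB, NO_JOB, JOB_ASSIGN, WORK_COMPLETE = 9, 10, 11, 13


def request(packet_type, body=b""):
    """Frame a worker-to-server packet: magic, type, body length, body."""
    return b"\0REQ" + struct.pack(">II", packet_type, len(body)) + body


def read_exact(sock, size):
    """Read exactly `size` bytes or raise if the peer closes early."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("connection closed")
        data += chunk
    return data


def read_packet(sock):
    """Return (packet_type, body) for the next server-to-worker packet."""
    magic, packet_type, size = struct.unpack(">4sII", read_exact(sock, 12))
    if magic != b"\0RES":
        raise ValueError("not a Gearman response packet")
    return packet_type, read_exact(sock, size)


def serve_ping(sock):
    """Register the 'ping' function and answer every job with b'pong'."""
    sock.sendall(request(CAN_DO, b"ping"))
    sock.sendall(request(GRAB_JOB))
    while True:
        packet_type, body = read_packet(sock)
        if packet_type == JOB_ASSIGN:
            # Body layout: job handle, NUL, function name, NUL, payload.
            handle = body.split(b"\0", 1)[0]
            sock.sendall(request(WORK_COMPLETE, handle + b"\0pong"))
            sock.sendall(request(GRAB_JOB))
        elif packet_type == NO_JOB:
            # Nothing queued: ask the server to wake us with a NOOP.
            sock.sendall(request(PRE_SLEEP))
        elif packet_type == NOOP:
            sock.sendall(request(GRAB_JOB))
```

Assuming gearmand listens on its default port, something like serve_ping(socket.create_connection(("127.0.0.1", 4730))) would run the worker until the connection drops.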

Your gearmand process weighing 20-30 GB sounds like the real problem. I don't use gearmand's persistent storage feature personally, but that could be a memory leak. As a bandage, you might want to consider restarting your gearmand process periodically (like daily perhaps). If you find a memory leak in gearmand, patches and/or more detailed information are welcome, of course.

Answer 2 · 2021-06-28T08:55:57.000Z

@esabol

Apologies for interrupting this thread (he he). Do you have a reference (documentation, code, whatnot) for this ping task and what it actually validates within the daemon? I am currently at a location where gearmand status is being used as a health check, and it would be great to identify a suitable replacement.

Answer 3 · 2021-06-28T20:40:57.000Z

@anderslauri asked:

Do you have a reference (documentation, code, whatnot) for this ping task and what it actually validates within the daemon? I am currently at a location where gearmand status is being used as a health check, and it would be great to identify a suitable replacement.

It's just a task that returns the string "pong". I believe I said that. If you already have workers who have registered other tasks with gearmand, it couldn't be more trivial to implement. It's basically just like the reverse string example, but it's even simpler. If the job doesn't return "pong", the healthcheck fails and Docker will restart my gearmand container.

HEALTHCHECK --interval=5m --timeout=3s --retries=2 \
        CMD test $(/path/to/ping_test | grep -c 'pong') -eq 1 || exit 1

ping_test is just a synchronous client that submits the job 'ping' to the gearmand server and prints the output that is returned. It tests the whole system: whether gearmand is accepting jobs, whether workers are processing those jobs, and whether gearmand is successfully returning the output of jobs to clients. That seems more comprehensive to me than executing and checking the output of gearadmin --status. It does rely on your workers being up and operating, also, so it probably works best if your workers run inside the same Docker container as gearmand since the healthcheck really encompasses both the server and the workers.
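A rough sketch of what a ping_test-style client could look like, not the actual program used above. It relies only on the Python standard library and talks the Gearman binary protocol directly; a client library would normally hide these details. The names submit_and_wait and ping, and the default host, port, and timeout, are assumptions chosen for this example.

```python
import socket
import struct
import uuid

# Packet type numbers from the Gearman binary protocol.
SUBMIT_JOB, WORK_COMPLETE, WORK_FAIL, ERROR = 7, 13, 14, 19


def build_request(packet_type, body):
    """Frame a client-to-server packet: magic, type, body length, body."""
    return b"\0REQ" + struct.pack(">II", packet_type, len(body)) + body


def read_exact(sock, size):
    """Read exactly `size` bytes or raise if the server closes early."""
    data = b""
    while len(data) < size:
        chunk = sock.recv(size - len(data))
        if not chunk:
            raise ConnectionError("connection closed")
        data += chunk
    return data


def submit_and_wait(sock, function=b"ping"):
    """Submit one foreground job with an empty payload; return its result."""
    unique = uuid.uuid4().hex.encode()
    # SUBMIT_JOB body: function name, NUL, unique id, NUL, payload.
    sock.sendall(build_request(SUBMIT_JOB, function + b"\0" + unique + b"\0"))
    while True:
        header = read_exact(sock, 12)
        magic, packet_type, size = struct.unpack(">4sII", header)
        if magic != b"\0RES":
            raise ValueError("not a Gearman response packet")
        body = read_exact(sock, size)
        if packet_type == WORK_COMPLETE:
            # Body layout: job handle, NUL, result data.
            return body.partition(b"\0")[2]
        if packet_type in (WORK_FAIL, ERROR):
            raise RuntimeError("ping job failed")
        # Other packets (JOB_CREATED, status updates) are skipped.


def ping(host="127.0.0.1", port=4730, timeout=3.0):
    """Connect to gearmand and return whatever the 'ping' job returns."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return submit_and_wait(sock)
```

A small wrapper that prints ping().decode() would then play the role of ping_test in the HEALTHCHECK line: no "pong" in the output, or an exception from a timeout or refused connection, means the check fails.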

Answer 4 · 2021-06-30T07:01:59.000Z

@esabol

That is great, thank you.

Answer 5 · 2022-07-12T22:45:43.000Z

If you are using Gearman in a container, well, I just wouldn't. I would never recommend it: in many cases Gearman is as important to what you are doing as the OS, and all containers rely on some other service being active before they can be started on reboot. So you are asking for problems if you go that way with it.

But anyway, you mention systemd. You can set the jobs up in systemd to recover from any timeouts and to start a bunch of workers as needed. An example looks like this:

gearman@{1..5}.service

Then you just set WantedBy/target in systemd and scope it out properly. To be fair, systemd is pretty rock solid, so you can't go far wrong as long as you feed it the proper info and, if it is important, avoid containers.
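For the systemd watchdog mentioned in the original question, one possible shape for the helper is sketched below. Instead of signalling unconditionally, it sends WATCHDOG=1 to the socket named in NOTIFY_SOCKET only while a health check passes, and it sends READY=1 after the first successful check. This assumes a unit with Type=notify and WatchdogSec= that accepts notifications from the helper process (for example via NotifyAccess=all). The names watchdog_loop, check, and max_rounds are made up for this example.

```python
import os
import socket
import time


def sd_notify(message, address=None):
    """Send one notification datagram to systemd; False if no socket is set."""
    address = address or os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def watchdog_loop(check, interval, notify=sd_notify, sleep=time.sleep,
                  max_rounds=None):
    """Call `check` every `interval` seconds; notify only while it passes.

    `max_rounds` exists so the loop can be stopped in tests; leave it as
    None to run forever.
    """
    ready = False
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        if check():
            if not ready:
                # First healthy result: report that start-up has finished.
                notify("READY=1")
                ready = True
            notify("WATCHDOG=1")
        sleep(interval)
        rounds += 1
```

Here check can be any callable that returns True when gearmand is healthy, such as one that submits a ping job. Because systemd only arms the watchdog once start-up is reported complete, a long initialization would be governed by TimeoutStartSec= rather than WatchdogSec=, so that value would need to be generous enough for the persistent queue to load.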

Answer 6 · 2022-07-12T23:26:20.000Z

As an example for systemd, I put a delayed start on the workers to make sure everything is available first:

[Unit]
Description=Gearman Worker Daemon
After=gearmand.service mariadb.service
Requires=gearmand.service mariadb.service

[Service]
User=whateveruser
Type=simple
PIDFile=/wherever/gpworker%i.pid

KillMode=process
Restart=on-failure

ExecStart=/your-worker gearman_worker

[Install]
WantedBy=multi-user.target

Answer 7 · 2022-07-13T08:03:59.000Z

@tomcoxcreative wrote:

If you are using Gearman in a container, well, I just wouldn't. I would never recommend it: in many cases Gearman is as important to what you are doing as the OS, and all containers rely on some other service being active before they can be started on reboot. So you are asking for problems if you go that way with it.

I couldn't disagree more. I've been using gearmand in a Docker container on a production system for 6 years or so, and I wouldn't use gearmand any other way. I think container technology is here to stay in modern IT infrastructure and operations. The advantages are too great, and I personally haven't encountered any downsides.

Answer 8 · 2022-07-13T08:16:51.000Z

@esabol that's interesting. I may need to reconsider my way of thinking around containers in that case. I've always avoided using them where possible.