ActionRunners get stuck when there is transient connection loss to RabbitMQ/ZooKeeper
sravs-dev opened this issue
SUMMARY
We run an st2 HA setup in a Kubernetes environment, with ZooKeeper as the coordination backend. We observed that actionrunners/schedulers/workflowengines hang when there are transient connectivity issues with MongoDB/RabbitMQ/ZooKeeper.
The MongoDB/RabbitMQ/ZooKeeper containers get restarted during Kubernetes maintenance operations.
STACKSTORM VERSION
st2 3.7.0, on Python 3.6.8
OS, environment, install method
st2 Helm charts on Kubernetes, with a CentOS base image.
Steps to reproduce the problem
Introduce connectivity errors by restarting RabbitMQ/MongoDB.
The st2 services lose their connection to RabbitMQ or MongoDB and try to reconnect automatically. Once the retry count is exceeded, the service hangs even after RabbitMQ/MongoDB comes back up.
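The stuck state described above can be illustrated with a minimal sketch (the function and names here are hypothetical, not st2 code): a bounded retry loop that gives up permanently once its budget is exhausted, so the service never reconnects even when the broker recovers shortly afterwards.

```python
import time

def connect_with_retries(connect, max_retries=3, backoff=0.01):
    """Retry `connect` with linear backoff, giving up after `max_retries`."""
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            time.sleep(backoff * attempt)
    # Retry budget exhausted: no further attempts are ever made, so the
    # service sits without a connection even after the broker recovers.
    return None

# Simulate a broker that is down for the first 5 attempts, then healthy.
attempts = {"count": 0}

def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] <= 5:
        raise ConnectionError("broker unreachable")
    return "connected"

conn = connect_with_retries(flaky_connect, max_retries=3)
print(conn)  # None -- the broker would have recovered on attempt 6
```

With a larger (or configurable) retry budget, the same loop would have succeeded on the sixth attempt.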
Expected Results
The number of retries and the backoff time between retries should be configurable, so that they can be tuned per st2 deployment.
Or
st2 services should be able to exit on connectivity failures once the retry threshold is reached. In a Kubernetes environment, the container is automatically restarted by Kubernetes when the process with PID 1 dies.
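The second option above can be sketched as follows. This is not st2 code; the function and parameter names are hypothetical and only show the intended shape: bounded retries with backoff, then a deliberate non-zero exit so the supervisor (kubelet) restarts the container instead of leaving a hung process.

```python
import sys
import time

def connect_or_exit(connect, max_retries=5, backoff_base=0.01, exit_on_error=True):
    """Try to connect with capped exponential backoff; optionally exit the
    process once the retry threshold is reached so a supervisor such as
    kubelet can restart the container rather than leaving a hung service."""
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            # Exponential backoff, capped so waits stay bounded.
            time.sleep(min(backoff_base * (2 ** attempt), 1.0))
    if exit_on_error:
        # A non-zero exit from PID 1 lets Kubernetes restart the pod.
        sys.exit(1)
    raise ConnectionError("gave up after %d retries" % max_retries)

def always_down():
    raise ConnectionError("broker unreachable")

try:
    connect_or_exit(always_down, max_retries=3, exit_on_error=True)
except SystemExit as exc:
    print("process would exit with code", exc.code)
```

With `exit_on_error=False` the caller instead receives an exception and can decide what to do, which preserves today's behavior for non-containerized deployments.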
Actual Results
The st2 services (actionrunner, scheduler, workflowengine, rulesengine) hang and are unable to serve traffic. A manual restart of these services is needed to resolve the issue.
In an HA setup, restarting all services takes about 15-20 minutes, which amounts to an outage.
Recommendation
The RabbitMQ errors appear to originate here: https://github.com/StackStorm/st2/blob/master/st2common/st2common/transport/consumers.py#L197. An exit_on_error option could be exposed in st2.conf, defaulting to false.
A similar setting for MongoDB would help. I would like to hear the maintainers' thoughts, and I am happy to implement and test this with some guidance.
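For illustration only, the proposal could look like the following st2.conf fragment. The exit_on_error option names are hypothetical (they do not exist in st2 today) and only show the shape of the suggested change.

```ini
# Illustrative st2.conf sketch -- exit_on_error is a hypothetical option,
# shown here only to describe the proposal.
[messaging]
# Exit the process (instead of hanging) once reconnect retries are
# exhausted, so Kubernetes restarts the container.
# Default false would keep today's behavior.
exit_on_error = false

[database]
# Analogous setting for MongoDB connectivity failures.
exit_on_error = false
```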