ActionRunners get stuck when there is transient connection loss to RabbitMQ/ZooKeeper
sravs-dev opened this issue
SUMMARY
We run an st2 HA setup in a Kubernetes environment, with ZooKeeper as the coordination backend. We observed that actionrunners/schedulers/workflowengines hang when there are transient connectivity issues with MongoDB/RabbitMQ/ZooKeeper.
The MongoDB/RabbitMQ/ZooKeeper containers get restarted during Kubernetes maintenance operations.
STACKSTORM VERSION
st2 3.7.0, on Python 3.6.8
OS, environment, install method
st2 Helm charts on Kubernetes, with a CentOS base image.
Steps to reproduce the problem
Introduce connectivity errors by restarting RabbitMQ/MongoDB.
The st2 services lose their connection to RabbitMQ or MongoDB and try to reconnect automatically. Once the retry count is exceeded, the service hangs even after RabbitMQ/MongoDB comes back up.
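The stuck state described above can be illustrated with a minimal sketch (the function and names here are hypothetical, not st2 code): a bounded retry loop that gives up permanently once its budget is exhausted, so the service never reconnects even when the broker recovers shortly afterwards.

```python
import time

def connect_with_retries(connect, max_retries=3, backoff=0.01):
    """Retry `connect` with linear backoff, giving up after `max_retries`."""
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            time.sleep(backoff * attempt)
    # Retry budget exhausted: no further attempts are ever made, so the
    # service sits without a connection even after the broker recovers.
    return None

# Simulate a broker that is down for the first 5 attempts, then healthy.
attempts = {"count": 0}

def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] <= 5:
        raise ConnectionError("broker unreachable")
    return "connected"

conn = connect_with_retries(flaky_connect, max_retries=3)
print(conn)  # None -- the broker would have recovered on attempt 6
```

With a larger (or configurable) retry budget, the same loop would have succeeded on the sixth attempt.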
Expected Results
The number of retries and the backoff time between retries should be configurable, so that they can be tuned per st2 deployment.
Or
st2 services should be able to exit on connectivity failures once the retry threshold is reached. In a Kubernetes environment, the container is automatically restarted by Kubernetes when the process with PID 1 dies.
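The second option above can be sketched as follows. This is not st2 code; the function and parameter names are hypothetical and only show the intended shape: bounded retries with backoff, then a deliberate non-zero exit so the supervisor (kubelet) restarts the container instead of leaving a hung process.

```python
import sys
import time

def connect_or_exit(connect, max_retries=5, backoff_base=0.01, exit_on_error=True):
    """Try to connect with capped exponential backoff; optionally exit the
    process once the retry threshold is reached so a supervisor such as
    kubelet can restart the container rather than leaving a hung service."""
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            # Exponential backoff, capped so waits stay bounded.
            time.sleep(min(backoff_base * (2 ** attempt), 1.0))
    if exit_on_error:
        # A non-zero exit from PID 1 lets Kubernetes restart the pod.
        sys.exit(1)
    raise ConnectionError("gave up after %d retries" % max_retries)

def always_down():
    raise ConnectionError("broker unreachable")

try:
    connect_or_exit(always_down, max_retries=3, exit_on_error=True)
except SystemExit as exc:
    print("process would exit with code", exc.code)
```

With `exit_on_error=False` the caller instead receives an exception and can decide what to do, which preserves today's behavior for non-containerized deployments.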
Actual Results
The st2 services (actionrunner, scheduler, workflowengine, rulesengine) hang and are unable to serve traffic. A manual restart of these services is needed to resolve the issue.
In an HA setup, restarting all services takes about 15-20 minutes, which amounts to an outage.
Recommendation
The RabbitMQ errors appear to originate here: https://github.com/StackStorm/st2/blob/master/st2common/st2common/transport/consumers.py#L197. An exit_on_error option could be exposed in st2.conf, defaulting to false.
A similar setting for MongoDB would help. I would like to hear the maintainers' thoughts, and I am happy to implement and test this with some guidance.
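For illustration only, the proposal could look like the following st2.conf fragment. The exit_on_error option names are hypothetical (they do not exist in st2 today) and only show the shape of the suggested change.

```ini
# Illustrative st2.conf sketch -- exit_on_error is a hypothetical option,
# shown here only to describe the proposal.
[messaging]
# Exit the process (instead of hanging) once reconnect retries are
# exhausted, so Kubernetes restarts the container.
# Default false would keep today's behavior.
exit_on_error = false

[database]
# Analogous setting for MongoDB connectivity failures.
exit_on_error = false
```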