StackStorm/st2

Execution rate goes down considerably when the outstanding requests are high

rajg23 opened this issue · 5 comments

rajg23 commented

SUMMARY

As part of scale testing, I observed that the execution rate drops considerably when the number of outstanding requests is high.

STACKSTORM VERSION

st2 3.8.0, on Python 3.8.10

OS, environment, install method

Kubernetes HA, installed through helm install (one-liner)

Steps to reproduce the problem

Scale up the action pods to a certain number, e.g. 17. The default number of available action threads is then 17x60.

Each action uses the Python runner with a sleep of 5 sec.

1000 requests, initiated via Python threading (HTTP API calls), complete in 77 sec.

Based on above numbers, the test case - [ Send 2500 req with sleep of 250sec, and 500 req with sleep of 250sec ] in loop
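For anyone trying to reproduce this, the submission side of the test could look roughly like the sketch below. It is a minimal sketch, not the script used in this report: `fire_requests` and `fake_send` are hypothetical names, and `fake_send` is a stand-in for the real HTTP API call that creates an execution. Only the parameter names `sleep_time` and `scale_tag` come from the action in this issue.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fire_requests(send, count, sleep_time, workers=50):
    """Call send(sleep_time, tag) `count` times from a thread pool.

    Returns (results, elapsed_seconds). The elapsed time covers only the
    submission of the requests, not the completion of the executions.
    """
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(send, sleep_time, f"req-{i}") for i in range(count)
        ]
        results = [future.result() for future in futures]
    return results, time.monotonic() - started


def fake_send(sleep_time, scale_tag):
    # Hypothetical stand-in: a real sender would make the HTTP API call
    # that creates an execution of the sleep action with these parameters.
    return {"sleep_time": sleep_time, "scale_tag": scale_tag}


# Small count so the sketch runs instantly; the test case uses 2500 and 500.
results, elapsed = fire_requests(fake_send, count=100, sleep_time=250)
print(len(results), "requests submitted in", round(elapsed, 3), "sec")
```

Keeping the sender injectable makes it possible to check the loop logic without a running cluster before pointing it at the real API.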

Expected Results

Given enough time for executions based on above numbers, there should not be any outstanding requests after each cycle
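The expectation above can be made concrete with a rough lower bound. The sketch below assumes every one of the 17x60 action threads is busy whenever work is pending and ignores all scheduling and API overhead, so real runs will take longer; `min_drain_seconds` is a hypothetical helper name.

```python
import math


def min_drain_seconds(requests, sleep_s, pods=17, threads_per_pod=60):
    """Idealized time to finish `requests` actions that each sleep `sleep_s`.

    Assumes all pods * threads_per_pod slots run in parallel and that
    scheduling overhead is zero, so this is a lower bound only.
    """
    slots = pods * threads_per_pod          # 17 * 60 = 1020 concurrent actions
    waves = math.ceil(requests / slots)     # full "rounds" of sleeping actions
    return waves * sleep_s


print(min_drain_seconds(2500, 250))  # 3 waves of 250 sec -> 750
print(min_drain_seconds(500, 250))   # 1 wave of 250 sec -> 250
print(min_drain_seconds(1000, 5))    # 1 wave of 5 sec -> 5
```

Under these assumptions, 1000 five-second actions need at least 5 sec, while the measured time was 77 sec, which suggests that per-request overhead rather than raw thread capacity dominates at this scale.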

Actual Results

However, there are pending actions at the end of each cycle (observed in the requested state). This count grows after a few cycles. The execution rate, which starts at 17x60, then drops to single digits (as the outstanding count goes from 5k to 20k and more).

I observed the same results with a clean MongoDB and with one holding 5+ million docs.

st2scheduler does not process requests, and hence neither does the actionrunner. Resource usage is minimal when the execution rate slows down.

Is there a workaround to force the scheduler to take up more requests? When more requests are pending, the pods should work harder, but here it goes the other way.

Thanks!

How are you making the Python runner sleep? Is it leveraging the eventlet/green thread capabilities to yield while sleeping?

Can you show the action that you're running to test?

What is a cycle here?

rajg23 commented

> How are you making the Python runner sleep? Is it leveraging the eventlet/green thread capabilities to yield while sleeping?
>
> Can you show the action that you're running to test?

I have tried both the shell and Python runners for the sleep, with the same results. I have not used green threads, as StackStorm should treat each remediation as a green thread (that is, the sleep in a remediation should not interfere with the OS-level sleep).

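It is a simple action:

```
import time

from st2common.runners.base_action import Action


class scale_sleep(Action):
    def run(self, sleep_time, scale_tag):
        # Plain blocking sleep; no explicit eventlet/green-thread yield.
        time.sleep(int(sleep_time))
        # Echo the tag back so each execution can be identified.
        return scale_tag
```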

> What is a cycle here?

A cycle is one full loop, as described here: " [ Send 2500 req with sleep of 250sec, and 500 req with sleep of 250sec ] in loop "

If the scheduler really is the bottleneck, try increasing execution_scheduling_timeout_threshold to prevent the rescheduling of actions that have not started running in time.
I am guessing the scheduler is rescheduling actions that haven't quite started yet.
Also, increase the pool size so that it schedules more.

Generally speaking, if it is the scheduler, adjust its settings to get the throughput you need.

I have noticed double runs of actions in high-request-volume situations. Setting execution_scheduling_timeout_threshold to 60 minutes or longer takes care of that.
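As a rough illustration of the two settings mentioned above, the sketch below parses a hypothetical st2.conf fragment with Python's `configparser`. The option names `execution_scheduling_timeout_threshold_min` and `pool_size` under `[scheduler]` are an assumption based on the sample st2.conf and should be checked against your StackStorm version. The 60-minute value comes from the comment above; the pool size of 30 is purely illustrative.

```python
import configparser

# Hypothetical st2.conf fragment. Option names are assumed from the sample
# st2.conf; verify them for your version before applying anything.
EXAMPLE_CONF = """
[scheduler]
# Minutes a request may stay in scheduling before it is treated as timed out.
execution_scheduling_timeout_threshold_min = 60
# Size of the pool the scheduler uses to schedule executions (illustrative).
pool_size = 30
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONF)

threshold_min = parser.getint(
    "scheduler", "execution_scheduling_timeout_threshold_min"
)
pool_size = parser.getint("scheduler", "pool_size")
print("threshold (min):", threshold_min, "pool size:", pool_size)
```

On a Helm-based install, settings like these would normally be supplied through the chart's values rather than by editing st2.conf inside a pod.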