opsgenie/opsgenie-python-sdk

Problems using Configurable Retry Policy

Closed this issue · 5 comments

asqui commented

Looking at the documentation on Configurable Retry Policy Support:

  • retry_count is documented as "Number of times a request is tried upon encountering a specific response code, such as 429." however I don't think the implementation respects this -- I believe it will make a total of retry_count attempts across any error type.
  • retry_delay and retry_max_delay documentation does not make it clear that these parameters control the full jitter algorithm. Also, they probably don't need to be integers.
  • In particular, retry_max_delay is documented as "The upper limit for the amount of time, in seconds, for which the retries will be attempted for." which is not respected, but would be useful to have. (i.e. I think you should add tenacity.stop_after_delay(configuration.retry_max_delay) to the Retrying(stop=...) argument, so that there is a way to limit the total amount of time spent retrying.)

Separately, it doesn't look like metrics are getting published during retries because opsgenie_sdk.rest.RESTClientObject.request() does not publish HTTP Metrics in the scenario where pool_manager.request() raises an exception. This is making it harder for me to debug what is going on.

asqui commented

Additional observations:

  • The Configuration object has an undocumented retries argument which (I think) is unused, but comes up in intellisense auto-complete lists.
  • The back_off parameter is not prefixed with retry_ like all the rest of the parameters for retry logic. retry_back_off might be more intuitive (and easier to find in intellisense auto-complete lists)
asqui commented
  • It's not possible to disable retry logic by setting retry_enabled = False because this parameter is quietly being ignored due to a bug. The and below needs to be an & for this to work as intended.

retry=(tenacity.retry_if_result(self.is_retry_enabled) and
((tenacity.retry_if_exception_type(RetryableException)) |
(tenacity.retry_if_exception_type(HTTPError)))))

Hi, thanks a bunch for pointing out all of this and sorry for the late response.

I have changed the explanations of the retry_count and retry_delay in the documentation to better align with what they actually do.

The bug which was preventing the disablement of retrying will be fixed, by implementing a custom callback function for deciding whether the retry should stop or not, when #29 is released.
It looks like that Tenacity doesn't support and or & operator in the retry keyword argument because when we changed to & from and, the retrying was stopped but wasn't working with it enabled.

I will remove the unused retries parameter in the Configuration object in 17f99e2

I agree that the back_off parameter of the Configuration object should be renamed to retry_back_off, but doing so might introduce some breaking changes for some customers. So, I will be postponing it to the next major release perhaps.

And I am looking into why the metrics are not being published when pool_manager.request() raises an exception at the moment

Coming back to metrics' publishing, http metrics are published only after an http request is completed. However, to make debugging easier, I have added a change to publish sdk metrics whenever an exception is thrown while attempting to call the api in b382348

zfr commented

I'm closing the ticket now. Thanks for the collaborative work guys. If we need an another improvement later, let's create a new issue then.