elastic/rally

Rally can crash if operation(s) return too many consecutive errors

b-deam opened this issue · 1 comment

b-deam commented

When Rally is invoked with --on-error=continue, the intention is to continue executing the benchmark in spite of any errors returned by an operation. However, if an operation receives too many consecutive errors from the target cluster, we can end up prematurely crashing one or all of the Worker processes.

For example:

Running bulk-index-initial-load                                                [ 38% done]
[ERROR] Cannot race. Error in load generator [7]                                                                                        
        Cannot run task [bulk-index-initial-load]: int too large to convert to float                             
                                                                                                                 
Getting further help:                                                                                            
*********************                                                                                            
* Check the log files in /home/esbench/.rally/logs for errors.                                                   
* Read the documentation at https://esrally.readthedocs.io/en/latest/.                                           
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.            
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/esbench/.rally/logs.
                                                                                                                 
-----------------------------------                                                                              
[INFO] FAILURE (took 81554 seconds)                                                                              
-----------------------------------                     

If we look at the logs, we can see they are filled with messages from the underlying elastic_transport package:

$ cat ~/.rally/logs/rally.log | grep -E 'WARNING Node|WARNING Retrying' | wc -l
76113

These logs look like this:

2023-08-21 10:31:24,552 ActorAddr-(T|:38145)/PID:20707 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://my-elasticsearch-cluster:443)> has failed for 275 times in a row, putting on 30 second timeout
2023-08-21 10:31:24,551 ActorAddr-(T|:41239)/PID:20721 elastic_transport.transport WARNING Retrying request after non-successful status 429 (attempt 0 of 3)
2023-08-21 10:31:24,552 ActorAddr-(T|:38145)/PID:20707 elastic_transport.transport WARNING Retrying request after non-successful status 429 (attempt 1 of 3)

The int too large to convert to float exception is actually generated in the elastic_transport package:

2023-08-18 11:36:04,278 ActorAddr-(T|:43513)/PID:3303 esrally.actor ERROR Worker[7] has detected a benchmark failure. Notifying master...
Traceback (most recent call last):

  File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1745, in __call__
    loop.run_until_complete(self.run())

  File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1797, in run
    _ = await asyncio.gather(*awaitables)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 1962, in __call__
    raise exceptions.RallyError(f"Cannot run task [{self.task}]: {e}") from None

esrally.exceptions.RallyError: Cannot run task [bulk-index-initial-load]: int too large to convert to float
--
    return await self.delegate(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 406, in __call__
    return_value = await self.delegate(*args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 339, in __call__
    return await self.delegate(self.client_extractor(args[0]), *args[1:])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/runner.py", line 526, in __call__
    response = await es.bulk(params=bulk_params, **api_kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/_async/client/__init__.py", line 705, in bulk
    return await self.perform_request(  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/client/asynchronous.py", line 353, in perform_request
    meta, resp_body = await self.transport.perform_request(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_async_transport.py", line 349, in perform_request
    self.node_pool.mark_dead(node)

  File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_node_pool.py", line 242, in mark_dead
    self._dead_node_backoff_factor * (2 ** (consecutive_failures - 1)),
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OverflowError: int too large to convert to float

The RallyAsyncElasticsearch class exposes three parameters that control whether the underlying transport retries a request at all, and how many times, and passes them through to RallyAsyncTransport:

max_retries
retry_on_status
retry_on_timeout

By default we will only retry a request 3 times, per the AsyncTransport defaults. However, we keep a cumulative counter of consecutive failed requests for a given node in _dead_consecutive_failures. This means that this section of code in _node_pool.py calculates a backoff timeout from _dead_node_backoff_factor that will overflow once we've had 1025 consecutive failures:

        timeout = min(
            self._dead_node_backoff_factor * (2 ** (consecutive_failures - 1)),
            self._max_dead_node_backoff,
        )

Example:

>>> for i in range(1000,1100):
...   print(i)
...   1.0 * ( 2 ** (i-1))

[...]
1025
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
OverflowError: int too large to convert to float

The unbounded exponent in the backoff calculation is the problem here: Python evaluates both arguments to min() before applying the cap, so the exponentiation overflows before _max_dead_node_backoff can bound the result.
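One possible fix (a sketch only, not the upstream patch) is to clamp the exponent before evaluating it; since the result is capped at the maximum backoff anyway, nothing is lost. The factor of 1.0 and the 30 second cap below are assumptions taken from the log line above:

```python
# Sketch: overflow-safe variant of the mark_dead backoff calculation.
# backoff_factor / max_backoff mirror _dead_node_backoff_factor /
# _max_dead_node_backoff; the values are assumptions for illustration.
def dead_node_timeout(
    consecutive_failures: int,
    backoff_factor: float = 1.0,
    max_backoff: float = 30.0,
) -> float:
    # Once backoff_factor * 2**(n - 1) exceeds max_backoff the result is
    # max_backoff regardless, so cap the exponent instead of letting it
    # grow without bound and overflow the float conversion.
    exponent = min(consecutive_failures - 1, 32)
    return min(backoff_factor * (2 ** exponent), max_backoff)

# 1025 consecutive failures no longer raises OverflowError:
print(dead_node_timeout(1025))  # 30.0
```

An equivalent alternative is to reset or cap the consecutive-failure counter itself once a node has already reached the maximum timeout.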