Rally can crash if operation(s) return too many consecutive errors
b-deam opened this issue · 1 comments
If Rally is invoked with --on-error=continue
then the intention is to persist with the execution of the benchmark in spite of any errors returned by an operation, however, if the operation receives too many errors as a result of a request to the target cluster we can end up crashing one or all of the Worker processes prematurely.
For example:
Running bulk-index-initial-load [ 38% done]
[ERROR] Cannot race. Error in load generator [7]
Cannot run task [bulk-index-initial-load]: int too large to convert to float
Getting further help:
*********************
* Check the log files in /home/esbench/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/latest/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/esbench/.rally/log
s.
-----------------------------------
[INFO] FAILURE (took 81554 seconds)
-----------------------------------
If we look at the logs we can see that the logs are filled with messages from the underlying elastic_transport
package:
$ cat ~/.rally/logs/rally.log | grep -E 'WARNING Node|WARNING Retrying' | wc -l
76113
These logs look like this:
2023-08-21 10:31:24,552 ActorAddr-(T|:38145)/PID:20707 elastic_transport.node_pool WARNING Node <RallyAiohttpHttpNode(https://my-elasticsearch-cluster:443)> has failed for 275 times in a row, putting on 30 second timeout
2023-08-21 10:31:24,551 ActorAddr-(T|:41239)/PID:20721 elastic_transport.transport WARNING Retrying request after non-successful status 429 (attempt 0 of 3)
2023-08-21 10:31:24,552 ActorAddr-(T|:38145)/PID:20707 elastic_transport.transport WARNING Retrying request after non-successful status 429 (attempt 1 of 3)
The int too large to convert to float
exception is actually generated in the elastic_transport
package:
2023-08-18 11:36:04,278 ActorAddr-(T|:43513)/PID:3303 esrally.actor ERROR Worker[7] has detected a benchmark fail
ure. Notifying master...
Traceback (most recent call last):
File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 1745, in __call__
loop.run_until_complete(self.run())
File "/home/esbench/.pyenv/versions/3.11.4/lib/python3.11/asyncio/base_events.py", line 653, in run_until_comp$
ete
return future.result()
^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 1797, in run
_ = await asyncio.gather(*awaitables)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/driver.py", line 1962, in __call__
raise exceptions.RallyError(f"Cannot run task [{self.task}]: {e}") from None
esrally.exceptions.RallyError: Cannot run task [bulk-index-initial-load]: int too large to convert to float
--
return await self.delegate(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 406, in __call__
return_value = await self.delegate(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 339, in __call__
return await self.delegate(self.client_extractor(args[0]), *args[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/driver/runner.py", line 526, in __call__
response = await es.bulk(params=bulk_params, **api_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/_async/client/__init__.py", line 705, in
bulk
return await self.perform_request( # type: ignore[return-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/rally/esrally/client/asynchronous.py", line 353, in perform_request
meta, resp_body = await self.transport.perform_request(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_async_transport.py", line 349, in pe
rform_request
self.node_pool.mark_dead(node)
File "/home/esbench/.local/lib/python3.11/site-packages/elastic_transport/_node_pool.py", line 242, in mark_dea
d
self._dead_node_backoff_factor * (2 ** (consecutive_failures - 1)),
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OverflowError: int too large to convert to float
The RallyAsyncElasticsearch
class exposes three parameters related to whether we should even let the underlying transport retry, or how many retries, and passes them into RallyAsyncTransport
:
max_retries
retry_on_status
retry_on_timeout
By default we will only retry a request 3 times according to the AsyncTransport
defaults, however, the we keep a cumulative counter of all failed requests for a given node in _dead_consecutive_failures
. This means that this section of code in _node_pool.py
calculates a _dead_node_backoff_factor
that will overflow once we've had 1025
consecutive failures:
timeout = min(
self._dead_node_backoff_factor * (2 ** (consecutive_failures - 1)),
self._max_dead_node_backoff,
)
Example:
>>> for i in range(1000,1100):
... print(i)
... 1.0 * ( 2 ** (i-1))
[...]
1025
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
OverflowError: int too large to convert to float
The exponential backoff algorithm is the problem here.
I raised elastic/elastic-transport-python#111