Retry logic can incorrectly repeat execution in non-http connections
csringhofer opened this issue · 0 comments
Found this during the investigation of #547:
- the slow DROP TABLE in Impala led to hitting the 5 sec timeout in the test
- note that the timeout was not caused by network, but simply the time taken by the synchronous execute_statement
- in reality 2 DROP TABLE statements were executed in Impala:
- the 1. failing with timeout
- the 2. failing with "table doesn't exist", as the 1. was actually executed by Impala, only the client timeouted
Was also able to reproduce it with INSERT statements, sometimes even executing it 3 times, so I consider this to be a very serious correctness issue.
Not sure if other (network) errors can also trigger this, or only timeouts.
The bug is in ThriftRpc._execute():
Line 1167 in 9a853c1
The intention seems to be to only retry connection failures, but exceptions from the actual RPC also lead to retry, which is not ok if the RPC is not idempotent, as coordinator may be already executing the statement at this point.
http connections only retry idempotent RPCs (#424), so the issue doesn't happen there.
The issue seems to be in Impala since a very long time - the problematic code was added in #153, but my impression is that the bug existed even before the refactor. It is possible that bumping Thrift versions also affected this by changing the type of exception thrown. Only checked with 0.20a1 and 0.19.0 so far.