cloudera/impyla

Retry logic can incorrectly repeat execution in non-http connections

csringhofer opened this issue · 0 comments

Found this during the investigation of #547:

  • the slow DROP TABLE in Impala led to hitting the 5 sec timeout in the test
    • note that the timeout was not caused by network, but simply the time taken by the synchronous execute_statement
  • in reality 2 DROP TABLE statements were executed in Impala:
    • the 1. failing with timeout
    • the 2. failing with "table doesn't exist", as the 1. was actually executed by Impala, only the client timeouted

Was also able to reproduce it with INSERT statements, sometimes even executing it 3 times, so I consider this to be a very serious correctness issue.
Not sure if other (network) errors can also trigger this, or only timeouts.

The bug is in ThriftRpc._execute():

return func(request)

The intention seems to be to only retry connection failures, but exceptions from the actual RPC also lead to retry, which is not ok if the RPC is not idempotent, as coordinator may be already executing the statement at this point.

http connections only retry idempotent RPCs (#424), so the issue doesn't happen there.

The issue seems to be in Impala since a very long time - the problematic code was added in #153, but my impression is that the bug existed even before the refactor. It is possible that bumping Thrift versions also affected this by changing the type of exception thrown. Only checked with 0.20a1 and 0.19.0 so far.