googleapis/python-spanner

database.batch does not retry aborted transactions

olavloite opened this issue · 4 comments

The standard example for writing data with mutations uses database.batch: https://cloud.google.com/spanner/docs/getting-started/python#write-data-with-mutations

However, database.batch does not automatically retry the transaction if Spanner aborts it. This causes errors when you use this method to insert a large amount of data, or when there is lock contention on the data that you insert.

One or both of the following should happen:

  1. The sample(s) should be updated to show how to apply mutations with run_in_transaction.
  2. database.batch should also automatically retry aborted transactions.
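
For option 1, a minimal sketch of what an updated sample could look like. It assumes `database` is a Database object from the Spanner client, and it reuses the Singers table from the getting-started guide; the helper names `insert_singers` and `write_singers` are made up for illustration.

```python
# Sketch only: `database` is assumed to be a Spanner Database object.
# Table and column names follow the getting-started sample.

SINGERS = [
    (1, "Marc", "Richards"),
    (2, "Catalina", "Smith"),
]


def insert_singers(transaction):
    """Buffer insert mutations on the transaction; they are sent on commit."""
    transaction.insert(
        table="Singers",
        columns=("SingerId", "FirstName", "LastName"),
        values=SINGERS,
    )


def write_singers(database):
    # run_in_transaction calls insert_singers again if Spanner aborts the
    # transaction, so the function must be safe to invoke more than once.
    database.run_in_transaction(insert_singers)
```

Because the function may run several times, it should not have side effects outside the transaction.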

After reviewing the database.batch API, we found that commit calls within a batch are automatically retried, with a default retry count of 5. You can refer to this line in the code for more details: link. The batch.insert API simply appends values to the mutations list, and these values are persisted in the database when the commit API is called on the Spanner client. This behavior is consistent across all Batch APIs defined here: link.

It's important to note that retries are only triggered in the case of an InternalServerError exception. For more information, refer to: link. However, I don't believe these retries are triggered if an Aborted exception occurs while the transaction is executing.

Yeah, this bug is specifically for Aborted errors. If the transaction is aborted by Spanner (meaning: Spanner returns an error with error code Aborted), then the transaction should be retried. The retry mechanism should be the same as for run_in_transaction: it should back off and retry, using the back-off value that is included in the Aborted error. There should not be a maximum number of retries; instead, it should stop retrying once the deadline has been exceeded.
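
Roughly, the retry behavior described above could look like the sketch below. This is not the library's code: `AbortedError`, its `retry_delay` attribute, `commit_with_retry`, and the 30-second default are all stand-ins chosen for illustration, and the jittered fallback delay is an assumption for the case where the error carries no back-off value.

```python
import random
import time


class AbortedError(Exception):
    """Hypothetical stand-in for an error with code ABORTED."""

    def __init__(self, retry_delay=None):
        super().__init__("transaction aborted")
        # Server-suggested back-off in seconds, or None if absent.
        self.retry_delay = retry_delay


def commit_with_retry(commit, timeout_secs=30.0, sleep=time.sleep,
                      clock=time.monotonic):
    """Call commit() until it succeeds or the deadline would be passed."""
    deadline = clock() + timeout_secs
    attempt = 0
    while True:
        attempt += 1
        try:
            return commit()
        except AbortedError as exc:
            delay = exc.retry_delay
            if delay is None:
                # Assumed fallback: capped exponential back-off with jitter.
                delay = random.uniform(0, min(2 ** attempt * 0.01, 1.0))
            # No retry cap: give up only when waiting would pass the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
```

The loop has no attempt limit; the only exit conditions are a successful commit, a non-Aborted error, or the deadline.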

I understand. The run_in_transaction method currently evaluates the deadline value using the keyword arguments, as defined here: link. In my opinion, it's the client's responsibility to pass the timeout (in seconds) as part of the keyword arguments for the transaction. For our use case, how should we handle the evaluation of the deadline value? Should we define a default timeout for all mutation operations and use that to set the deadline, or is there another approach we should consider?
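
To make the first option concrete, here is a rough sketch of deriving the deadline from an optional keyword argument with a default. The names `compute_deadline` and `DEFAULT_TIMEOUT_SECS`, the `timeout_secs` key, and the 30-second value are assumptions for illustration, not a proposal for the final API.

```python
import time

# Assumed default; the real value would need to be agreed on.
DEFAULT_TIMEOUT_SECS = 30


def compute_deadline(kwargs, now=time.monotonic):
    """Pop an optional timeout from kwargs and return an absolute deadline."""
    # The caller's value wins; otherwise fall back to the default.
    timeout_secs = kwargs.pop("timeout_secs", DEFAULT_TIMEOUT_SECS)
    return now() + timeout_secs
```

Popping the key keeps the remaining keyword arguments clean for whatever they are forwarded to.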