dart-lang/labs

The retry datastore impl. is causing increased transaction aborted errors and prevents updates to the datastore

Closed this issue · 16 comments

Adding retries in the back ground has caused an increase in transaction aborted errors with datastore. When upgrading to this client there is no exposure to set the retry options and using this implementation has caused objects in our datastore to not be updated.

This issue was encountered last week on Monday. None of our tasks could be updated as we were getting transaction aborted errors with almost ever access to the database.

image

This issue happened when this package was added to appengine along with the following updates:

package good version bad version
appengine 0.13.5 0.13.6
gcloud 0.8.9 0.8.10
grpc 3.2.2 3.2.3

Downgrading to the previous versions removes the transaction aborted errors and we have no issue updating our datastore.

For further context as it took some time to debug this:

flutter/flutter#131310

There is some bug in this code that when used will add and increase the amount of transaction aborted errors as these errors are not observed with the previous version.

isoos commented

On the screenshot I see one failure, reported in two log lines (same timestamp), but otherwise it doesn't seem to be an actual retry with the transaction. We are explicit that if the transaction is aborted with the proper error code, then it is not retried: https://github.com/dart-lang/gcloud/pull/168/files#diff-c70ebb2957892df2cc4ec8e9b8d5ee58861f5f8d9d1aec1d32cc3e16e707d050R152

Is there more detail what happens inside that transaction?

Both your logic and ours are based on catching TransactionAbortedError, but it's not clear where that exception will be thrown

I see only a couple of error code handling: https://github.com/dart-lang/gcloud/blob/12e82a1ad3ce22160b6799dcb1675a7311b85efb/lib/src/datastore_impl.dart#L288

From our log,

Failed to process Instance of 'PushMessage'. (500) The transaction was aborted. #0 

It's a 500 error. We don't seem to touch the retry logic either from our own retry implementation.

@isoos Here is the exact error:
image

Within a 2 hour window here with the retry datastore implementation we have 600+ errors with transaction aborted reported: https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22gae_app%22%0Aresource.labels.module_id%3D%22default%22%0Aresource.labels.version_id%3D%22version-803a94e%22%0Aseverity%3DERROR%0A%0A-%22%2Fapi%2Fdart-internal%22;cursorTimestamp=2023-07-31T11:45:05.170Z;startTime=2023-07-31T11:43:54.000Z;endTime=2023-07-31T13:25:06.000Z?serviceId=default&project=flutter-dashboard

To contrast that here after we deployed this morning (08/01/2023) with reverted versions we have 2 of those errors in the last 6 hours:
https://pantheon.corp.google.com/logs/query;query=resource.type%3D%22gae_app%22%0Aresource.labels.module_id%3D%22default%22%0Aresource.labels.version_id%3D%22version-24f2a77%22%0A-%22%2Fapi%2Fdart-internal%22;cursorTimestamp=2023-08-01T21:26:36.851Z;duration=PT6H?serviceId=default&project=flutter-dashboard

Screenshot shows the difference in the implementation as this is from the previous version without retries:
image

Something in your retry implementation is too aggressive and all the conflicts with transaction aborted prevent us from upgrading our entities in Datstore.

I think the links in pantheon may expire as the versions get redeployed.

isoos commented

Neither gcloud nor appengine has an error starting with "Failed to process", that seems to be coming from a different place.

I might be wrong but the screenshots attached indicate no actual retry, only double-reporting. @jonasfj may be also check the linked internal links.

The "Failed to process" is from our subscription handler logging. It is processing the actual request. That part is just repeating the actual exception that occurred with the abort. The message that starts with "The transaction was aborted..." is the untouched (by us) exception that is encountered from the datastore call.

That is correct it is double reporting. And now I am curious as to why it is not retrying and producing copious amounts of errors though.

Both your logic and ours are based on catching TransactionAbortedError, but it's not clear where that exception will be thrown

I think TransactionAbortedError will be thrown by datastore.Datastore.

In package:appengine version 0.13.6 we wrapped the default datastore.Datastore object, with RetryDatastoreImpl.
The RetryDatastoreImpl does not retry TransactionAbortedError. Indeed it is your responsibility to handle those.

I think the correct way to do this, is something like:

// Important: we don't just retry the commit operation!
//            We retry: lookup, check, modify, and commit!
//            Such that we lookup new data.
await retry(
  () async {
    // Create a transaction
    await dbService.withTransaction((tx) async {
      // Lookup data that we want to change
      final entity = await tx.lookup(...);

      // Check if changes are necessary
      if (entity.value == newValue) {
        await tx.rollback();
        return;
      }

      // Make changes
      entity.value = newValue;

      // Commit the changes we made
      await tx.commit(insert: [entity]);
    });
  },
  // If the operation above failed because of a transaction issue
  // this could be because the transaction had a conflict, very
  // typical if you have write-congestion on the same entity.
  //
  // When this happens we retry the entire transaction, we lookup
  // the entity again, make changes and try to commit it again.
  // Note that we always check if changes are necessary!
  // This is because the entity might have been changed by a
  // concurrent process, perhaps one that made the exact same
  // change we were trying to make.
  retryIf: (e) => e is TransactionAbortedError,
);

®ricardoamador

In the logs I also see a lot of 400s with a log line:
Failed to process Instance of 'PushMessage'. (400) null is unknown

So I am confused as this code in the retry_datastore_impl specifically specifies a retryIf function which checks for transactionAborted: https://github.com/dart-lang/gcloud/blob/12e82a1ad3ce22160b6799dcb1675a7311b85efb/lib/src/retry_datastore_impl.dart#L152.

Which is run when commit is called here: https://github.com/dart-lang/gcloud/blob/master/lib/src/retry_datastore_impl.dart#L63

The delegate is being set up to run the commit with retries on transaction aborted errors. Not sure if that is intentional but it is being run as you can see it in the stack trace.

@jonasfj the 400 null is unknown is something else (LUCI) and has since been fixed.

isoos commented

So I am confused as this code in the retry_datastore_impl specifically specifies a retryIf function which checks for transactionAborted:

Note: that if statement returns false, so retry won't happen:
https://github.com/dart-lang/gcloud/blob/12e82a1ad3ce22160b6799dcb1675a7311b85efb/lib/src/retry_datastore_impl.dart#L156

Ah okay. Yeah it definitely is a datastore.Transaction aborted from the grpc impl. The perhaps this error is a symptom of something else. I am not sure what though just that we see an increase in these errors with the retry client.

Okay I enabled audit logging in our gcp project for datastore. The issue now will be finding time to run a test since this negatively impacts our operations.

It's technically possible that the commit request fails after the server got the payload.
And hence, a retry happens from the client, but it fails on the server because it already received the commit request for the given transaction.

But I'm not sure this would cause a TransactionAbortedError. Also seems like something that'd be rare, normally requests fails when sending, not while reading the response.


I'm very interested to hear what you learn :)

Okay I meant to get back to this but got sidetracked. We never ran any tests as the issue seemed to have resolved. I cannot say for sure how or why though but am going to close this as not needed. Thanks @jonasfj and @isoos ! Most likely a bug within our code.