googleapis/python-ndb

Consistency issue


patb2 commented
  1. API: python-ndb
  2. App Engine
  3. Python version: Python 3.8
  4. google-cloud-ndb version: 1.9.0

Steps to reproduce

In a _post_put_hook(), we call a deferred classmethod (via Cloud Tasks) that does some additional processing based on the updated or created entity. We pass self.key to the deferred method, which then starts by retrieving the entity through a simple entity = key.get().

It works most of the time, but occasionally key.get() returns None even though the entity definitely exists. We tried adding use_cache=False, use_global_cache=False to the get(), but that did not help.
We could work around the issue by passing the entity itself (self) to the deferred method, but that is not good practice since the entity might have been modified before the deferred method runs. We could also sleep for a while and then try again (and this actually does work). But I thought Firestore in Datastore mode now provided strong consistency, which I took to mean a freshly put entity could always be read back.

The elapsed time between the put() and the deferred _post_put_hook() that performs the key.get() varies, but in one example it was around 100 ms, and the get() still returned None.

Code example

from google.cloud import ndb


class Thing(ndb.Model):
    name = ndb.StringProperty()

    def _post_put_hook(self, future):
        Thing.deferred_post_put_hook(self.key)

    # The defer() custom decorator handles calling the method via a Cloud
    # Task. It is quite reliable and is likely not guilty here.
    @classmethod
    @defer()
    def deferred_post_put_hook(cls, thing_key: ndb.Key):
        thing: Thing = thing_key.get()
        # Would process thing from here on, but on occasion thing is None.
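Since ndb.Key objects cannot travel through a Cloud Tasks payload as-is, a decorator like defer() presumably serializes them. A minimal sketch of such a round-trip, assuming the key's urlsafe form is used (the actual defer() implementation is custom and not shown in this issue):

from google.cloud import ndb

def encode_key(key: ndb.Key) -> str:
    # key.urlsafe() returns bytes suitable for a task payload or URL.
    return key.urlsafe().decode("ascii")

def decode_key(encoded: str) -> ndb.Key:
    # Reconstructs an equivalent key on whichever instance handles the task.
    return ndb.Key(urlsafe=encoded)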
        
patb2 commented

Just to make it clearer, the effect of this "deferred" mechanism is that the thing_key.get() happens a few milliseconds after the put(), possibly on another GAE instance. That is probably a factor: I have never witnessed a key.get() failing right after a put() on the same instance.

anoteboom commented

I have witnessed this problem as well in our software. We rely on this post-put hook to keep an Elasticsearch index in sync.
The key just isn't set in the hook, while it is available to the rest of the code following the put() statement.

We found the cause in the end: when the put() fails, for instance when the entity is larger than 1 MB, the _post_put_hook is still triggered, but with an empty key.
So for us it was just a side effect of another error that we could easily prevent.

class Thing(ndb.Model):
    name = ndb.StringProperty()
    desc = ndb.TextProperty()

    def _post_put_hook(self, future):
        print(self.key)


# desc is ~11 MB, well over the 1 MB entity size limit.
thing = Thing(name='too big entity', desc='d' * 11000000)
# This put() raises an exception, but _post_put_hook runs anyway with key=None.
thing.put()
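The hook receiving the put()'s future suggests a guard for this failure mode. A minimal sketch, assuming the standard Future API (future.check_success() re-raises the exception from a failed put()):

from google.cloud import ndb

class Thing(ndb.Model):
    name = ndb.StringProperty()
    desc = ndb.TextProperty()

    def _post_put_hook(self, future):
        # The hook runs even when the put() failed; check the future first.
        try:
            future.check_success()  # re-raises the put()'s exception, if any
        except Exception:
            return  # entity was not written; skip post-processing
        print(self.key)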

patb2 commented

Hi @anoteboom, and thanks for helping keep this issue alive. The case you encounter, with a failure of the put(), is different from ours, however: in our case the put() definitely did occur, and we can later retrieve the entity.

Here is an experiment we did when trying to find a workaround:

thing: Thing = thing_key.get()
if thing is None:
    thing = thing_key.get(use_cache=False, use_global_cache=False)
    if thing is not None:
        logging.debug("Thing was None with cache, and not using cache resolved it")

    if thing is None:
        time.sleep(8)
        thing = thing_key.get(use_cache=False, use_global_cache=False)
        if thing is not None:
            logging.debug("Thing was None with cache, and not using cache *and waiting* resolved it")

    if thing is None:
        logging.error(f"Error on get thing_key : {thing_key}")
        return

    # Things are OK here, continue ...

What happens is that we find all three traces in our logs, meaning that all three cases occur (alongside a large majority of successes): (a) reading the entity with a plain get() fails but reading with caches disabled succeeds; (b) reading with caches disabled fails as well, but trying again after 8 seconds succeeds; (c) reading after the wait fails too. We ended up leaving the wait in place since it recovers a number of errors, but it is not always sufficient.

patb2 commented

Hi, trying to keep this unresolved issue (still classified as a bug) alive.
The python-ndb APIs accept a read_consistency argument, which may be ndb.EVENTUAL or ndb.STRONG, and it is quite explicit that ndb.EVENTUAL can lead to transient inconsistencies. However, the default value is ndb.STRONG, so we would expect consistent reads by default.

Here are two possible ways to mitigate this (not yet tested in a production environment):

  1. Read with retry:

import time

RETRY_COUNT = 5    # illustrative values, to be tuned
RETRY_DELAY = 0.5  # seconds

def get_with_retry(thing_key):
    for _ in range(RETRY_COUNT):
        thing = thing_key.get(read_consistency=ndb.STRONG)
        if thing is not None:
            return thing
        time.sleep(RETRY_DELAY)
    raise Exception(f"Failed to get entity with key {thing_key} after {RETRY_COUNT} attempts")
  2. Read with transaction:
def get_thing_in_transaction(thing_key):
    @ndb.transactional()
    def txn():
        return thing_key.get(read_consistency=ndb.STRONG)
    return txn()

thing = get_thing_in_transaction(thing_key)

Neither of these is quite satisfactory, however.

Also, please refer to this "Google Database Products" article, which says the following:

In Datastore, there are only two APIs that provide a strongly consistent view for reading entity values and indexes: (1) the lookup by key method and (2) the ancestor query. If application logic requires strong consistency, then the developer should use one of these methods to read entities from Datastore.

The article is unfortunately undated. It would seem that a "lookup by key" (which I understand to be a key.get()) offers strong consistency.
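To make the article's two patterns concrete, here is a minimal sketch using python-ndb; the Parent kind and entity-group layout are illustrative assumptions, not taken from this issue:

from google.cloud import ndb

client = ndb.Client()

class Thing(ndb.Model):
    name = ndb.StringProperty()

with client.context():
    # Illustrative entity group: Thing entities under a common parent key.
    parent_key = ndb.Key("Parent", "p1")
    thing = Thing(parent=parent_key, name="example")
    thing.put()

    # (1) Lookup by key: strongly consistent.
    same_thing = thing.key.get()

    # (2) Ancestor query: strongly consistent within the entity group.
    things = Thing.query(ancestor=parent_key).fetch()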