googleapis/python-ndb

AttributeError: 'NoneType' object has no attribute 'replace'

remko opened this issue · 13 comments

remko commented

Hi,

We upgraded to google-cloud-ndb==1.10.0 (App Engine Standard app on Python 3.8), and the error below started occurring. It's not consistent. We didn't have this problem with our previous version of NDB (1.5.2).

Can you tell what happened from the backtrace?

  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_options.py", line 89, in wrapper
    return wrapped(*pass_args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/utils.py", line 114, in wrapper
    return wrapped(*args, **new_kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/utils.py", line 146, in positional_wrapper
    return wrapped(*args, **kwds)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/model.py", line 5381, in _put
    return self._put_async(_options=kwargs["_options"]).result()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 191, in result
    self.check_success()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 138, in check_success
    raise self._exception
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/model.py", line 5442, in put
    ds_key = yield _datastore_api.put(ds_entity, kwargs["_options"])
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_datastore_api.py", line 402, in put
    yield _cache.global_unlock_for_write(cache_key, lock)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 654, in global_unlock_for_write
    yield _update_key(key, new_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 315, in _advance_tasklet
    yielded = self.generator.send(send_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 666, in _update_key
    value = new_value(old_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 650, in new_value
    return old_value.replace(lock, b"")
AttributeError: 'NoneType' object has no attribute 'replace'
remko commented

FYI, a few minutes later, it happened again, on a different model.

chrisrossi commented

Thank you for reporting. Do you have any way of seeing whether the request this occurred on took an unusually long time to complete? Longer than, say, 32 seconds?

Also, are you using the Redis or Memcached implementation for the global cache?

remko commented

This is on Redis. Both requests were very fast (200–500 ms).

chrisrossi commented

Do you know if you're hitting your memory limit with Redis, or if it's configured to evict keys? Were there any Redis process restarts around that same time?

There's clearly a fault-tolerance issue that wasn't taken into account; I'm just trying to get a handle on the nature of it to see what we're up against.
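
If it's useful, those signals can also be read directly from the Redis server; a small redis-py snippet (connection details are placeholders) would be:

    # Illustrative check of eviction policy, evicted-key count, and uptime via
    # redis-py INFO; the host is a placeholder.
    import redis

    r = redis.StrictRedis(host="10.0.0.3")
    print(r.info("memory").get("maxmemory_policy"))   # e.g. 'noeviction'
    print(r.info("stats").get("evicted_keys"))        # keys evicted so far
    print(r.info("server").get("uptime_in_seconds"))  # drops to ~0 after a restart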

remko commented

I can't find anything interesting, I'm afraid:

  • This is an HA Cloud Memorystore instance (fresh from today, as it happens), with the default configuration, so no eviction AFAIK.
  • Memory usage is very low
  • At the moment, the instance is used exclusively by NDB
  • There have been no restarts, no uptime changes, and no noticeable peaks in events/commands around that time
  • The change has been deployed for 5 hours now, and it only happened twice so far, with only 7 minutes in between. Different handlers, different models, different instances.
remko commented

One thing that may or may not be relevant: we switched from Basic Cloud Memorystore to an HA Cloud Memorystore around the same time as we did the NDB upgrade (which was hours before the error occurred), so I can't tell for sure that it wouldn't have happened with 1.5.2. However, I don't think it makes a difference to NDB whether this is HA or not.

chrisrossi commented

Thanks for this info. I'll see if I can reproduce this.

Are you passing anything for strict_write when instantiating RedisCache?

remko commented

No, just RedisCache(client).
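
For reference, a minimal version of that wiring, assuming the usual per-request context setup (the host and handler names here are placeholders, not taken from the app), looks roughly like this:

    # Minimal sketch of a Redis-backed global cache with ndb; no strict_read/
    # strict_write arguments are passed, matching the setup described above.
    import redis
    from google.cloud import ndb

    ndb_client = ndb.Client()
    global_cache = ndb.RedisCache(redis.StrictRedis(host="10.0.0.3"))  # placeholder host

    def handle_request():
        # Each request runs inside an ndb context that uses the global cache.
        with ndb_client.context(global_cache=global_cache):
            ...  # normal ndb reads/writes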

remko commented

@chrisrossi I may have some more info from another occurrence. Here is what I see:

  • There are 3 parallel requests, arriving simultaneously, that access the cached model:

      2021-07-25T02:57:10.706243Z GET 200 448 B 229 ms 
      2021-07-25T02:57:10.709093Z GET 200 448 B 296 ms
      2021-07-25T02:57:10.709312Z GET 500 402 B 384 ms 
    

    The last request fails with the above error.

  • All 3 requests immediately load the model, and keep the model in memory until the end of the request.

  • All 3 requests also update the model near the end of the request (probably not relevant, but I think it's an irregularly updated property 'lastAccessedAround' that would only be updated once if the requests ran sequentially; since all 3 calls think they need to update it, they all do). A sketch of this pattern follows the list.

  • The failing update to the model (in the last request) happens after the first request finished updating the model.

  • I think you can assume that the model was not cached before these 3 calls. (From the logs, the previous access was before the global cache timeout, and all accesses should be logged.)
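
In code, the per-request pattern above is roughly the following (the model class and handler are illustrative stand-ins; only the property name comes from the description):

    # Illustrative sketch of the access pattern: the entity is read at the start
    # of the request, held in memory, and written back near the end, so several
    # overlapping requests all put() the same key.
    import datetime
    from google.cloud import ndb

    class CachedModel(ndb.Model):  # hypothetical model name
        lastAccessedAround = ndb.DateTimeProperty()

    def handle_request(ndb_client, global_cache, key_id):
        with ndb_client.context(global_cache=global_cache):
            entity = CachedModel.get_by_id(key_id)  # loaded immediately
            ...                                     # request work; entity stays in memory
            entity.lastAccessedAround = datetime.datetime.utcnow()
            entity.put()                            # updated near the end of the request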

remko commented

I looked at the other 2 occurrences, and they exhibit the same pattern as above. One of them involves only 2 requests that access the model:

    2021-07-23T14:43:40.863671Z GET 200 701 B 317 ms
    2021-07-23T14:43:40.864445Z GET 500 402 B 436 ms
  • The model was (extremely likely) not cached before the first call
  • Both requests happen simultaneously, read the model at start of request, keep it in memory, try to write at the end of the request, and the second fails.

So, it seems you can reproduce it with 2 well-timed requests.

The other occurrence even had 4 simultaneous requests accessing the model in flight, of which (only) the last failed.
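
For what it's worth, a rough reproduction attempt along those lines, assuming a locally reachable Redis and a throwaway model (both hypothetical), could look like this; since the race is timing-dependent, a single run may well not fail:

    # Hypothetical two-thread reproduction: both threads read the same entity
    # and then put() it, exercising the global cache write-lock path concurrently.
    import threading

    import redis
    from google.cloud import ndb

    class Thing(ndb.Model):  # throwaway model
        count = ndb.IntegerProperty(default=0)

    ndb_client = ndb.Client()
    global_cache = ndb.RedisCache(redis.StrictRedis())  # assumes Redis on localhost

    def read_then_write():
        with ndb_client.context(global_cache=global_cache):
            thing = Thing.get_by_id("shared") or Thing(id="shared")
            thing.count += 1
            thing.put()  # concurrent puts go through the global cache write-lock path

    threads = [threading.Thread(target=read_then_write) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()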

chrisrossi commented

Thank you, this is helpful!

I think I've managed to find the root cause of this. It is a bug involving concurrent writes and the global cache:

Thread 1 and Thread 2 are both attempting to modify the same entity at the same time:

  • Thread 1: calls global_lock_for_write, writes 00.lock1
  • Thread 2: calls global_lock_for_write, watches 00.lock1
  • Thread 1: calls global_unlock_for_write, reads 00.lock1 for the old lock value, computes a new lock value of 00
  • Thread 2: completes its CAS operation, writing 00.lock1.lock2
  • Thread 1: deletes the key, based on the computed 00 value, which it takes to mean there are no locks left.
  • Thread 2: calls global_unlock_for_write, gets None for the old value, and the above exception is raised.
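
For illustration, a stripped-down version of the unlock path that raises here (not the library's actual code; the cache calls and value format are simplified to match the timeline above) looks like this:

    # Simplified sketch of the write-unlock bookkeeping: strip this writer's lock
    # from the cached value and delete the key once no locks remain.
    LOCKED_FOR_WRITE = b"00"  # value prefix, following the notation in the timeline

    def unlock_for_write(cache, key, lock):
        # Thread 2 reaches this point after Thread 1 has already deleted the key,
        # so old_value is None and .replace() raises the AttributeError above.
        old_value = cache.get(key)
        new_value = old_value.replace(lock, b"")

        if new_value == LOCKED_FOR_WRITE:
            # Thread 1 takes this branch based on a read made before Thread 2's
            # lock was added, so it deletes the key even though a lock remains.
            cache.delete(key)
        else:
            # Stand-in for the watch/compare-and-swap retry done in _update_key.
            cache.compare_and_swap(key, old_value, new_value)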