googleapis/python-ndb

AttributeError: 'NoneType' object has no attribute 'replace'

remko opened this issue · 13 comments

remko commented

Hi,

We upgraded to google-cloud-ndb==1.10.0 (App Engine Standard app on Python 3.8), and the error below started occurring. It's not consistent. We didn't have this problem with our previous version of NDB (1.5.2).

Can you tell what happened from the backtrace?

  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_options.py", line 89, in wrapper
    return wrapped(*pass_args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/utils.py", line 114, in wrapper
    return wrapped(*args, **new_kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/utils.py", line 146, in positional_wrapper
    return wrapped(*args, **kwds)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/model.py", line 5381, in _put
    return self._put_async(_options=kwargs["_options"]).result()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 191, in result
    self.check_success()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 138, in check_success
    raise self._exception
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/model.py", line 5442, in put
    ds_key = yield _datastore_api.put(ds_entity, kwargs["_options"])
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_datastore_api.py", line 402, in put
    yield _cache.global_unlock_for_write(cache_key, lock)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 311, in _advance_tasklet
    yielded = self.generator.throw(type(error), error, traceback)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 654, in global_unlock_for_write
    yield _update_key(key, new_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/tasklets.py", line 315, in _advance_tasklet
    yielded = self.generator.send(send_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 666, in _update_key
    value = new_value(old_value)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/google/cloud/ndb/_cache.py", line 650, in new_value
    return old_value.replace(lock, b"")
AttributeError: 'NoneType' object has no attribute 'replace'
remko commented

FYI, a few minutes later, it happened again, on a different model.

chrisrossi commented

Thank you for reporting. Do you have any way of seeing whether the request this occurred on took an unusually long time to complete? Longer than, say, 32 seconds?

Also, are you using the Redis or Memcached implementation for the global cache?

remko commented

This is on Redis. Both requests were very fast (200–500 ms).

chrisrossi commented

Do you know if you're hitting your memory limit with Redis, or if it's configured to evict keys? Were there any Redis process restarts around that same time?

There's clearly a fault-tolerance issue that wasn't taken into account; I'm just trying to get a handle on the nature of it to see what we're up against.
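
If it's useful, those signals can also be read directly from the Redis server; a small redis-py snippet (connection details are placeholders) would be:

    # Illustrative check of eviction policy, evicted-key count, and uptime via
    # redis-py INFO; the host is a placeholder.
    import redis

    r = redis.StrictRedis(host="10.0.0.3")
    print(r.info("memory").get("maxmemory_policy"))   # e.g. 'noeviction'
    print(r.info("stats").get("evicted_keys"))        # keys evicted so far
    print(r.info("server").get("uptime_in_seconds"))  # drops to ~0 after a restart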

remko commented

I can't find anything interesting, I'm afraid:

  • This is an HA Cloud Memorystore instance (fresh from today, as it happens), with the default configuration, so no eviction AFAIK.
  • Memory usage is very low
  • At the moment, the instance is used exclusively by NDB
  • There have been no restarts, no uptime changes, and no noticeable peaks in events/commands around that time
  • The change has been deployed for 5 hours now, and it only happened twice so far, with only 7 minutes in between. Different handlers, different models, different instances.
remko commented

One thing that may or may not be relevant: we switched from Basic Cloud Memorystore to an HA Cloud Memorystore around the same time as we did the NDB upgrade (which was hours before the error occurred), so I can't tell for sure that it wouldn't have happened with 1.5.2. However, I don't think it makes a difference to NDB whether this is HA or not.

chrisrossi commented

Thanks for this info. I'll see if I can reproduce this.

Are you passing anything for strict_write when instantiating RedisCache?

remko commented

No, just RedisCache(client).
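
For reference, a minimal version of that wiring, assuming the usual per-request context setup (the host and handler names here are placeholders, not taken from the app), looks roughly like this:

    # Minimal sketch of a Redis-backed global cache with ndb; no strict_read/
    # strict_write arguments are passed, matching the setup described above.
    import redis
    from google.cloud import ndb

    ndb_client = ndb.Client()
    global_cache = ndb.RedisCache(redis.StrictRedis(host="10.0.0.3"))  # placeholder host

    def handle_request():
        # Each request runs inside an ndb context that uses the global cache.
        with ndb_client.context(global_cache=global_cache):
            ...  # normal ndb reads/writes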

remko commented

@chrisrossi I may have some more info from another occurrence. Here is what I see:

  • There are 3 parallel requests, arriving simultaneously, that access the cached model:

      2021-07-25T02:57:10.706243Z GET 200 448 B 229 ms 
      2021-07-25T02:57:10.709093Z GET 200 448 B 296 ms
      2021-07-25T02:57:10.709312Z GET 500 402 B 384 ms 
    

    The last request fails with the above error.

  • All 3 requests immediately load the model, and keep the model in memory until the end of the request.

  • All 3 requests also update the model near the end of the request (probably not relevant, but I think it's an irregularly updated property 'lastAccessedAround' that would only be updated once if the requests ran sequentially; since all 3 calls think they need to update it, they all do). A sketch of this pattern follows the list.

  • The failing update to the model (in the last request) happens after the first request finished updating the model.

  • I think you can assume that the model was not cached before these 3 calls. (From the logs, the previous access was before the global cache timeout, and all accesses should be logged.)
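
In code, the per-request pattern above is roughly the following (the model class and handler are illustrative stand-ins; only the property name comes from the description):

    # Illustrative sketch of the access pattern: the entity is read at the start
    # of the request, held in memory, and written back near the end, so several
    # overlapping requests all put() the same key.
    import datetime
    from google.cloud import ndb

    class CachedModel(ndb.Model):  # hypothetical model name
        lastAccessedAround = ndb.DateTimeProperty()

    def handle_request(ndb_client, global_cache, key_id):
        with ndb_client.context(global_cache=global_cache):
            entity = CachedModel.get_by_id(key_id)  # loaded immediately
            ...                                     # request work; entity stays in memory
            entity.lastAccessedAround = datetime.datetime.utcnow()
            entity.put()                            # updated near the end of the request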

remko commented

I looked at the other 2 occurrences, and they exhibit the same pattern as above. One of them involves only 2 requests that access the model:

    2021-07-23T14:43:40.863671Z GET 200 701 B 317 ms
    2021-07-23T14:43:40.864445Z GET 500 402 B 436 ms
  • The model was (extremely likely) not cached before the first call
  • Both requests happen simultaneously, read the model at start of request, keep it in memory, try to write at the end of the request, and the second fails.

So, it seems you can reproduce it with 2 well-timed requests.

The other occurrence even had 4 simultaneous requests accessing the model in flight, of which (only) the last failed.
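
For what it's worth, a rough reproduction attempt along those lines, assuming a locally reachable Redis and a throwaway model (both hypothetical), could look like this; since the race is timing-dependent, a single run may well not fail:

    # Hypothetical two-thread reproduction: both threads read the same entity
    # and then put() it, exercising the global cache write-lock path concurrently.
    import threading

    import redis
    from google.cloud import ndb

    class Thing(ndb.Model):  # throwaway model
        count = ndb.IntegerProperty(default=0)

    ndb_client = ndb.Client()
    global_cache = ndb.RedisCache(redis.StrictRedis())  # assumes Redis on localhost

    def read_then_write():
        with ndb_client.context(global_cache=global_cache):
            thing = Thing.get_by_id("shared") or Thing(id="shared")
            thing.count += 1
            thing.put()  # concurrent puts go through the global cache write-lock path

    threads = [threading.Thread(target=read_then_write) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()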

chrisrossi commented

Thank you, this is helpful!

I think I've managed to find the root cause of this. It is a bug involving concurrent writes and the global cache:

Thread 1 and Thread 2 are both attempting to modify the same entity at the same time:

  • Thread 1: calls global_lock_for_write, writes 00.lock1
  • Thread 2: calls global_lock_for_write, watches 00.lock1
  • Thread 1: calls global_unlock_for_write, reads 00.lock1 for the old lock value, computes a new lock value of 00
  • Thread 2: completes its CAS operation, writing 00.lock1.lock2
  • Thread 1: deletes the key, based on the computed 00 value, which it takes to mean there are no locks left.
  • Thread 2: calls global_unlock_for_write, gets None for the old value, and the above exception is raised.
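
For illustration, a stripped-down version of the unlock path that raises here (not the library's actual code; the cache calls and value format are simplified to match the timeline above) looks like this:

    # Simplified sketch of the write-unlock bookkeeping: strip this writer's lock
    # from the cached value and delete the key once no locks remain.
    LOCKED_FOR_WRITE = b"00"  # value prefix, following the notation in the timeline

    def unlock_for_write(cache, key, lock):
        # Thread 2 reaches this point after Thread 1 has already deleted the key,
        # so old_value is None and .replace() raises the AttributeError above.
        old_value = cache.get(key)
        new_value = old_value.replace(lock, b"")

        if new_value == LOCKED_FOR_WRITE:
            # Thread 1 takes this branch based on a read made before Thread 2's
            # lock was added, so it deletes the key even though a lock remains.
            cache.delete(key)
        else:
            # Stand-in for the watch/compare-and-swap retry done in _update_key.
            cache.compare_and_swap(key, old_value, new_value)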