scullxbones/akka-persistence-mongo

Replace live query `EventBus` implementation with direct cursor

scullxbones opened this issue · 5 comments

Should:
+ simplify things significantly
+ better report errors with server
+ shut down better
- take up more resources on server (cursor-per-live-query)
...

Released with v2.1.0

Hey @scullxbones, we are on 2.1.0 and we saw a spike of errors on the read side. We are using `akka.persistence.query.scaladsl.EventsByPersistenceIdQuery#eventsByPersistenceId` to watch realtime updates. Because of previous issues with the database connection, we added a check: the actor watching realtime events (the listener) also polls the persistent actor for its current sequence number, and when the listener's sequence number falls behind, the listener restarts and catches up on the missed events. But this restart cycle then repeats for every new event, so I guess the listener just doesn't see the realtime updates. This happens relatively rarely, but it slows the system down for the affected persistence entities, because the read side only gets updated after a restart, which only happens after some number of seconds.
Let me know if I can help with debugging this. For now, I'll have to roll back the plugin version upgrade.
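
For readers following along, the watchdog check described above might look roughly like the sketch below. It is only an illustration of the comparison step, not code from the plugin or from the reporter's system: `ListenerWatchdog`, `Decision`, `InSync`, `Restart`, and `check` are all hypothetical names, and how the two sequence numbers are obtained (polling the persistent actor, tracking the last envelope seen) is assumed to happen elsewhere.

```scala
// Hypothetical sketch of the listener's lag check; none of these names
// come from akka-persistence-mongo or Akka itself.
object ListenerWatchdog {

  sealed trait Decision
  // The listener has seen everything the persistent actor has written.
  case object InSync extends Decision
  // The listener is behind; re-run the query starting at fromSequenceNr.
  final case class Restart(fromSequenceNr: Long) extends Decision

  /** Compare the last sequence number the listener has seen with the
    * current sequence number reported by the persistent actor.
    */
  def check(listenerSeqNr: Long, actorSeqNr: Long): Decision =
    if (actorSeqNr > listenerSeqNr) Restart(listenerSeqNr + 1)
    else InSync
}
```

With this sketch, `ListenerWatchdog.check(40L, 42L)` yields `Restart(41)`, and the caller would re-run `eventsByPersistenceId` from sequence number 41. Since a live query can legitimately trail the write side for a moment, a real check would probably restart only if the lag persists across several polls; that threshold is left out here.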

Hi @yahor-filipchyk - I think I get the gist, although the details are not clear. (Rarely) you don't receive realtime updates after the first failure to receive realtime updates? Is that it in a nutshell? I'm assuming you re-run the query with your supervising listener.

I can definitely use your help with a minimal reproducing test to investigate this, especially if it's a race condition.

Right, sometimes it fails once for a persistence entity and then stops receiving any new updates for that entity, even after the actor restarts. The restart helps to catch up on previous updates, probably because the `eventsByPersistenceId` stream reads events from the journal first when it starts, right? I'm not sure whether it breaks for all entities under the same plugin id or only for a single entity.

We have only seen this in prod so far, where we have some amount of concurrent updates. I'm not sure how to reproduce it consistently. Perhaps an automated test would help; I can try writing one when I have time.