scullxbones/akka-persistence-mongo

ScalaDriverPersistenceReadJournaller: Cursor expiration aborts streaming of current persistence IDs

Closed this issue · 4 comments

I have a job that processes the current persistence IDs streamed from a read journal. The job needs more than 10 minutes to process ~1 million persistence IDs, at which point the persistence ID stream fails because the MongoDB server has deleted the cursor.

Remote stream (akka://<HOSTNAME>/StreamSupervisor-81/$$5-SinkRef-57) failed, reason:
Query failed with error code -5 and error message 'Cursor 26931902772071 not found on server <MONGODB>:27017' on server <MONGODB>:27017

I'm working around the problem in a downstream project by splitting the aggregation into restartable chunks. The workaround replaces the aggregation

journal.aggregate([{$project:{pid:1}},{$group:{_id:'$pid'}}])

with a series of aggregations (the $project stage does nothing btw; run it with {explain:true} to see)

journal.aggregate([
  {$sort:{pid:1}},
  {$match:{pid:{$gt:LAST_KNOWN_PID}}},
  {$limit:BATCH_SIZE},
  {$group:{_id:'$pid'}},
  {$sort:{_id:1}}
])

each of which is restartable, governed by a configured timeout. This requires two additional configuration options: one for the batch size and one for the timeout.
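To illustrate the idea, here is a minimal sketch of the chunked loop in plain JavaScript. It is not the downstream project's code: the in-memory `journal` array, `fetchBatch`, `flakyFetch`, and `currentPersistenceIds` are hypothetical names made up for this example, and a simple retry counter stands in for the configured timeout. `fetchBatch` only emulates the batched pipeline above so the loop can run without a MongoDB server.

```javascript
// Hypothetical in-memory stand-in for the journal: one document per event.
const journal = [
  { pid: "a" }, { pid: "a" }, { pid: "b" }, { pid: "c" },
  { pid: "c" }, { pid: "c" }, { pid: "d" },
];

// Emulates the batched pipeline:
// $sort by pid, $match pid > lastKnownPid, $limit batchSize, $group by pid, $sort by _id.
function fetchBatch(lastKnownPid, batchSize) {
  const limited = journal
    .map((doc) => doc.pid)
    .sort()
    .filter((pid) => pid > lastKnownPid)
    .slice(0, batchSize);
  return [...new Set(limited)].sort();
}

// Collects all distinct pids by running short batches. A failed batch is
// retried from the last pid that was successfully read, so no single
// cursor has to stay alive for the whole scan.
// Assumption: a retry counter replaces the timeout mentioned in the issue.
function currentPersistenceIds(fetch, batchSize, maxRetries) {
  const pids = [];
  let lastKnownPid = ""; // sorts before every non-empty pid
  let retries = 0;
  for (;;) {
    let batch;
    try {
      batch = fetch(lastKnownPid, batchSize);
    } catch (err) {
      retries += 1;
      if (retries > maxRetries) throw err;
      continue; // restart the same chunk
    }
    retries = 0;
    if (batch.length === 0) return pids; // nothing after lastKnownPid: done
    pids.push(...batch);
    lastKnownPid = batch[batch.length - 1];
  }
}

// Simulated failure: every second call loses its cursor.
let calls = 0;
function flakyFetch(lastKnownPid, batchSize) {
  calls += 1;
  if (calls % 2 === 0) throw new Error("Cursor not found");
  return fetchBatch(lastKnownPid, batchSize);
}

const result = currentPersistenceIds(flakyFetch, 3, 2);
console.log(result); // all four distinct pids, in sorted order
```

Note that $limit counts events, not persistence IDs, so a batch may yield fewer distinct pids than BATCH_SIZE. The loop still makes progress, because each non-empty batch advances LAST_KNOWN_PID by at least one pid.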

@scullxbones Would you replace the current implementation of ScalaDriverPersistenceReadJournaller.currentPersistenceIds with something like that? I could prepare a pull request.

Hi @yufei-cai -

Is this something that can be handled by using .noCursorTimeout()? I guess the question becomes how to enable this flag, since the ReadJournal calls don't expose much of a surface.

Or maybe not ... SERVER-6036

It is not a bad idea to delete cursors after a while -- for all MongoDB knows, my cluster might be dead. Sad that the database and drivers ain't managing cursor lifecycles as well as they could.

I'm getting sporadic cursor timeouts even with chunked aggregations. Gonna watch it for a few weeks.

After numerous mysterious "cursor not found" errors, despite the lack of any long-lived cursors on MongoDB, I'm forced to conclude that a blanket "get-all-pids" query isn't worth making.