mongodb-partners/mongo-rocks

Random failures of 'CheckReplOplog' in *_passthrough tests suites

igorsol opened this issue · 7 comments

Hello,

This issue affects all passthrough test suites. Here is the list (probably incomplete):

  • read_concern_majority_passthrough
  • aggregation_read_concern_majority_passthrough
  • replica_sets_jscore_passthrough
  • read_concern_linearizable_passthrough

This issue also affects both 3.4 and master branches. You can see example failures here (see the red lines in the left column on the pages):

Our investigation has shown that this issue is caused by this commit: 422336 by @AdallomRoy

We also have a possible fix here: #107
We test this fix now and so far our tests are good but it would be great to have more opinions and reviews.

Hi Igor,
Great catch!
Doesn't this patch basically remove the optimization for backward cursors? or am I missing something?

Roy.

@AdallomRoy Yes, I think this just removes the optimization for the backward cursors. However, master branch of mongo reimplements oplog visibility rules significantly, so this code-path will require big rewrite soon anyway. See #106

Got it. Probably the better fix would be that the backward cursor would check if the record is currently visible in terms of oplog visibility... but not sure it's critical.
Can you also merge it to v3.2? Thanks!

Done

Hi Roy and Igor,

You are right, my fix disabled optimisation for backward cursors.
But since we still want this optimisation for v3.4 and v3.2 I did additional debugging and found what is going wrong with backward cursors:

During the tests sometimes save()/restore() sequence is called on just created backward cursor BEFORE first call to next(). This causes lost of _skipNextAdvance state which should be true in case of oplog first record retrieval hack. Thus when next() is called redundant advance happens and cursor loses one document.

The fix for this is here: igorsol@031488e
It re-enables first record retrieval hack for backward cursors and fixes CheckReplOplog issues.
I will create pull request after additional tests.

Good find, thanks @igorsol !

Our testing has finally finished. Pull request is here: #111