mongodb-partners/mongo-rocks

secondary lost data

baiwfg2 opened this issue · 12 comments

Recently I've found a problem about secondaries losing records on benchmarking MongoRocks 4.0.

Loading a small amount of data is ok, and secondaries have got all data. But when loading a relatively large amount of data, it's very easily reproducible for secondary to lost several records. Here's my settings.

YCSB workload

recordcount=500000
operationcount=1000
workload=site.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0
updateproportion=0
scanproportion=0
insertproportion=1

requestdistribution=zipfian

fieldlength=1024
fieldcount=5
mongodb.url=mongodb://10.108.103.160:50001/ycsb?w=1

threadcount=16
table=t

Just execute the YCSB loading, and in the end I always find several records have been lost, either by my own python script(iterating primary and checking whether records exist on secondaries) or by mongo shell's count() or itcount()

image
The screenshot above shows host 50003 has lost one record.

In the mongod(secondaries) log file, there're following error that I'm not sure whether it's relevant:
image
I guess it's not very relevant, because host 50002 also has that error, but got all data instead.

@baiwfg2 I tried several times with your workload only to find latency but not find data lost on secondaries.

  1. Can you post your mongod's configuration file please?
  2. Can you turn fieldlength from 1024 to 100 to check if this can be reproduced? Since small values load faster and can eliminate IO's effect.
  3. Are you sure you did not change something about mongodb or rocksdb source code?

My running command is :

taskset -c 1-12 ./mongod-rocks --replSet rs1 --storageEngine rocksdb --dbpath $dbpath --bind_ip 0.0.0.0 --port 50001 --logpath 5k1.log

The other two mongod are similar.

The RS config is:

rs.initiate({ _id: "rs1", version: 1, members: [ { _id:0,host:"10.108.103.160:50001"},{_id:1,host:"9.34.46.252:50002"},{_id:2,host:"10.108.117.196:50003"} ], settings: { chainingAllowed: false} })

Pretty sure that I haven't changed original code except my own comments.

Today I intent to enable two secondaries's verbose logging and see if anything is missed.

@baiwfg2 , as I've not been able to reproduce(probably because my machine has not many cores). I'd suggest you to try

  1. to check if the oplogs on secondary matches primary, if not, which one is lost
  2. add secondary to catch up after loading data to primary is finished to see if the newly added node can reproduce?
    If 2) never reproduces, this probably is because the oplog_read_source may problemicly missed
    some entries, which may intend that, "all_committed" calculation is bugful
  3. patch lines below to totdb to reproduce? I do believe this has no relation with lines below but we can have a try.
diff --git a/utilities/transactions/totransaction_db_impl.cc b/utilities/transactions/totransaction_db_impl.cc
index 971768a..81d0ffd 100644
--- a/utilities/transactions/totransaction_db_impl.cc
+++ b/utilities/transactions/totransaction_db_impl.cc
@@ -368,10 +368,6 @@ Status TOTransactionDBImpl::AddCommitQueue(const std::shared_ptr<ATN>& core,
     if (ts < core->first_commit_ts_) {
       return Status::InvalidArgument("commit-ts smaller than first-commit-ts");
     }
-    if (ts < core->commit_ts_) {
-      // in wt3.0, there is no restriction like this one.
-      return Status::InvalidArgument("commit-ts must be monotonic in a txn");
-    }
     core->commit_ts_ = ts;
     return Status::OK();
   }
diff --git a/utilities/transactions/totransaction_impl.cc b/utilities/transactions/totransaction_impl.cc
index 1cd758d..11af88c 100644
--- a/utilities/transactions/totransaction_impl.cc
+++ b/utilities/transactions/totransaction_impl.cc
@@ -88,17 +88,6 @@ Status TOTransactionImpl::SetCommitTimeStamp(const RocksTimeStamp& timestamp) {
     return Status::NotSupported("not allowed to set committs to 0");
   }
 
-  if (core_->commit_ts_set_) {
-    if (core_->commit_ts_ > timestamp) {
-      return Status::NotSupported("commit ts need equal with the pre set");
-    }
-    if (core_->commit_ts_ == timestamp) {
-      return Status::OK();
-    }
-  }
-
-  assert((!core_->commit_ts_set_) || core_->commit_ts_ < timestamp);
-
   // publish commit_ts to global view
   auto s = txn_db_impl_->AddCommitQueue(core_, timestamp);
  1. keep recordcount not change but change filelength to 100 to reproduce. this will reduce much time for one run.

I'm trying to find a machine with more cpus to reproduce.

@baiwfg2 I've run for 10 passes of this ycsb parameters on a 32core machine. each node can burst into 800% or more cpu. I didnt find any unmatch between primary and two secondaries.
I'll try to run for more turns.
If you have tried these suggestions I listed above, you can post your results here.

I've performed 4 times table inserts, and secondaries in the first 3 cases have succeed in getting all data. In the 4th cases, I increased the recordcount to 2 000 000, and finally one of secondary (9.34.46.252:50002 in my case)missed two records, although I waited enough time for it to catch up.
image

@baiwfg2 thanks for your insist, I wrote a auto run script to reproduce.
After running for hundreds of turns, it happens.
I will dig into that.

I've found how it happened.

        switch (_timestampReadSource) {
            case ReadSource::kUnset:
            case ReadSource::kNoTimestamp: {
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    auto status =
                        txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();
                break;
            }

here you can see, txn begins, and then query oplogReadTimestamp (allCommitTs) as readTs.
The problem here can be describe below
[A begins] [B commit(ts = 100] [advance allCommitTs to 100] [A set readTs = 100] [A read]
from timestamp aspect, A can see B, but as B commits after A begins, from SI aspect, A can not see B. In this way, a hole occurs.

how to fix

easy to fix, just query oplogReadTs before txn open(when snapshot is taken). like below

                Timestamp readTs;
                if (_isOplogReader) {
                     readTs = _oplogManager->getOplogReadTimestamp();
                }
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    auto status =
                        txnOpen.setTimestamp(readTs,
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }

confusions

@baiwfg2 if you dig into mongodb 4.2 series, you can see the official mongodb has changed the logic here as the latter type. however, in mongodb 4.0.x series, the logic acts as the above former type. mongoRocks follows mongo-wt here. So does this figure out that mongo-wt4.0 series also has bug here? I'm trying to construct some race conditions on mongo-wt4.0 to reproduce.
If I can not reproduce it on mongo-wt4.0, I should have misunderstood something about wiredtiger.
but if I can reproduce, it will be true that mongo4.0 series all over the world has the possibility of losing data.

how to let this race happen more frequent ?

if you add a random sleep between txn-begin and query oplogReadTs, this problem can be reproduced much more frequent, the demo code is like below

        switch (_timestampReadSource) {
            case ReadSource::kUnset:
            case ReadSource::kNoTimestamp: {
                RocksBeginTxnBlock txnOpen(_db, &_transaction);

                if (_isOplogReader) {
                    if (rand() % 100 == 1) {
                        usleep(10000);
                    }
                    auto status =
                        txnOpen.setTimestamp(Timestamp(_oplogManager->getOplogReadTimestamp()),
                                             RocksBeginTxnBlock::RoundToOldest::kRound);
                    invariant(status.isOK(), status.reason());
                }
                txnOpen.done();
                break;
            }

I found wiredTiger reset a new snapshot when settting readTimestamp, it guarantees the consistent view of SI and timestamp. I missed that and got this bug.
I'll make totdb the same as wt at this.
It is the root cause.

fixed by this
commit wolfkdy/rocksdb@5eb8a67

Nice diagnostic work, and thanks for the clear and detailed description.

@agorrod thanks for your attention, Alex :)