mongodb-partners/mongo-rocks

Compaction of dropped prefixes on startup takes forever

Closed this issue · 20 comments

We have a huge MongoRocks single instance (13TB compressed, spinning disk). We took it down for a short maintenance window and encountered compaction problems on startup, which seem to be related to code introduced a few months ago.

After restart, mongo.log will say:

2018-02-05T15:31:31.928+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "0.0.0.0", port: 27017 }, processManagement: { fork: true, pidFilePath: "/var/run/mongod.pid" }, security: { authorization: "enabled", keyFile: "/etc/mongod.keyfile" }, setParameter: { rocksdbConcurrentReadTransactions: "256", rocksdbConcurrentWriteTransactions: "256" }, storage: { dbPath: "/data/mongodb", engine: "rocksdb", journal: { enabled: true }, rocksdb: { configString: "target_file_size_multiplier=2;max_background_flushes=4;max_background_compactions=60;compression_per_level=kNoCompression:kSnappyCompression:kZlibComp..." } }, systemLog: { destination: "file", logAppend: true, logRotate: "reopen", path: "/var/log/mongodb/mongod.log" } }
2018-02-05T15:31:32.365+0000 I STORAGE  [initandlisten] 1821 dropped prefixes need compaction

and the database is locked for writing; only reads are served.

Looking at the db/LOG file we see:

2018/02/05-15:54:25.106365 7fc0054b2700 (Original Log Time 2018/02/05-15:54:25.105318) Calling FlushMemTableToOutputFile with column family [default], flush slots available 4, compaction slots allowed 1, compaction slots scheduled 1

It seems to ignore the max_background_compactions=60 setting, so only one CPU core is used and progress is very slow. In the meantime, the DB is locked for writing.

The questions:

  1. Can this be somehow tuned to use more than 1 compaction thread?
  2. Why does it wait for a restart to do all those compactions? We dropped some heavy collections a few weeks ago, but that was done after we upgraded to the current version (PSMDB v3.4.7-1.8). We would expect it to schedule those compactions immediately after dropping, not wait for a restart.
  3. Why is the DB locked for writing during this process?

Update:
After around 24 hours, it still says (when restarted): 1528 dropped prefixes need compaction. In other words, it will take about 5 more days to complete at this rate...
In the meantime, the DB is not actually locked for writes, but they are extremely slow, making the DB practically unusable in production.

Perhaps the extreme slowdown we're experiencing while this is running is related to this: facebook/rocksdb#3198?

Hm, this is unexpected. What's the CPU utilization on that one core? Can you share more of the db/LOG file?

We have one thread constantly at 100% CPU; all the other threads are pretty much idle.

Here are all the rocksdb LOG files since the restart that triggered this:
rocksdb_log.tar.gz
For reference, the mongod restarts happened around these times:
2018-02-05T10:28:01.923+0000 I STORAGE [initandlisten] 1869 dropped prefixes need compaction
2018-02-05T11:43:38.168+0000 I STORAGE [initandlisten] 1867 dropped prefixes need compaction
2018-02-05T12:19:04.052+0000 I STORAGE [initandlisten] 1866 dropped prefixes need compaction
2018-02-05T14:54:28.188+0000 I STORAGE [initandlisten] 1840 dropped prefixes need compaction
2018-02-05T15:18:21.474+0000 I STORAGE [initandlisten] 1833 dropped prefixes need compaction
2018-02-05T15:22:08.518+0000 I STORAGE [initandlisten] 1827 dropped prefixes need compaction
2018-02-05T15:31:32.365+0000 I STORAGE [initandlisten] 1821 dropped prefixes need compaction
2018-02-05T16:27:17.586+0000 I STORAGE [initandlisten] 1805 dropped prefixes need compaction
2018-02-06T12:15:23.719+0000 I STORAGE [initandlisten] 1528 dropped prefixes need compaction

If needed, I can extract more relevant lines from the MongoDB log (I can't attach the whole file as it contains too much internal data).

Hello @zoharby
I will try to explain what is going on.

  1. There was a bug in mongo-rocks: when you deleted an index or a collection, the entity was marked as deleted but the actual data records were never freed from the RocksDB database. This is why you see messages like "1869 dropped prefixes need compaction". It means at least 1869 indexes or collections were deleted in your database before you upgraded to PSMDB v3.4.7-1.8, but all that data is still in your database.
  2. The bug was fixed in PSMDB v3.4.4-1.4. Here is a quote from the release notes:

Fixed cleanup of deleted documents and indexes for MongoRocks. When you upgrade to this release, deferred compaction may occur and cause database size to decrease significantly.

Now you see this in action. The good thing is that this is a one-time event: once all those dropped prefixes are compacted, your DB will return to normal operation. The bad thing is that thousands of dropped collections/indexes probably mean terabytes of data in your case, and compaction speed is limited by your storage devices.

Now I will try to answer your questions:

  1. The current implementation runs compactions of dropped prefixes sequentially in a dedicated thread. This is not configurable. If compaction speed is limited by the storage subsystem, adding more threads may even slow down overall progress.
  2. The current implementation does not wait for a restart. When a collection is dropped, it is immediately scheduled for compaction. In your case there are simply many deferred compactions that piled up before you upgraded to the fixed version.
  3. As you already mentioned, the DB is not actually locked for writes. It looks like compaction running over a huge amount of data slows down the writes going on in parallel.

@zoharby Thanks for the logs. Looking at them, there seems to be no back-pressure initiated by RocksDB to slow down writes. In that case, the write slowdown is most likely related to facebook/rocksdb#3198. Would you mind getting a stack trace capture using gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p <mongo-pid>?

Thanks @igorsol,

We're well aware of the fix you mentioned, but the thing is that this instance is quite new (installed in Nov 2017). I'm not sure whether it ran 3.4.7 from the start, but it definitely never ran a version older than 3.4.4. Also, as can be seen from the periodic prints in the log, its version was not upgraded prior to the restart that triggered the event:

2018-01-25T06:28:48.712+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
2018-01-26T06:31:08.530+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
2018-01-27T06:38:19.494+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
2018-01-28T06:34:48.853+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
2018-01-29T06:30:33.439+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
2018-01-30T06:29:01.155+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
--> Service + server restart here
2018-02-05T10:27:02.553+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-05T10:28:01.923+0000 I STORAGE  [initandlisten] 1869 dropped prefixes need compaction

We did drop a few dozen (but surely not 1869) very heavy collections between 18/1-24/1, and we suspect that those drops are what triggered this compaction on the next restart. Is it possible that there is a bug in the implementation of the mentioned fix that effectively defers those compactions until the next server restart? Otherwise, there is no other explanation as to why this was triggered just now and not straight away after the drop operations. It's worth mentioning that these may well have been the first-ever drop collection operations on this instance.

As to your answer to question 3: a background compaction running in a single thread shouldn't have such a dramatic effect on DB performance, especially when its bottleneck is the CPU and not the filesystem (as is evident from the single thread constantly at 100% CPU). Here's the I/O load of the disk for the last month:
[image: disk I/O load over the last month; the maintenance event after which the issue started is marked by the vertical red bar on Feb 5th]
You may notice this disk is able to handle loads an order of magnitude larger than what it's doing while compacting. While the compaction was running, we tried resuming the same insert load it had between Jan 30th and Feb 5th (which is much lower than during the weeks before), but it just wouldn't hold anything, and writes would get stuck for anywhere from seconds to minutes.

We're now directing the inserts to a temporary new instance we installed as an emergency measure, and we're hoping the DB will return to normal performance once compaction is over, so we can direct inserts back to it and move all the data from the temporary instance back to this one.

Finally, if there's anything else we can do to help figure this out, just let us know. We can hop on a call, or maybe even give you temporary access to the server.

@igorcanadi - here is the current output:
trace.log

Note that no writes at all are currently being made to the instance. We can trigger some load if needed.

@zoharby Can you please trigger some load? From the stack trace you posted, it doesn't seem like the mutex is hot. I would be curious to see where the writes are blocked.

@zoharby I understand that this instance has never run a version older than 3.4.4. But which version was used to create and populate this DB? Is it possible that the DB was created by an older version?

We did drop a few dozen (but surely not 1869) very heavy collections between 18/1-24/1, and we suspect that those drops are what triggered this compaction on the next restart.

That number (1869) also includes all dropped indexes (for example, the reIndex command will create N dropped prefixes, where N is the number of indexes on the collection).
Do you have logs from the previous start of the instance, i.e. from when the instance was started before you dropped those few dozen collections? If you have those logs, you can check the number of dropped prefixes there. Was it zero?

I don't think compactions could be delayed until a restart, but if, for example, there were 2500 dropped prefixes in the queue and the server was restarted when this number had dropped to 1869, that could explain your observations.
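To illustrate the reIndex point with a rough mongo shell sketch (the collection name is hypothetical, and the numbers are just an example of the bookkeeping described above, not command output):

db.events.getIndexes().length   // hypothetical collection "events"; suppose this returns 5
db.events.reIndex()             // drops and rebuilds all 5 indexes, adding 5 dropped prefixes to the compaction queue

So a handful of reIndex calls on heavily indexed collections can account for far more dropped prefixes than the number of dropped collections alone.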

Here are a few stack traces and a fresh RocksDB LOG taken after turning some load back on. It seems the connection threads (connXX) were also constantly at 100% CPU when I resumed writes, which is quite strange...
trace_with_writes.log
trace_with_writes_2.log
trace_with_writes_3.log
rocksdb_log_with_load.tar.gz

@igorsol - The instance was built in November with a clean DB, and we gradually migrated all the data into it from a WiredTiger instance using application-based inserts. So I don't think there's any scenario in which it contains pre-3.4.4 data.

Unfortunately, our log only goes back to Jan 25th, so I don't have the previous service start logged. As I noted, I don't think any collection drops were made on this instance before the previous restart (not anything massive anyway).

Here are some logs from another replica-set instance, showing the dropped prefixes count fluctuating between different non-zero values after successive restarts which do not involve a version update:

2018-02-07T06:30:21.780+0000 I CONTROL  [signalProcessingThread] db version v3.4.7-1.8
2018-02-07T13:17:07.932+0000 I CONTROL  [initandlisten] MongoDB starting : pid=19974 port=27017 dbpath=/data/mongodb 64-bit host=g***2
2018-02-07T13:17:07.932+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-07T13:17:08.030+0000 I STORAGE  [initandlisten] 0 dropped prefixes need compaction
2018-02-07T13:19:05.047+0000 I CONTROL  [initandlisten] MongoDB starting : pid=763 port=27017 dbpath=/data/mongodb 64-bit host=g***2
2018-02-07T13:19:05.047+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-07T13:21:11.017+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1005 port=27017 dbpath=/data/mongodb 64-bit host=g***2
2018-02-07T13:21:11.017+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-07T13:21:56.083+0000 I STORAGE  [initandlisten] 16 dropped prefixes need compaction
2018-02-07T13:25:33.840+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1218 port=27017 dbpath=/data/mongodb 64-bit host=g***2
2018-02-07T13:25:33.840+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-07T13:25:44.229+0000 I STORAGE  [initandlisten] 2 dropped prefixes need compaction
2018-02-07T13:36:37.351+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1505 port=27017 dbpath=/data/mongodb 64-bit host=g***2
2018-02-07T13:36:37.351+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-07T14:08:22.050+0000 I CONTROL  [initandlisten] MongoDB starting : pid=998 port=27017 dbpath=/data/mongodb 64-bit host=g***2
2018-02-07T14:08:22.050+0000 I CONTROL  [initandlisten] db version v3.4.7-1.8
2018-02-07T14:08:48.290+0000 I STORAGE  [initandlisten] 16 dropped prefixes need compaction

@zoharby In the trace you sent me there is no mutex contention. Instead, all the writes are actually blocked on reads, and all the reads are busy reading bloom filters from disk. From the LOG, it appears that you have set cache_index_and_filter_blocks to true. Did you do it because you didn't have enough memory to keep all index and filter blocks always loaded?

It is known that evicting index and filter blocks causes performance degradation. It seems the RocksDB team added a new option, cache_index_and_filter_blocks_with_high_priority, most likely to address this issue (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/table.h#L72). Can you try setting that to true?

As to why compaction is causing performance degradation, it is most likely due to dirtying of the page cache. Can you try setting use_direct_io_for_flush_and_compaction to true (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L601)? That should prevent compaction from affecting the page cache.
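If it helps, here is a minimal sketch of how those two options might be passed through storage.rocksdb.configString in mongod.conf. This is only a sketch: the option names are real RocksDB options, but whether the RocksDB build bundled with your PSMDB version accepts them in the options string (and whether overriding block_based_table_factory this way preserves the block cache MongoRocks configures itself) should be verified against the db/LOG output on startup. Your existing configString entries are omitted for brevity.

# mongod.conf fragment -- sketch only, verify against db/LOG after restart
storage:
  engine: rocksdb
  rocksdb:
    configString: "use_direct_io_for_flush_and_compaction=true;block_based_table_factory={cache_index_and_filter_blocks=true;cache_index_and_filter_blocks_with_high_priority=true}"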

Sorry about RocksDB's bad behavior in this case. It is not yet very well tuned (by default) for running on machines with a huge amount of disk and a small amount of memory. There are options that can make it better, though :)

Wow, that's a great lead; we didn't even think in that direction! Indeed, we decreased the machine's RAM during this maintenance, thinking that because it's an "archiving" instance with a basically write-only load, we don't really need much RAM. But I guess we crossed some point where cache thrashing halts everything...
Will try to increase it back and see if it solves the problem.

And yeah, I think we turned on cache_index_and_filter_blocks because RAM usage was growing too high as the DB size grew. Again, since this is an insert-only instance 99.9% of the time, we didn't think it would affect performance. I guess the combination of cache_index_and_filter_blocks=true + small RAM + background compaction of "cold" data was enough to halt everything.
We will play around with enabling pin_l0_filter_and_index_blocks_in_cache and use_direct_io_for_flush_and_compaction; that should probably improve the situation.

We increased the instance RAM back, and indeed it solved the slowness.
Before increasing the RAM, we indeed saw the db.serverStatus().rocksdb.counters["block-cache-filter-bytes-insert"] metric growing constantly at an alarming rate.
It seems that, contrary to what's written here, the default cache size for MongoRocks is not 50% but actually slightly less than 30% of the machine's RAM.
When we decreased the instance RAM to 15GB, this resulted in a cache size of only about 4GB, which was simply not enough to hold the required bloom filters plus the written data. This in turn caused severe cache thrashing, which crippled the DB completely.
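For anyone else debugging this, here is roughly the kind of check we ran in the mongo shell (a sketch, not our exact commands; the counter name is as exposed by our PSMDB build):

// Sample the filter-block insert counter twice, one minute apart.
// A large, steady delta means filter blocks keep being evicted and
// re-inserted into the block cache, i.e. the cache is thrashing.
var before = db.serverStatus().rocksdb.counters["block-cache-filter-bytes-insert"];
sleep(60 * 1000);
var after = db.serverStatus().rocksdb.counters["block-cache-filter-bytes-insert"];
print("filter bytes inserted into the block cache per minute: " + (after - before));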

So, our takeaways from this are:

  • DBs need RAM even when they do only writes :-)
  • We now know that if we turn on cache_index_and_filter_blocks, we must keep an eye on block-cache-filter-bytes-insert to ensure our performance is not degraded by this bottleneck.
  • For an archiving-oriented configuration (very large DB, constant insert-only writes, very rare reads), the default cache size of 30% of RAM is far from optimal and should probably be increased to 50-60% (see the config sketch after this list).
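As a concrete sketch of that last point (assuming storage.rocksdb.cacheSizeGB is available in your PSMDB build; the 64GB machine size is just an illustration, not our actual instance):

# mongod.conf fragment: on a 64GB machine, pin the RocksDB block cache
# to ~60% of RAM instead of relying on the ~30% default
storage:
  engine: rocksdb
  rocksdb:
    cacheSizeGB: 38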

Regarding the original issue of dropped-prefix compaction possibly being deferred until a server restart: I now acknowledge this had nothing to do with our performance issues. When we finish migrating the data from the temp instance back to the main one, I will try to play around with our periodic DB snapshots to understand this fully. It is possible that the compactions indeed started right after dropping but simply took a VERY long time (~3 weeks) to finish, and hence when we restarted the server after 2 weeks it still had quite some work to do. I will post an update when I finish investigating this.

It seems that, contrary to what's written here, the default cache size for MongoRocks is not 50% but actually slightly less than 30% of the machine's RAM.

Thanks for noticing this, I made the change.

For an archiving-oriented configuration (very large DB, constant insert-only writes, very rare reads), the default cache size of 30% of RAM is far from optimal and should probably be increased to 50-60%.

That is true, but only if you set cache_index_and_filter_blocks to true.

That is true, but only if you set 'cache_index_and_filter_blocks' to 'true'.

That's right :)

I also see that you have optimize_filters_for_hits=true by default in MongoRocks, which makes perfect sense for the MongoDB use case.

Unfortunately, I don't have a backup allowing me to investigate the issue further.
I also realized we did have a lot of junk collections which we dropped from the instance (we use application-based partitions and some were created on junk timestamps), so it makes sense that we erased 1821 collections/indexes.
I can't really test whether compacting them started straight away and simply didn't finish before the server restart, or whether they might have been deferred until the restart. But anyway, we now fully understand that all the performance issues were completely unrelated to the prefix compactions, so I guess it doesn't really matter how long they take to finish.
Thanks very much @igorcanadi and @igorsol for your great help and responsiveness! We know we're sometimes testing MongoRocks to its extremes, and it's great to know you guys are there when we hit some unexpected behavior.