paritytech/parity-db

Optimize memory usage

arkpar opened this issue · 5 comments

Under certain write workloads, when there are many commits each inserting many small values, the memory used by the index overlay can be inflated by a factor of 10 or more compared to the actual data inserted. There's a good example in #93. This is because the overlay keeps the whole index page in memory, even though only a single entry in the page may be modified.
I can think of two ways to fix that:

  1. Don't keep the whole page, but rather only modified entries.

  2. Investigate existing mechanisms in Linux to control page flushing for memory-mapped files. If it is possible to control exactly when individual pages are flushed to disk, we don't need the overlay in the first place.

In any case, this is only worth fixing if the solution does not introduce any performance hits.
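To make the overhead concrete, here is a toy sketch of option 1 versus the current full-page overlay. The 512-byte page size matches the index page size mentioned later in this thread; the 8-byte entry size, the key shapes, and the container choice are illustrative assumptions, not parity-db's actual layout:

```rust
use std::collections::HashMap;

// Illustrative sizes: index pages are 512 bytes (per the discussion below);
// the 8-byte entry size is an assumption for this sketch.
const PAGE_SIZE: usize = 512;
const ENTRY_SIZE: usize = 8;

fn main() {
    // Current approach: one small insertion copies the whole page into the overlay.
    let mut page_overlay: HashMap<u64, [u8; PAGE_SIZE]> = HashMap::new();
    page_overlay.insert(42, [0; PAGE_SIZE]);

    // Option 1: keep only the modified entry, keyed by (page, slot).
    let mut entry_overlay: HashMap<(u64, u8), [u8; ENTRY_SIZE]> = HashMap::new();
    entry_overlay.insert((42, 7), [0; ENTRY_SIZE]);

    // Rough payload sizes, ignoring HashMap bookkeeping overhead.
    let page_bytes = page_overlay.len() * PAGE_SIZE;
    let entry_bytes = entry_overlay.len() * ENTRY_SIZE;
    assert_eq!(page_bytes / entry_bytes, 64);
    println!("full page: {page_bytes} B, single entry: {entry_bytes} B");
}
```

For a commit that touches one entry per page, the per-page overlay payload shrinks by the page/entry ratio (64x under these assumed sizes), at the cost of merging entries back into pages on read and on WAL replay.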

i1i1 commented

Hey! The memmap2 crate, which is already used here, supports flushing memory regions via memmap2::MmapMut::flush_range.

Could you please point me to the overlays that would no longer be needed once we make use of that?

flush_range won't be of much use here, unfortunately. What we actually need is something that prevents flushing until explicitly requested.

This is how the database guarantees consistency at the moment, simplified.

  1. Each commit results in a set of pages that need to be updated in memory-mapped files.
  2. These pages are copied to a heap-allocated memory overlay and modifications are performed there.
  3. The database writes the pages to a write-ahead log and flushes it to disk.
  4. The database then reads the log and applies changes to the memory map.
  5. The heap-allocated memory overlay is released.
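The five steps above can be sketched roughly as follows. The types and method names are stand-ins for illustration, not parity-db's actual API, and fsync/serialization details are elided:

```rust
use std::collections::HashMap;

type PageId = u64;
type Page = Vec<u8>;

struct Db {
    mmap: HashMap<PageId, Page>,    // stands in for the memory-mapped file
    overlay: HashMap<PageId, Page>, // heap-allocated copies of dirty pages
    wal: Vec<(PageId, Page)>,       // stands in for the write-ahead log
}

impl Db {
    // Steps 1-2: copy the affected page to the heap overlay and modify it there.
    fn commit(&mut self, id: PageId, offset: usize, value: u8) {
        if !self.overlay.contains_key(&id) {
            let copy = self.mmap.get(&id).cloned().unwrap_or_else(|| vec![0u8; 512]);
            self.overlay.insert(id, copy);
        }
        self.overlay.get_mut(&id).unwrap()[offset] = value;
    }

    // Step 3: write the dirty pages to the WAL (fsync elided in this sketch).
    fn flush_wal(&mut self) {
        for (id, page) in &self.overlay {
            self.wal.push((*id, page.clone()));
        }
    }

    // Steps 4-5: replay the log into the memory map, then drop the overlay.
    fn apply_wal(&mut self) {
        for (id, page) in self.wal.drain(..) {
            self.mmap.insert(id, page);
        }
        self.overlay.clear();
    }
}

fn main() {
    let mut db = Db { mmap: HashMap::new(), overlay: HashMap::new(), wal: Vec::new() };
    db.commit(1, 0, 0xAB);
    db.flush_wal();
    db.apply_wal();
    assert_eq!(db.mmap[&1][0], 0xAB);
    assert!(db.overlay.is_empty());
    println!("page 1 persisted via WAL replay");
}
```

The key invariant is that the memory map is only touched in step 4, after the WAL is durable, which is exactly why the heap overlay exists in the first place.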

We can't modify the pages in the memory-mapped regions directly because we don't know when the kernel decides to flush them to disk. We need to make sure that the modified pages can't be partially persisted before the WAL is flushed. This guarantees that upon recovery any partially written page sets will be fixed by replaying the WAL.
So here's how it could work ideally:

  1. Each commit results in a set of pages that need to be updated in memory-mapped files.
  2. Pages that are about to be modified are locked in memory, so that they are not flushed to disk by the kernel.
  3. These pages are modified in the memory-mapped address space directly.
  4. The database writes the pages to a write-ahead log and flushes it to disk.
  5. The database then unlocks the pages and allows the kernel to write them to their final location.

Page locking may be achievable with mlock, although the documentation is not 100% clear on whether locking also prevents write-back of dirty memory-mapped pages. So this requires some research and experimentation.
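For that experimentation, mlock/munlock can be probed directly. The sketch below targets Linux and declares the syscall wrappers via raw `extern "C"` rather than the libc crate, just to stay dependency-free; the `_SC_PAGESIZE` constant value is Linux-specific and the overall flow is an assumption for illustration. Note that mlock operates on whole OS pages, which are much larger than the 512-byte index pages:

```rust
use std::os::raw::{c_int, c_void};

extern "C" {
    fn mlock(addr: *const c_void, len: usize) -> c_int;
    fn munlock(addr: *const c_void, len: usize) -> c_int;
    fn sysconf(name: c_int) -> i64;
}

const _SC_PAGESIZE: c_int = 30; // Linux-specific value of _SC_PAGESIZE

fn main() {
    let page_size = unsafe { sysconf(_SC_PAGESIZE) } as usize;

    // mlock ranges are rounded out to page boundaries by the kernel, so a
    // single 512-byte index page still pins at least one full OS page.
    let buf = vec![0u8; page_size];
    let addr = buf.as_ptr() as *const c_void;

    if unsafe { mlock(addr, page_size) } == 0 {
        // ... modify the locked range, flush the WAL, then release it ...
        let unlocked = unsafe { munlock(addr, page_size) } == 0;
        println!("locked and unlocked one {page_size}-byte page: {unlocked}");
    } else {
        // RLIMIT_MEMLOCK may be very low (even 0) in constrained environments.
        println!("mlock failed; RLIMIT_MEMLOCK is probably too low");
    }
}
```

Whether a locked, dirty, file-backed page is exempt from write-back (as opposed to being exempt from swap-out) is precisely the open question above, so the experiment would need to observe the backing file, not just the return codes.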

i1i1 commented

Thanks for the digest. Makes sense.

mlock(2) states that:

mlock() locks pages in the address range starting at addr and
continuing for len bytes. All pages that contain a part of the
specified address range are guaranteed to be resident in RAM when
the call returns successfully; the pages are guaranteed to stay
in RAM until later unlocked.

munlock() unlocks pages in the address range starting at addr and
continuing for len bytes. After this call, all pages that
contain a part of the specified memory range can be moved to
external swap space again by the kernel.

I think that means there is a guarantee that pages won't be flushed to disk until munlock is called.

But I don't know whether VirtualLock is an equivalent of mlock on Windows.

i1i1 commented

@arkpar What do you think of that?

Indeed, this should be the case. The man page also has this section:

Memory locking has two main applications: real-time algorithms
and high-security data processing. ... Cryptographic security software
often handles critical bytes like passwords or secret keys as
data structures. As a result of paging, these secrets could be
transferred onto a persistent swap store medium, where they might
be accessible to the enemy long after the security software has
erased the secrets in RAM and terminated.

So locked pages should not be copied to disk.

VirtualLock has the same guarantee:

Pages that a process has locked remain in physical memory until the process unlocks them or terminates. These pages are guaranteed not to be written to the pagefile while they are locked.

There are a few issues that still need clarification:

  • Locked page limit. The default limit on the number of locked pages seems to be rather small. It should be possible to raise the limit on startup with setrlimit or SetProcessWorkingSetSize, but it is not clear what the maximum would be on a typical Linux/Mac/Windows system.
  • Page size alignment. Different systems may be configured with different page sizes. The database index currently operates on 512-byte pages. If the kernel is configured to use huge pages, this may become an issue, although I imagine systems using that option usually have plenty of RAM and are fine with using more of it.
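On Linux, the first point can be checked at startup: getrlimit/setrlimit with RLIMIT_MEMLOCK report the current soft and hard locked-memory limits, and an unprivileged process can raise the soft limit up to the hard limit (going beyond that needs CAP_SYS_RESOURCE). A minimal probe, again using raw extern declarations and Linux-specific constants as stated assumptions:

```rust
use std::os::raw::c_int;

// Linux x86_64 layout of struct rlimit (rlim_t is an unsigned long).
#[repr(C)]
struct Rlimit {
    rlim_cur: u64, // soft limit, in bytes for RLIMIT_MEMLOCK
    rlim_max: u64, // hard limit
}

extern "C" {
    fn getrlimit(resource: c_int, rlim: *mut Rlimit) -> c_int;
    fn setrlimit(resource: c_int, rlim: *const Rlimit) -> c_int;
}

const RLIMIT_MEMLOCK: c_int = 8; // Linux-specific resource number

fn main() {
    let mut lim = Rlimit { rlim_cur: 0, rlim_max: 0 };
    assert_eq!(unsafe { getrlimit(RLIMIT_MEMLOCK, &mut lim) }, 0);
    println!("RLIMIT_MEMLOCK: soft {} / hard {}", lim.rlim_cur, lim.rlim_max);

    // Raising the soft limit to the hard limit never requires privileges.
    let raised = Rlimit { rlim_cur: lim.rlim_max, rlim_max: lim.rlim_max };
    let ok = unsafe { setrlimit(RLIMIT_MEMLOCK, &raised) } == 0;
    println!("raised soft limit to hard limit: {ok}");
}
```

The soft limit is what mlock is checked against, so a database could raise it to the hard limit at startup and fall back to the overlay approach when the remaining budget is too small for a commit's page set.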