RazrFalcon/memmap2-rs

Merging changes with memmapix

nyurik opened this issue · 24 comments

nyurik commented

Hi, I noticed that @al8n has made some changes in their fork https://github.com/al8n/memmapix, while this repo has also done some diverging work. This repo clearly has far more downloads/community following, so I wonder if it would be possible to merge some/all of the memmapix changes into this repo, or what your future plans for this repo are?

Both efforts clearly benefit the community and are highly appreciated, thank you all!

P.S. The memmapix repo doesn't have an issues tab for some reason, so I couldn't ask about it there.

I think the main change there is unconditionally swapping out libc for rustix to support Unix-like platforms without going through the system C library, which we discussed here and decided against for now, cf. #89.

I guess we could add a feature for swapping out those two system call layers, e.g.

[dependencies]
libc = { version = "0.2", optional = true }
rustix = { version = "0.35", features = ["fs", "mm", "param"], optional = true }

[features]
default = ["libc"]

but I still have not heard a good motivation for why the additional complexity of maintaining both implementations is required.
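
For concreteness, on the implementation side such a switch would roughly boil down to selecting the backend via cfg, along these lines (untested sketch, the function is made up for illustration; the rustix branch assumes its "mm" feature and that rustix::io::Errno converts into std::io::Error):

// Hypothetical sketch of selecting the system call layer at compile time;
// not code from this crate.
#[cfg(feature = "libc")]
unsafe fn advise_willneed(ptr: *mut std::ffi::c_void, len: usize) -> std::io::Result<()> {
    if libc::madvise(ptr as *mut libc::c_void, len, libc::MADV_WILLNEED) == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}

#[cfg(all(feature = "rustix", not(feature = "libc")))]
unsafe fn advise_willneed(ptr: *mut std::ffi::c_void, len: usize) -> std::io::Result<()> {
    rustix::mm::madvise(ptr, len, rustix::mm::Advice::WillNeed)?;
    Ok(())
}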

nyurik commented

My primary motivation is to have async memmap support for crates like pmtiles, which uses @al8n's fmmap -- a higher-level lib based on memmapix -- so I guess they had to introduce some changes to memmap2 in order to do that(?).

But obviously it would be more resilient if the underlying libs converge -- so they are maintained and looked at by more people.

CC: @lseelenbinder

so I guess they had to introduce some changes to memmap2 in order to do that(?).

Looking at the complete diff against this repository, this does not seem to be the case. The only API change appears to be swapping out our own enum Advice for rustix's enum Advice.

async memmap support

Here my first question would be what that means in practical terms?

The only asynchronous operation I can think of that is supported by the system call interface is flushing/syncing, but that would still not make a good future as there is no notification mechanism for when the page cache is done flushing the changes to disk.

Here my first question would be what that means in practical terms?

For example, the "async/Tokio" support in fmmap basically goes through the impls for Cursor<&[u8]>, e.g. AsyncRead, which means that using this to access file-backed memory maps in an async code base is

  • a rather convoluted way of increasing compile times, as it will end up calling exactly the same code as the sync version, just wrapped in a layer of immediately ready futures
  • and it actually hides that this can still block just as much as the sync version, because accessing the byte slice is what blocks when the process incurs a page fault.

EDIT: Even AsyncMmapFileMutExt::flush_async is not async/does not return a future, which appears consistent as the system call layer lacks a notification mechanism by which an executor could be informed that the flushing is done.
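
To make the point concrete, here is roughly what that pattern boils down to (untested sketch, not fmmap's actual code; assumes tokio with its io-util feature):

// Cursor<&[u8]> implements tokio::io::AsyncRead, but its poll_read simply
// copies out of the slice and returns Poll::Ready, so the .await below never
// actually yields to the executor.
use std::io::Cursor;
use tokio::io::AsyncReadExt;

async fn read_prefix(map: &memmap2::Mmap) -> std::io::Result<Vec<u8>> {
    let mut cursor = Cursor::new(&map[..]);
    let mut buf = vec![0u8; 4096];
    // If the touched pages are not resident, this still blocks the thread on
    // a page fault; the async wrapper does not change that.
    cursor.read_exact(&mut buf).await?;
    Ok(buf)
}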

nyurik commented

Per @lseelenbinder's reply here, any time a blocking call is made in an async environment (e.g. a web server), one should use Tokio's block_in_place:

In general, issuing a blocking call or performing a lot of compute in a future without yielding is problematic, as it may prevent the executor from driving other tasks forward. Calling this function informs the executor that the currently executing task is about to block the thread, so the executor is able to hand off any other tasks it has to a new worker thread before that happens.

So while you are correct that the underlying API is not async, the higher level code would much rather use it in async to avoid performance bottlenecks -- which means the higher level memmap wrapper like fmmap should do all that bookkeeping.

which means the higher level memmap wrapper like fmmap should do all that bookkeeping.

Having a crate which wraps accesses to file-backed memory maps into block_in_place or spawn_blocking seems reasonable, but fmmap does not appear to be that crate. And more importantly, the API of the underlying system call wrapper is not involved in this at all, i.e. the blocking crate could be that crate because no cooperation by the mmap wrapper is required.

As the troubles around the tokio::fs module show, the above strategy (wrapping individual calls using block_in_place/spawn_blocking) is not a recipe for the best possible performance. It is usually preferable to wrap high-level application-specific operations which make multiple blocking system calls to amortize the overhead of these wrappers.

They also have significant downsides: block_in_place will by its nature block the whole task, not just the current future, while spawn_blocking cannot borrow from the stack. So it is often application-specific which one should be used.
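
As an untested sketch of that trade-off (the helper functions are made up for illustration and assume tokio):

use memmap2::Mmap;
use std::sync::Arc;

// block_in_place can borrow `map` from the caller's stack, but no other
// future of the surrounding task makes progress while the closure runs.
fn sum_in_place(map: &Mmap) -> u64 {
    tokio::task::block_in_place(|| map.iter().map(|&b| b as u64).sum())
}

// spawn_blocking moves the work to the blocking thread pool, but the closure
// must be 'static, so the mapping has to be owned, e.g. behind an Arc.
async fn sum_spawned(map: Arc<Mmap>) -> u64 {
    tokio::task::spawn_blocking(move || map.iter().map(|&b| b as u64).sum())
        .await
        .expect("blocking task panicked")
}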

actually hides that this can still block just as much as the sync version, because accessing the byte slice is what blocks when the process incurs a page fault.

Note that this is actually worse than just calling a blocking system call like read on an executor thread: The page fault will block the whole process, not just the one thread. So this is a pretty big foot gun.

That's a good observation @adamreichold—I had assumed fmmap was a sane async layer over a generic mmap, but if not, it's probably worth our effort to add library-level handling of it, since as you observe, having a full process block when accessing a &[u8] is very much a foot-gun.

That's a good observation @adamreichold—I had assumed fmmap was a sane async layer over a generic mmap, but if not, it's probably worth our effort to add library-level handling of it, since as you observe, having a full process block when accessing a &[u8] is very much a foot-gun.

This would not be an automatic conclusion for me personally: Since the underlying system calls do not support notifying an I/O reactor and usage of spawn_blocking/block_in_place is usually better done at the application level instead of trying to shoehorn it into a library, I would say that using the sync interface directly would be my personal preference in this case, c.f. #92 (comment)

My arguments from the last time this was brought up still stand. The benefits of rustix outweigh the costs on almost all fronts. Aside from the more practical benefits (lower instruction counts, less dynamic linking), it also removes the last required C dependency of resvg. This would truly make it a single static executable, at least on Linux.

The main downside of rustix is that it has more frequent updates than libc. As a true rustix patriot I would be happy to keep rustix up to date in this case.

This would truly make it a single static executable, at least on Linux.

Using libc is not blocking this as the same effect can be achieved by statically linking Musl.

nyurik commented

I might be misunderstanding the issue, but it seems adding optional rustix support will allow users to have a C-free binary, which is desirable in some cases, especially when one tries to limit unsafe code

will allow users to have a C-free binary,

While I do see the build-time simplifications of not needing a working C toolchain, this is usually a given on Linux.

To be honest, the aim of being "free of C code" seems a bit religious to me. Easily interfacing with code written in other languages, whether it is via linking, exchanging data via file systems and networks, or just running on a kernel written in C, is a feature, not a bug IMHO.

especially when one tries to limit unsafe code

The unsafe code required to interface with the mmap system call is trivial compared to the safety issues when one actually wants to use a memory map which will always require unsafe code as the relevant invariants are outside of the compiler's purview.
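
As a minimal sketch of what I mean (the helper function is made up for illustration): the unsafe surface needed to talk to the system call is a single line, while the invariant that actually matters lives entirely outside the compiler's purview.

use memmap2::Mmap;
use std::fs::File;

fn first_byte(path: &str) -> std::io::Result<Option<u8>> {
    let file = File::open(path)?;
    // SAFETY: we must ensure that no one truncates or rewrites the file while
    // `map` is alive; the compiler cannot check this for us.
    let map = unsafe { Mmap::map(&file)? };
    Ok(map.first().copied())
}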

nyurik commented

I agree that we should try to stay away from the religious aspect :) My thinking is that if someone needs to use rustix based code, it would be silly to make libc a hard requirement of a small mmap wrapper lib if there is enough interest to add and maintain that addition.

Having a separate, independent rustix-based codebase simply moves the complexity of maintaining a compatible interface to some wrapper or to each lib user - not good for ecosystem health (plus it hinders rustix development because it adds another barrier to entry).

my thinking is that if someone needs to use rustix based code, it would be silly to make libc a hard requirement of a small mmap wrapper lib if there is enough interest to add and maintain that addition.

This is basically the situation in which #89 stalled: Complicating the maintenance of this crate due to internal requirements without further explanation of the rationale behind those requirements is not a convincing proposition.

Which leaves us with

(lower instruction counts, less dynamic linking)1

and

the build-time simplifications of not needing a working C toolchain

as the main tangible benefits.

Weighing these against the increase in complexity for this code base can go either way, but it is not automatic IMHO. Especially since we would most likely want to keep both code paths alive, the one based on libc and the one based on rustix, and hence would not be able to e.g. expose rustix types in our API like memmapix does.

Personally, I am not strictly against adding a rustix code path, but I am somewhat allergic to the amount of zealotry and lack of measurements in the discussions about this, with the whole tangent about fmmap not helping to build confidence on my part.

Footnotes

  1. I feel this needs to be qualified: mmap is decidedly not a light-weight system call as changing a process' address space to establish a mapping is a stop-the-world kind of event which can lead to serious scalability problems. This is why e.g. ripgrep has a heuristic to only use memory maps when there is a chance this cost is amortized while searching large-enough files. And using memory maps to do I/O in a multi-threaded server can become a scalability bottleneck around the kernel's mmap_sem semaphore. So I think the gains in instructions spent on the system call wrapper are most likely insignificant compared to the time spent inside the kernel actually modifying the process' address space.

@lseelenbinder Uff, I just noticed that what we discussed has another implication: While page faults block the whole process, thread-based tools like spawn_blocking/block_in_place do not really accomplish anything as it does not matter which thread triggers the page fault as long as they share an address space. So this further constrains the design available for async wrappers of the mmap system call and makes a simple wrapping of the existing methods rather pointless. I guess the best advice ATM is to either just access the mapping from the executor threads and hope that madvise and read-ahead work out or to use a separate process mapping the same file and pulling it into the page cache.1

Footnotes

  1. The latter option will require communicating the pages to access to that process and waiting until they are available, e.g. using a pipe. I suspect that the necessary system calls make this a rather specialized situation though, as I/O system calls are usually what one would want to avoid by using mmap in the first place. (Of course, the control data is much less than the actual pages themselves, but the per-call overhead remains.)
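
For the first option, a rough sketch using this crate's own Advice API (the helper function is made up for illustration):

use memmap2::{Advice, Mmap};
use std::fs::File;

fn map_with_readahead(file: &File) -> std::io::Result<Mmap> {
    // SAFETY: assumes the file is not truncated or modified while mapped.
    let map = unsafe { Mmap::map(file)? };
    // Ask the kernel to start pulling the pages into the page cache so that
    // later accesses from executor threads are less likely to fault.
    map.advise(Advice::WillNeed)?;
    Ok(map)
}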

I fully agree with @adamreichold here.

First, I definitely do not plan to support rustix. Neither as an optional dependency, nor as the main one. A fork is the best solution here. I will remind you that this crate is a fork as well.
And supporting libc and rustix at the same time is a no-go either. Too much complexity. This crate is already quite complicated for something that should be just a simple function wrapper. But this is the reality. Just look at this crate's issues and commits - it's full of bizarre edge cases. And with rustix we would have to multiply it by 2.
Heck, this crate already has 216 commits. Just think about it.

Second, I personally don't see any benefits in using rustix in terms of this crate. Memory-mapping isn't a bottleneck.
Lack of C dependency? It's not a C dependency, it's a system library with a C API. We're not compiling anything, just linking.
Not to mention that on Windows we still need libc, no?

Third, libc is the way to interact with an OS. Everything else is a hack.
I can guarantee you that the day we switch to rustix there would be a swarm of people complaining that their bizarre targets, like AIX or OS/360, have stopped working. Sure, we do not officially support them right now either, but at least they should work.

PS: I do not use and am not familiar with async, therefore I cannot comment on this aspect, but I don't see how rustix can solve it.

Since the async support in fmmap turned out to be smoke and mirrors, I'd like to suggest we close this in favour of #89 which we keep open to continue tracking possible future usage of rustix.

@lseelenbinder Uff, I just noticed that what we discussed has another implication: While page faults block the whole process, thread-based tools like spawn_blocking/block_in_place do not really accomplish anything as it does not matter which thread triggers the page fault as long as they share an address space. So this further constrains the design available for async wrappers of the mmap system call and makes a simple wrapping of the existing methods rather pointless. I guess the best advice ATM is to either just access the mapping from the executor threads and hope that madvise and read-ahead work out or to use a separate process mapping the same file and pulling it into the page cache.

@adamreichold I searched without success for verification of the "block the whole process" bit—because it surprised me. I personally find it unintuitive, because I'm most familiar with high performance mmaps in Varnish cache, and it does not use multiple processes IIUC. Do you have any handy references on this?

Assuming it's true, we're pretty much dead in the water with regard to mmap ever playing nicely with async in Rust (IMO requiring another process, even if it's hidden in a nice crate implementation, breaks a lot of expectations, removes the advantages of mmap, and leaves us worse off than using normal io::Read and io::Write calls).

If all that is true, I'll drop using mmap entirely from the upstream crate, because it's, as you say, smoke and mirrors, and likely to become a serious bottleneck when we hit the scale we anticipate. Effectively, a properly targeted read_exact() with a seek() is going to scale much better, and still have the page cache to help out.

I searched without success for verification of the "block the whole process" bit—because it surprised me. I personally find it unintuitive, because I'm most familiar with high performance mmaps in Varnish cache, and it does not use multiple processes IIUC. Do you have any handy references on this?

I think https://lwn.net/Articles/932298/ is a nice up-to-date entry point for this particular topic. I guess the situation is also a bit more nuanced than "always block the whole process" but rather "only one file-backed page fault can happen per process", which will only block the whole process when everyone tries to resolve page faults.

The underlying issue is that these address space changes currently rely on the single mmap_sem lock and that the file I/O to pull pages into the page cache still happens while that lock is held. So it should mostly be fine if the page faults are isolated in a single thread (multiple threads like in a thread pool would not help but also not hurt, except for their overhead) and e.g. the executor threads should try hard to avoid triggering page faults for any reason.

Finally, as discussed in the linked article, work is underway to alleviate this issue, but it is just now being implemented.

Personally, if scalable file I/O for an async service is the target, I would still recommend avoiding mmap because the system call does not really integrate with existing I/O reactors, i.e. even if the scalability bottlenecks are removed, the system call layer is still fundamentally blocking. I think the most future-proof alternative for file I/O would probably be to use io_uring via e.g. tokio-uring or glommio.
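
For illustration, the basic shape of an io_uring read via tokio-uring is roughly the following (adapted from memory of its README, so treat the exact signatures as approximate):

fn main() -> Result<(), Box<dyn std::error::Error>> {
    tokio_uring::start(async {
        let file = tokio_uring::fs::File::open("data.bin").await?;
        // The buffer is passed to the kernel by ownership and handed back
        // together with the result once the operation completes.
        let buf = vec![0u8; 4096];
        let (res, buf) = file.read_at(buf, 0).await;
        let n = res?;
        println!("read {} bytes", n);
        Ok(())
    })
}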

Effectively, a properly targeted read_exact() with a seek() is going to scale much better, and still have the page cache to help out.

Note that you can reduce the number of system calls using positioned I/O, e.g. pread via read_exact_at.
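
A minimal sketch of that, using only std (the helper function is just for illustration):

use std::fs::File;
use std::os::unix::fs::FileExt;

fn read_block(file: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    // read_exact_at issues pread(2) under the hood on Unix: no shared cursor,
    // no separate seek() before every read.
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}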

@adamreichold Thank you—I appreciate the time you've taken to dig through these issues with all of us! That makes a lot more sense now—and lock contention is certainly the bane of high performance scalability in my experience.

Note that you can reduce the number of system calls using positioned I/O, e.g. pread via read_exact_at.

TIL. Really helpful.

lock contention is certainly the bane of high performance scalability in my experience.

Note that the status quo is considerably worse than lock contention, i.e. the communication overhead of many threads of execution acquiring and releasing a lock, which is especially problematic when the critical sections are short. For now, the main scalability issue for page faults is doing I/O while holding the lock.

This means not only that the threads fight for the cache line holding the futex and queue up to access it, but also that I/O is serialized, which means queue depths cannot be filled and I/O device utilization stays low. This is especially problematic with modern flash devices, which require a certain queue depth before becoming effective. It is also particularly vexing as a high degree of I/O parallelism is usually a benefit of memory maps, with the kernel issuing multiple readahead requests on its own while the mapping process can continue with its work.