cloudius-systems/osv

Lazily allocate thread stacks

nyh opened this issue · 9 comments

nyh commented

We currently allocate for each pthread, by default, a 1 MB stack. This allocation is done with mmap with mmu::mmap_populate, i.e., we force the entire allocation to be done up-front. This is a terrible waste: thread stacks are an excellent candidate for lazy allocation - they cannot use huge-pages anyway (they are less than 2MB, and have a guard page), and they often use much less memory than the full 1MB allotment.

Unfortunately, OSv crashes with lazy population of thread stacks, e.g., in tst-pipe.so, as explained in commit 41efdc1 which made the allocation immediate (with mmu::mmap_populate). The problem is that with lazy allocation of stacks, any code can cause us to need to allocate one more page for the stack, and cause a page fault, which may sleep (waiting on a mutex for memory allocation, for example). The crash happens when this happens in some OSv "kernel" code which has preemption disabled or irq disabled but runs on the user's stack (e.g., RCU lock, wait_until, the scheduler, or numerous other places), and a sleep is not allowed.

On the OSv mailing list, I proposed the following workaround: In preempt_disable() and irq_disable(), before we disable preemption, read a byte 4096 bytes ahead of the current stack top. Usually, this will add a tiny performance penalty (a single read, not even an if()), but if the next page of the stack is not yet allocated, it will cause the page fault to be taken here and now, before we disable preemption.
We also need to read a byte from the stack in thread::init_stack(), because new threads start with preemption disabled so we need them to start with a minimal part of the stack allocated.

If we want more than 4096 bytes of available stack to be guaranteed, we can read several bytes at 4096-byte stride, but this is slower than reading a single byte. Instead, we can consider doing this: read just one byte a few pages from the current stack position, and modify the page-fault handler to allocate more than one adjacent page on each page fault.
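The probe described above could look roughly like this. This is a minimal sketch with hypothetical names (pre_fault_next_stack_page, preempt_disable_with_probe are illustrations, not OSv's actual API):

```cpp
#include <cstddef>

// Hypothetical sketch of the proposed pre-fault probe. Stacks grow
// downward, so the page that would fault next lies below the current
// stack pointer; a local variable's address approximates that pointer.
inline void pre_fault_next_stack_page()
{
    char probe;
    volatile char* below = &probe - 4096;  // one page ahead of the stack top
    (void)*below;  // a single read: if the page is unmapped, take the
                   // page fault *now*, while sleeping is still allowed
}

inline void preempt_disable_with_probe()
{
    pre_fault_next_stack_page();  // fault here, not with preemption off
    // ... the real preempt_disable() bookkeeping would follow here ...
}
```

The point of the sketch is the ordering: the read happens strictly before preemption is disabled, so the fault handler may still sleep.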

On the mailing list, Avi Kivity suggested other options to solve our lazily-allocated stack problem:

  • insert "thunk code" between user and kernel code that switches the stacks to known resident stacks. We could abuse the elf linker code to do that for us, at run time.
  • use -fsplit-stack to allow a dynamically allocated, discontiguous stack on physical memory

nyh commented

Just as an example of the kind of savings I am hoping to get from this:

Right now, in even a small application ("make image=rogue", for example) I see 180 threads.
Threads have variable stack sizes (some are 4K, pthreads are 1MB by default), but the vast majority of these 180 threads are ZFS threads which have a 65 KB stack.
If half of this stack is unused (just a ballpark guesstimate...), we have 32 KB wasted for each of the 180 threads. That's almost 6MB, which is 13% of the memory used by this tiny application (I can run it on a 46 MB guest).

The memory saving will be much larger in applications which use many pthreads, e.g., heavily threaded Java applications.

nyh commented

In this mailing-list thread, https://groups.google.com/d/msg/osv-dev/LqUAWjzVpSc/Yf4bTdi0AwAJ, Roberto and Waldek noticed that a Java program with 400 user threads uses over half a gigabyte just for these threads' stack, and on Linux this doesn't happen. It happens because of this issue (Java's default stack size is 1MB).

This seems like quite an important issue to fix from a memory utilization standpoint. I wonder what the pros and cons of the 3 options are:

  • single-byte-read pre-fault of the next stack page
    • one obvious downside is that if an application wants to use more than a single page (or whatever buffer we want to keep in advance) this will not work
  • "thunk code"
    • how would we implement it?
  • -fsplit-stack

nyh commented

Hi @wkozaczuk, my answers and recommendations:

In the single-byte-read page fault the issue is not the application wanting to use more than a single page, but the kernel. It's not even all the kernel code - it's only the preemption-disabled parts of the kernel which will be limited to 4096 bytes of stack. I'm hoping that most of the kernel code does not have preemption disabled, so maybe 4096 bytes will be enough, but I really don't know.

I also suggested how we can read more than one byte at 4096-byte intervals to pre-fault more than one page (at some performance cost), but I think the best option is perhaps the one I already mentioned - bringing in several adjacent pages on a single page fault. In other words, the code reads just one byte 4096 bytes back (stack grows downward...), but if this one read gets a page fault, the page fault handler will allocate not only this one page, but also the previous few pages, if they are still part of the same mapped range. Or, since this feature is very stack-specific we can do this only for stacks: currently, libc_flags_to_mmap() converts MAP_STACK to "mmu::mmap_populate". It should convert it to a new "mmu::mmap_stack", and the page fault handler will only do the "populate a few pages below the faulting one" trick when it sees a mapping with the mmu::mmap_stack parameter. Other OSv stack-allocation code which today uses mmu::mmap_populate (see commit 41efdc1) will also need to use the new mmu::mmap_stack instead.
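The mmap_stack idea could be sketched like this. Everything here (the constants, the simplified vma type, populate_page) is an illustrative stand-in, not OSv's real code; the real handler works on actual vma objects and page tables:

```cpp
#include <cstdint>
#include <vector>

// Illustrative constants; the real pre-fault count is a tuning choice.
constexpr uintptr_t page_size = 4096;
constexpr int stack_prefault_pages = 4;   // assumption: pages populated per fault

enum mmap_flags : unsigned {
    mmap_populate = 1u << 0,
    mmap_stack    = 1u << 1,              // proposed new flag for stack mappings
};

struct vma {                              // simplified mapping descriptor
    uintptr_t start;                      // lowest address inside the mapping
    unsigned flags;
    bool contains(uintptr_t addr) const { return addr >= start; }
};

std::vector<uintptr_t> populated;         // records pages a real handler would map
void populate_page(uintptr_t page) { populated.push_back(page); }

// On a fault in an mmap_stack mapping, populate the faulting page plus a
// few pages below it (the direction of stack growth), never stepping
// outside the mapping - so the guard page stays untouched.
void handle_fault(const vma& v, uintptr_t fault_addr)
{
    uintptr_t page = fault_addr & ~(page_size - 1);
    int n = (v.flags & mmap_stack) ? stack_prefault_pages : 1;
    for (int i = 0; i < n; i++) {
        uintptr_t p = page - i * page_size;
        if (!v.contains(p)) {
            break;                        // reached the edge of the mapping
        }
        populate_page(p);
    }
}
```

With this shape, a fault triggered by the single-byte probe pre-populates several pages of headroom at once, so preemption-disabled code gets more than one page of guaranteed stack for the price of one read.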

About thunk code - I think it's complicated, and we should only look into it if we're out of other options.
About "-fsplit-stack" - I have no idea how it works, and if it does work on OSv. But even if it did, how would Java use it? I thought Java itself determined that we allocate 1MB stacks for its threads, and doesn't use "split stack" features.

@nyh So it looks like your proposal is probably the best way to go. As I understand it, it would not only apply to application threads (where the app and OSv share the same stack anyway) but also to pure kernel threads (like the ZFS ones), correct? In theory the same solution could be applied to the syscall stack, which we malloc on first syscall execution.

nyh commented

@wkozaczuk this issue is relevant to stacks allocated by mmap(), not by malloc(). All the Posix threads used by applications use mmap, because our `pthread::allocate_stack()` uses `mmu::map_anon(nullptr, size, mmu::mmap_populate, ...)`, which is also an mmap (and, as I said above, explicitly uses mmap_populate for immediate population - it shouldn't; perhaps it should use a new mmu::mmap_stack).

Most other internal threads in OSv have smaller stacks which are allocated by malloc(), not by mmap(). See init_stack() in arch/x64/arch-switch.hh. These always allocate immediately and aren't lazily populated, and the thinking was that this is fine for the small stacks (usually 64 KB) we use there. So nothing needs to change there, I think.

Some takeaways from discussing the ongoing implementation effort - https://groups.google.com/d/msg/osv-dev/oaB5CHThc7M/2JCIvzgpAQAJ:

  1. We need to handle the nesting scenario where preempt_disable might be called after irq_disable was already called, and vice versa. This means we cannot blindly read a byte from the page ahead, as the stack may have already advanced (gotten deeper) and -4096(%rsp) could lead to another page fault which we cannot handle. This can be prevented by introducing a separate stack_page_read_counter that we would increment and decrement in the right places around disabling/enabling preemption/interrupts, and trigger the page read only if the counter is 1 (look for a WIP patch here).

  2. All the increments and decrements of the stack_page_read_counter should be symmetrical. Decrementing it in wait_for_interrupts() does not seem to be, but if I remember correctly, when I was trying to get it all to work I had to put it there, maybe because we start with 1. But I am no longer convinced it is necessary. We need to understand it better.

  3. Ideally, we should not even be trying to read the next page on kernel threads - or, better, on any thread whose stack is not mapped with mmap_stack. That can be accomplished by initializing the stack_page_read_counter to some value higher than 1 (10 or 11) so it never reaches the value that triggers the page read. This could possibly be set up in the thread_main_c method (see arch/x64/arch-switch.hh): it receives the thread pointer, so we should somehow be able to determine how its stack was constructed. If not, we can add a boolean flag to the thread class.

  4. In theory, we could get by without stack_page_read_counter by just checking sched::preemptable() AND somehow reading the flags register (PUSHF?) to directly check whether the interrupt flag is set in the read_stack_page method. But I have a feeling it would be more expensive.

  5. Using the counter may not be that expensive given we already use a similar mechanism to implement preemptable().
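Takeaways 1 and 3 above can be sketched together as follows. The names follow the mailing-list discussion, but the code is an illustration under assumed semantics (probe on the outermost nesting level), not the actual WIP patch; in real code the counter would be per-thread:

```cpp
// Counter guarding the stack probe. 0 means "safe to probe"; per-thread
// in real code, a plain global here for illustration.
unsigned stack_page_read_counter = 0;

void read_stack_page()
{
    char probe;
    volatile char* below = &probe - 4096;
    (void)*below;                       // pre-fault the next stack page
}

void preempt_disable()
{
    // Probe only at the outermost nesting level: once preemption or
    // interrupts are already off, rsp may have moved past the page we
    // probed earlier, and a nested fault could not be handled.
    if (++stack_page_read_counter == 1) {
        read_stack_page();
    }
    // ... actual preemption-disable bookkeeping ...
}

void preempt_enable()
{
    // ... actual preemption-enable bookkeeping ...
    --stack_page_read_counter;
}

// Takeaway 3: for threads whose stack is fully populated (e.g. kernel
// threads), start the counter high so it never equals 1 and the probe
// is never taken.
void disable_stack_probe() { stack_page_read_counter = 10; }
```

Symmetry of the increments and decrements (takeaway 2) is what keeps the counter meaningful: any unmatched decrement could make it hit the trigger value inside a nested, non-preemptable section.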

@nyh Do you have any comments about these most recent takeaways and a solution proposed on the mailing list?

This has been implemented by this series of commits - https://github.com/search?q=repo%3Acloudius-systems%2Fosv+%22lazy+stack%22&type=commits. Even though lazy stack support is not enabled by default (see the lazy stack flag in conf/base.mk) because we deem it experimental, I suggest we close this issue as implemented.