[RFC] Memory Cleaning Threads

jessehui opened this issue

  • Feature Name: mem_cleaning_thread
  • Start Date: 2022.2.10

Summary

Use a number of kernel threads to clean munmap-ed pages in an asynchronous way.

Motivation

Because there is no page table for applications running inside Occlum, when the user mmaps a range of memory, the pages backing that range are actually committed and require clean-up even if the user never touches the memory.
Some applications mmap a big range of memory, do nothing with it, and then just unmap it. In this case the clean-up consumes a lot of time and can potentially cause the application to misbehave.
This design tries to reduce the time spent in munmap to minimize this gap.

Guide-level explanation

This feature is transparent to end-users.

Reference-level explanation

  1. Define the clean request.
    A clean request describes the cleaning work for a range of memory. A request is generated when a range is munmap-ed, and a clean worker then responds to it by doing the actual cleaning (i.e. zeroing the range). When the cleaning is done, the worker puts the range back to the free space manager for further allocation. (The request type, queue, and dispatch rules are sketched after this list.)

  2. Define the structure to handle the clean requests
    Multiple threads from multiple processes can munmap different ranges concurrently, and at most half of the vCPU number of threads do the cleaning. Thus, I choose the MPMC queue provided by flume.

  3. Define the rules for the clean requests

  • If the clean request < CLEANING_TASK_THRESHOLD, clean by the requesting thread itself.
    Only if the clean request is greater than CLEANING_TASK_THRESHOLD is it handed to a clean thread; otherwise the current thread does the cleaning on its own, because cleaning a small range is fast and sending it to another thread would add more overhead to the entire system.
  • If CLEANING_TASK_THRESHOLD < request < REQ_MAX_SIZE, a clean thread does the cleaning as a single request (one thread handles the whole range).
  • If REQ_MAX_SIZE < request, split the request into multiple smaller requests of REQ_MAX_SIZE so that several clean threads can share the cleaning.
  • If there are so many clean requests in the queue that the high watermark is exceeded, don't send new clean requests; the requesting thread cleans by itself.
  4. Define the behavior of the clean threads (see the worker-loop sketch after this list)
    At the init stage, create half of the vCPU number of worker threads. If a worker thread cannot receive a request, it waits; once a request is sent by a requesting thread, one worker is woken up.
    The clean workers run at low priority. If there are many user tasks pending, the clean workers should stop cleaning.

  5. What if the free space is not enough
    For mmap, if the desired memory is not available while there are requests in the queue, the allocating thread becomes a worker thread and helps respond to the requests in the queue until the queue is empty. Then it tries the allocation again, and if it still fails, it returns with the error number. (A sketch of this fallback follows the list.)
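
The sketch below shows how steps 1-3 could look in Rust: a clean request is just a byte range, the MPMC queue comes from flume, and the dispatch rules decide whether the requesting thread cleans in place or enqueues one or more requests. The threshold values and the zero_range()/return_to_free_space() helpers are illustrative placeholders, not the actual NGO code.

use std::ops::Range;

// A clean request: a munmap-ed range that still has to be zeroed before it
// can go back to the free space manager.
#[derive(Debug)]
struct CleanReq {
    range: Range<usize>, // [start, end) in bytes
}

// Illustrative values; the real thresholds would be tuned empirically.
const CLEANING_TASK_THRESHOLD: usize = 64 * 1024;
const REQ_MAX_SIZE: usize = 8 * 1024 * 1024;
const QUEUE_HIGH_WATERMARK: usize = 1024;

// The MPMC queue shared by all requesting threads and all clean workers.
fn new_clean_queue() -> (flume::Sender<CleanReq>, flume::Receiver<CleanReq>) {
    flume::unbounded()
}

// Decide how a munmap-ed range is cleaned, following the rules above.
fn dispatch_clean(range: Range<usize>, queue: &flume::Sender<CleanReq>) {
    let len = range.end - range.start;
    // Small ranges, or a backed-up queue: clean by the requesting thread itself.
    if len < CLEANING_TASK_THRESHOLD || queue.len() > QUEUE_HIGH_WATERMARK {
        zero_range(&range);
        return_to_free_space(range);
        return;
    }
    // Big ranges are split into REQ_MAX_SIZE chunks so several workers can
    // share the work; a mid-sized range naturally yields exactly one request.
    let mut start = range.start;
    while start < range.end {
        let end = range.end.min(start + REQ_MAX_SIZE);
        queue
            .send(CleanReq { range: start..end })
            .expect("clean workers have exited");
        start = end;
    }
}

fn zero_range(_range: &Range<usize>) { /* memset(0) over the committed pages */ }
fn return_to_free_space(_range: Range<usize>) { /* hand back to the free space manager */ }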
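
A minimal sketch of the worker side described in step 4. The CleanReq type is repeated from the previous sketch, and has_pending_user_tasks() is a hypothetical scheduler hook standing in for the low-priority policy.

use std::ops::Range;

#[derive(Debug)]
struct CleanReq {
    range: Range<usize>,
}

// At the init stage, spawn half of the vCPU number of worker threads.
fn spawn_clean_workers(rx: flume::Receiver<CleanReq>, vcpu_num: usize) {
    for _ in 0..(vcpu_num / 2).max(1) {
        let rx = rx.clone();
        std::thread::spawn(move || clean_worker_loop(rx));
    }
}

fn clean_worker_loop(rx: flume::Receiver<CleanReq>) {
    // recv() blocks while the queue is empty, so an idle worker simply waits;
    // a requesting thread sending a request wakes one worker up.
    while let Ok(req) = rx.recv() {
        // Low priority: if user tasks are pending, yield instead of cleaning.
        while has_pending_user_tasks() {
            std::thread::yield_now();
        }
        zero_range(&req.range);
        return_to_free_space(req.range);
    }
    // recv() only fails when every sender is gone, i.e. at teardown.
}

fn has_pending_user_tasks() -> bool { false } // hypothetical scheduler hook
fn zero_range(_range: &Range<usize>) {}
fn return_to_free_space(_range: Range<usize>) {}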
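
And a rough sketch of the fallback in step 5: when the free space manager cannot satisfy an allocation, the allocating thread drains the queue itself and then retries once. alloc_from_free_space() and the errno handling are illustrative, not the real VMM API.

use std::ops::Range;

#[derive(Debug)]
struct CleanReq {
    range: Range<usize>,
}

const ENOMEM: i32 = 12; // "out of memory" errno returned to the user

fn mmap_alloc(size: usize, rx: &flume::Receiver<CleanReq>) -> Result<usize, i32> {
    if let Some(addr) = alloc_from_free_space(size) {
        return Ok(addr);
    }
    // Not enough free space: become a helper worker and respond to pending
    // requests, each of which returns a cleaned range to the free space manager.
    while let Ok(req) = rx.try_recv() {
        zero_range(&req.range);
        return_to_free_space(req.range);
    }
    // Retry once the queue is empty; give up with the error number if it still fails.
    alloc_from_free_space(size).ok_or(ENOMEM)
}

fn alloc_from_free_space(_size: usize) -> Option<usize> { None } // placeholder
fn zero_range(_range: &Range<usize>) {}
fn return_to_free_space(_range: Range<usize>) {}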

Drawbacks

There is one limitation. The mmap syscall is defined like this:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

When the user calls mmap with a specified addr but without MAP_FIXED in flags, the user is hinting that addr is a good place: it would be better for the kernel to allocate memory from that address, but the kernel is not forced to.

For example, the user has just munmap-ed a range and can assume this range is now free. Later, the user wants to mmap again and would like to get memory starting at that same address.

If no other threads are doing mmap, this operation normally succeeds. With this feature, however, even after the user munmaps a range, a later mmap may not get the same range back, because that range may still be being cleaned. The cleaning time depends on the size of the range; only after the range is cleaned can the user mmap from that address again.

This should be fixed.

UPDATE:
This limitation has been removed by tracking the ranges each worker is currently cleaning. When mmap with an address is called, the calling thread becomes a helper worker and cleans all pending requests. Then it checks the ranges being cleaned by other workers: if the desired range overlaps any of them, it loops and waits; if not, it returns and starts the allocation. (A sketch of the overlap check follows.)
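
A sketch of that overlap check, assuming the set of in-flight ranges is shared behind a mutex; the names are illustrative.

use std::ops::Range;
use std::sync::{Arc, Mutex};

// Ranges currently being cleaned, one entry per in-flight request, shared
// between the clean workers and the allocating threads.
type InFlightRanges = Arc<Mutex<Vec<Range<usize>>>>;

fn overlaps(a: &Range<usize>, b: &Range<usize>) -> bool {
    a.start < b.end && b.start < a.end
}

// Called on mmap with a hint address, after the calling thread has helped
// drain the pending queue: loop and wait while any worker is still cleaning
// a range that overlaps the desired one.
fn wait_until_range_clean(desired: &Range<usize>, in_flight: &InFlightRanges) {
    loop {
        let busy = in_flight.lock().unwrap().iter().any(|r| overlaps(r, desired));
        if !busy {
            return; // safe to try allocating from the hinted address now
        }
        std::thread::yield_now();
    }
}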

Alternatives

Instead of the MPMC queue, track all the dirty ranges in address order with a btree. The btree is chosen because it is provided by the Rust standard library and has relatively good performance for inserting, removing, and finding. Each request also carries a status indicating whether it is dirty, being cleaned, or already clean.

When a range is munmap-ed, turn it into a clean request, mark its status as dirty, and put it into the btree. The clean workers then iterate the btree to find dirty ranges and do the cleaning. Once a range is clean, put it back to the free list and merge it into bigger free ranges.

This method handles mmap with an address better, but the overall performance is even poorer than not doing this at all. The likely reason is that the clean worker has to iterate the whole btree every time, and when the btree is big this is very time-consuming. If instead a request is removed from the btree as soon as it is cleaned, the btree lock has to be released and re-taken very often, and multiple workers end up waiting for the lock.
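
A rough sketch of this alternative's bookkeeping; the type and status names are illustrative and not taken from the PoC branch.

use std::collections::BTreeMap;
use std::ops::Range;
use std::sync::Mutex;

#[derive(PartialEq)]
enum CleanStatus {
    Dirty,        // munmap-ed, not cleaned yet
    BeingCleaned, // a worker is zeroing it right now
    Clean,        // ready to be merged back into the free list
}

struct DirtyTree {
    // One tree ordered by start address; every worker contends on this lock.
    inner: Mutex<BTreeMap<usize, (Range<usize>, CleanStatus)>>,
}

impl DirtyTree {
    fn new() -> Self {
        Self { inner: Mutex::new(BTreeMap::new()) }
    }

    // munmap path: record the range as dirty.
    fn submit_dirty(&self, range: Range<usize>) {
        self.inner.lock().unwrap().insert(range.start, (range, CleanStatus::Dirty));
    }

    // Worker path: scan the tree for the next dirty range and mark it as being
    // cleaned. This full iteration under the lock is the bottleneck described above.
    fn pick_dirty(&self) -> Option<Range<usize>> {
        let mut tree = self.inner.lock().unwrap();
        for (_, (range, status)) in tree.iter_mut() {
            if *status == CleanStatus::Dirty {
                *status = CleanStatus::BeingCleaned;
                return Some(range.clone());
            }
        }
        None
    }
}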

The PoC implementation can be found here: https://github.com/jessehui/ngo/tree/dev_cleaning_thread_btree

Future Possibilities

With EDMM and trusted page faults, we can achieve more accurate page cleaning.
All allocated VM ranges are set to PROT_NONE while the kernel records the requested protection bits. On the first read/write, a page fault is triggered; in the page fault handler we can restore the recorded protection bits and also mark the range/page as dirty.
When the range is munmap-ed, we only need to clean it if it is marked dirty.
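
A speculative sketch of what that page fault path could look like; the Vma fields and the handler hook are hypothetical, since this depends on future EDMM support.

use std::ops::Range;

// A VMA-like record: the pages start as PROT_NONE while the kernel keeps the
// protection bits the user asked for.
struct Vma {
    range: Range<usize>,
    recorded_prot: u32, // e.g. PROT_READ | PROT_WRITE as requested at mmap time
    dirty: bool,        // set on first access; clean on munmap only if true
}

// With EDMM and trusted page faults, the first read/write of a PROT_NONE page
// traps here: restore the recorded protection and mark the range dirty.
fn handle_page_fault(vma: &mut Vma, fault_addr: usize) {
    assert!(vma.range.contains(&fault_addr));
    set_protection(&vma.range, vma.recorded_prot);
    vma.dirty = true;
}

// On munmap, only dirty ranges need the expensive zeroing.
fn needs_cleaning(vma: &Vma) -> bool {
    vma.dirty
}

fn set_protection(_range: &Range<usize>, _prot: u32) {
    // would call the EDMM/mprotect primitives in the real kernel
}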

Status

Under review: #212