/rust-os

Creating a bare metal OS in Rust.

Primary LanguageRust

Rust OS

Inspired by Writing an OS in Rust and https://github.com/mrgian/felix.

Features

  • Preemptive multi-tasking
  • Userspace (with syscalls!)
  • Higher half kernel with per-task page tables
  • ELF parsing/execution
  • Symmetric multi-processing (multiple CPUs)
  • Per-cpu variables (similar to Linux using gs register and special linker area)
  • PCI: discovery, registration, MSI-X
  • VirtIO: rng device, block device, generic queues, per-driver feature negotiation
  • Framebuffer support with simple font
  • Virtual Filesystem Layer, including physical buffer caching and on demand flushing back to disk
  • ext2 filesystem
  • sysfs-style virtual filesystem
  • "Platform" drivers for HPET, IOAPIC, LAPIC, ACPI
  • Kernel shell (with colors!)
  • Paging
  • Bitmap-based physical memory allocator
  • Safe synchronization primitives (spinlock that can disable preemption and interrupts, mutex with sleeping, atomics, various "cell" types that act as safe channels and queues)
  • Custom "register" and "register array" primitives for volatile reads/writes to specific memory locations
  • Custom test system where tests are annotated with #[kernel_test], compiled into special linker area, and then run by iterating over that area
  • Logging
  • Dynamic interrupt registration
  • Stack traces using ELF symbols and addresses
  • Keyboard support
  • Serial console support

demo

Running in QEMU

Default debug mode:

$ make run

Release mode:

$ make run RUST_BUILD_MODE=release

UEFI disabled (use BIOS):

$ make run UEFI=off

Add VGA graphics:

$ make run GRAPHICS=on

Provide command line arguments:

$ make run CMDLINE='hello world'

Run a bunch of shell commands separated by ;. Here is a nice "bells and whistles" command that mounts the test ext2 volume, writes "hello!" to the framebuffer, executes 20 instances of userspace/primes, runs all tests, and does this with UEFI disabled and kernel code compiled with release mode.

$ make run CMDLINE='mount 2; write-framebuffer hello!; exec 20 /bin/primes 15000; test' UEFI=off RUST_BUILD_MODE=release GRAPHICS=on

QEMU interaction

I tend to prefer -nographic mode (GRAPHICS=off in the Makefile, the default) because it is easier to interact with the QEMU monitor and there isn't a distracting window. Once I have actual graphics besides hello world VGA text I'll prefer the graphical window.

Keybindings:

Debugging kernel with GDB

In one terminal:

make run-debug

In another terminal:

make gdb

Make sure to read resource section below on using GDB with QEMU! In particular, if you are running with accel=kvm, use hbreak instead of break to set a breakpoint before the kernel starts and has page tables set up.

Debugging userspace with gdb

You can use the add-symbol-file command in GDB to add userspace programs to the symbol table, so when you are stepping through that code GDB knows how to map instructions to source code. For example, in GDB you can do add-symbol file userspace/hello/hello to load the debugging info for the hello binary. Then you can do b hello.asm:6 and GDB will know where to stop when you run exec /bin/hello in the shell.

Debugging QEMU with GDB

If you want to debug QEMU itself with GDB, you can run:

make run RUN_QEMU_GDB=yes

This can be very useful if you want to figure out why QEMU is doing something funky. (I originally created this to debug why QEMU reported Invalid write at addr 0xFEE00000, size 4, region '(null)', reason: rejected whenever MSI-X tried to write an interrupt to 0xFEE00000 only in legacy boot mode, but not in UEFI mode.)

Tests

Note that we don't use a Cargo workspace (I turned it off because LSP/Emacs didn't seem to work well under it, and it isn't clear how cargo configs should work across workspaces, e.g. rust-lang/cargo#7004 and https://rustwiki.org/en/cargo/reference/config.html). That means cargo test doesn't work at the top level. Also, we don't use Rust's testing library for kernel code. Instead we do more integration-style tests.

make test

TODO

  • VFS read/write code:
    • Change inode write API to be similar to read (based on blocks)
    • Nuke old block iteration code in ext2 now that write and read don't need it
    • Test that reading indirect block works (can this be made an integration test? make a huge file in the test ext2 volume and assert some value at a deep offset. Have a TODO to do a write than a read as part of the integration test)
    • Clean up and refactor new read/write code and write tests. In particular, I don't like all of the inline byte/offset <-> block math. Put that in some tested pure functions.
  • Tests: Add thorough unit test suite we can trigger with shell command.
    • Consider combining all crates into kernel again now that we support tests
      • Make sure the bitmap-alloc proptest tests are still useful! Force a few failures. I'm a bit worried that proptest w/ no_std and panic == abort isn't useful
    • Have a way to run tests on boot and return the QEMU exit code with the result. Just short circuit to running tests instead of the shell.
    • Integration tests:
      • Spawn a bunch of processes and hope we don't crash?
      • maybe some expected failures to ensure we call panic handler?
  • GSBase for usermode: we need to make sure we swap the usermode GS segment base when doing task switching
  • Memory management
    • Consider locking global physical page allocator inside page table functions so we don't need to pass in &mut PhysicalMemoryAllocator, and then most of mapping.rs can go away.
      • Then again, the explicitness is nice. Maybe make helpers to take both a page table lock and an allocator lock?
    • Page type improvements
      • Make typed page sizes like the x86_64 crate does
    • Add support for huge pages in map_to, and then use them for both the heap and mapping physical memory
    • Map all physical memory starting at 0xffff_8000_0000_0000. Limine just does 4 GiB, but make sure to do it all.
      • Keep the first page at 0x0 unmapped though!
    • Free all levels of page tables that are no longer in use (including the top level!) like when all entries are no longer in use. This must be done recursively up the tree. This might be easier with a more general struct per page with reference counts, like Linux's struct page.
      • A less general and simpler problem is freeing an entire page table, like when a process exits. When we do this though we have to ensure we don't free the kernel pages, just the user process pages!
    • Make our own PhysAddr and don't allow it to be converted to a pointer via as_ptr() (the x86_64 one doesn't have this btw)
    • Consider removing as_u64 for all address types, because it makes mistakes too easy.
    • Guard pages: consider using one of the special OS-available bits on pages for GUARD_PAGE, in case that could simplify our guard page detection logic in the page fault handler. Using these OS-available bits in general to identify the type of page is probably going to be useful.
      • This will require not simply "unmapping" a page for the guard page, but to add some sort of "unmap with flags", or mapping "to" physical address 0 with flags (this is what we used to do)
    • Linux prefers to use physical allocation in the kernel by default (kmalloc) because it is faster than virtual allocation (vmalloc) because vmalloc needs to mess with page tables. Vmalloc is only used when you need a huge chunk of memory that might be hard to get physically contiguous.
    • Linux keeps a 40 (64?) byte page struct per physical page of memory. That is way larger than my 1 bit! It only takes 40MB of a 4GB system (assuming 4kB pages) which might be an acceptable tradeoff.
    • Intermediate page table flags: we need to make sure that e.g. if a leaf entry is intended to be writable, then parent tables are marked writable too, especially if they already exist.
      • Perhaps we should always have high half page tables have WRITABLE | PRESENT, and lower half has WRITABLE | PRESENT | USER_ACCESSIBLE. Nice and easy.
    • Page table concurrency:
      • Consider representing each PageTableEntry as AtomicU64, or in the page table as AtomicInt<u64, PageTableEntry>
  • Userspace
    • Write a barebones Rust userspace program. Should it share code with the kernel, even just e.g. an interface module for syscalls?
    • Drop any memory we allocated for task, like task segment pages
      • Also drop any intermediate page tables we created.
      • Would it be easier to create an arena holding a process's memory so we could drop it all?
    • Make sure we can use NO_EXECUTE bit in page table (need some EFER setting?)
    • Re-enable interrupts while handling syscalls (or don't? at least be explicit)
      • If we expect interrupts to be disabled, make a comment where we disabled and where we do e.g. swapgs or something else that expects interrupts disabled
    • Segfault a user process and kill it instead of panicking and crashing the kernel
      • Be careful about locking the scheduler in the page fault handler. It is possible a spin lock was already taken on the scheduler and we'll deadlock (all though that shouldn't happen on the current CPU. Hmm)
    • Create a type showing the intended memory mapping of a process and turn that into a page table. This should make it easier to reason about the memory map.
  • (maybe not needed) Replace x86_64 crate IDT and rust extern "x86-interrupt" calls with my own assembly wrappers so I have finer control over registers, swapgs, etc
  • Per CPU
    • Create abstraction to make arrays of per cpu variables so GDT code is cleaner (although the percpu init function needs the GDT to be created already, for some reason)
      • Ensure cache line alignment!
      • Alternatively, support percpu structs, ensuring they are cache aligned. We should be able to either read/write the whole struct (wrapped in a PreemptGuard) or individual fields.
    • Maybe have a helper to take locks for multiple CPUs in a consistent way to prevent deadlocks, like ordering by processor ID. (Linux scheduler code does this for per CPU run queues)
    • Logging dependency on percpu:
      • Consider making init dependencies more explicit, by passing around a thing that was initialized, or even dumb tokens like struct PerCPUInitialized;.
      • Dep exists because the logger uses a spin lock, which modifies the percpu variable, logging depends on percpu being set up. Ensure percpu is set up before logging, or disable the logging spin locks until bootstrapping is done.
      • Maybe add debug_assert! calls ensuring percpu has been initialized before any percpu vars are used. (Make sure they get removed in release builds)
      • Linux has an early printk function, maybe for stuff like this?
      • Logging in already "slow" by kernel standards. Maybe it is okay to do some boolean check to take locks or not.
  • Arc memory leak detection:
    • Calling run_scheduler() (or more specifically switch_to_task) while holding an Arc reference (especially Arc<Task>) can cause a memory leak because we might switch away from the given task forever. Currently I manually drop things before calling these functions. Is there a way I could make calling run_scheduler basically impossible?
      • Same problem happens when jumping to userspace for the first time.
    • Find a way to detect leaked tasks, or maybe debug Arc leaks in general.
  • Networking
  • Filesystem
    • Writes
      • Instead of providing a block iterator function, have existing users request specific blocks. They can do iteration on their own. Then a user can request a new block at a specific location if it needs to do a write and a block doesn't exist.
      • Adding blocks to a file (maybe use some lorem ipsum generator or something to make up text of a given length, or embed some out-of-copyright literature in the binary)
        • Also ensure we add a block to directory inodes if we are trying to add a new directory entry and there isn't enough space. Might need a special constructor on DirectoryEntry for this case.
        • Make a function to find a new block given a desired block. It can be very dumb for now and in the future it can be smarter, but try to make a good interface for future improvements.
          • One simple but likely very effective algorithm is to first try and allocate a free block right after the previous block, and I'd that fails search for free blocks by byte in the bitmap (which would fine 8 free blocks in a row, and is super fast).
      • Appending: make sure we use the file length module block size to index into blocks. Don't just iterate over blocks.
      • Inode stats: ensure blocks and size_low/size_high fields are kept up to date
      • Creating a new file
    • Deletes (delete an entire file, unmark all inodes and blocks, etc)
    • Nested mountpoints, e.g. mount ext2 at root and then sysfs at /sys
      • Add mountpoint argument to mount and ensure parent directory exists (or mountpoint is /)
      • How do we ensure that when we do ls / we see /sys?
    • Ensure overwriting file properly truncates all blocks first by marking them as free and removing them from inode block pointers
    • Sysfs ideas: pci devices, virtio devices, memory info
    • Instead of returning Vec for directories, consider returning an impl Iterator (except you probably can't do that with traits...)
  • GDB:
  • Serial port:
    • find a way to implement using &mut and locking without deadlocks from e.g. non-maskable interrupts, holding the lock in the shell while trying to debug print inside kernel code, etc.
    • Consider multiple serial ports: one that spits out logs from the kernel, and one dedicated to the shell.
  • Spurious wakeups: refactor killing and sleeping so we don't rely on never having spurious wakeups, and so we don't need to rely on &mut self for scheduler to immediately run scheduler just once (we should run scheduler in a loop in case of spurious wakeup).
  • Task struct access: investigate not hiding all tasks (or just the current tasks) inside the big scheduler lock. Are there situations where it is okay to modify a task if the scheduler is running concurrently? Can we lock individual tasks? Is this inviting a deadlock?
    • For example, putting a task to sleep or waking it up. Is this bad to do concurrently with the scheduler? Maybe instead of calling this the "state" it can be thought of as a "state intent", which the scheduler should action next time it changes the task's scheduling. Wait queues and channels do this, but they need a scheduler lock under the hood.
    • Consider using per CPU for storing the currently running task instead of having a Vec of those in Scheduler
      • The "current task" is only valid in the current thread. We need to wrap it in a type that is not Send or Sync; we can't just return a &'static ref (or maybe we can if we make sa
  • Stack size: figure out why stacks need to be so large when compiling in debug mode. Is Rust putting a ton of debug info on the stack?
  • Deadlock debugging: find a way to detect deadlocks and print the locks involved
    • Linux has a neat debugging system for mutexes https://docs.kernel.org/locking/mutex-design.html#semantics
    • Should we fail if we are holding a spinlock for too long?
    • Consider naming spinlocks, and having the lock holder put their name once they take the lock. Then if we fail we can dump all of this info.
    • Have a way to dump stack trace from all threads on demand. If we deadlock across many threads, then have a way to do this in panic.
    • Have tasks record what they are waiting on when they go to sleep, even if it is just a String
    • Long shot: have a way to detect if we are in an atomic context, like an interrupt or exception context. Then, whenever we take a lock, record the context of the holder. Then, if we try to take a lock from an atomic context, and the current holder is in a non-atomic context, panic, admonishing us to ensure we use lock_disable_interrupts.
      • Recording the current context in general seems like a really neat idea, and a great way to debug things. For example, it could be a way to forbid memory allocation in interrupt handlers, try to do anything with current_task (like set sleep, even fetch current task, etc) inside of interrupts, trying to user percpu before it is set up (maybe we set context for "bootstrap stage"? then we could pick the correct logging implementation depending on stage? Also if we decide to oops or panic we could set a flag so many things like locks know to give up) etc (I can't think of more ideas right now but I know I've wanted this)
  • Pre-process DWARF info to use for stack traces so we don't need to keep frame pointers around (we currently have -C force-frame-pointers=yes)
  • Driver model:
    • Make a proper device and driver model like Linux. Then we can more easily populate sysfs, the PCI setup code can be less ad hoc (and more focused), mounting should be easier, etc.
    • Research kobject and sysfs
    • Linux has a simple match function on each driver on a bus. To find the right driver for a device, the bus iterates over all registered drivers and calls match, and matches to the first driver that returns 1.
  • VirtIO improvements:
    • Create a physically contiguous heap, or slab allocator, or something for virtio buffer requests so we don't waste an entire page per tiny allocation.
      • Ensure we are still satisfying any alignment requirements for buffers. Read the spec!
    • Remember features we negotiate, and ensure we are accounting for the different features in the logic (especially around notifications)
  • PCI: don't do so much cloning. Have clearer ownership, and take advantage of refs and mutable refs to avoid concurrency bugs.
  • IOAPIC: Throw an error if IOAPIC number assigned to twice
  • IRQ locking:
  • Try replacing bitmap allocator with a buddy allocator, perhaps itself implemented with multiple bitmaps https://wiki.osdev.org/Page_Frame_Allocation
  • registers.rs and macros
    • Also document registers.rs stuff
    • Consider using a proc macro to annotate fields on structs instead of code-generating the entire struct. This might get rid of the need for my PCI wrapper types because I can inline helper functions into the actual struct, and I can also include other information in the struct. Perhaps we can also handle the "embedding" case, where e.g. we want to easily be able to include the common PCI registers in a Type0 device struct, or allow VirtIO devices to include the common registers and the type 0 registers. Also, I think proc macros are more flexible.
      • Easier to set pub, private, pub(crate), etc
      • This would also allow us to group registers into dedicated structs. For example, have a struct for grouping (device_id, vendor_id), and maybe class, subclass, and prog_if. There really is no strict need to have all registers in the same struct generated through one macro, except it makes having a single from_address function and perhaps a Debug implementation nicer.
      • Find a way to use this macro for the virtq stuff, where the rings have dynamic size and then a struct member after that.
    • I really messed up my pointer math on some structs and now I'm scared. It would be really nice to be able to rely on #[repr(C)] alignment rules, especially for VirtIO where they use C structs in the spec.
  • virtio-rng interrupt doesn't seem to fire with UEFI disabled (make run UEFI=off). Fix it.
    • virtio-blk interrupts work! Just a problem with RNG
  • Read QEMU Internals

Resources

QEMU

In the QEMU monitor (Ctrl+Alt+2 in a QEMU graphical window), these are useful for looking at devices:

(qemu) info pci
(qemu) info qtree

Finding QEMU device help

# List devices
$ qemu-system-x86_64 -device help

# Help for a specific device
$ qemu-system-x86_64 -device virtio-rng-pci,help

GDB/Debugging

In QEMU:

Multiple threads:

  • Use info threads to see rip on all CPUs, and use thread apply all to run commands across CPUs, e.g. thread apply all backtrace.

Rust OS dev

APIC and IO/APIC

PCI

Virtio

Block device:

Struct offsets

The VirtIO spec uses C structs to compute offsets. Here is a C program that shows how to compute these offsets:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct virtio_blk_config {
 uint64_t capacity;
 uint32_t size_max;
 uint32_t seg_max;
 struct virtio_blk_geometry {
  uint16_t cylinders;
  uint8_t heads;
  uint8_t sectors;
 } geometry;
 uint32_t blk_size;
 struct virtio_blk_topology {
  // # of logical blocks per physical block (log2)
  uint8_t physical_block_exp;
  // offset of first aligned logical block
  uint8_t alignment_offset;
  // suggested minimum I/O size in blocks
  uint16_t min_io_size;
  // optimal (suggested maximum) I/O size in blocks
  uint32_t opt_io_size;
 } topology;
 uint8_t writeback;
 uint8_t unused0;
 uint16_t num_queues;
 uint32_t max_discard_sectors;
 uint32_t max_discard_seg;
 uint32_t discard_sector_alignment;
 uint32_t max_write_zeroes_sectors;
 uint32_t max_write_zeroes_seg;
 uint8_t write_zeroes_may_unmap;
 uint8_t unused1[3];
 uint32_t max_secure_erase_sectors;
 uint32_t max_secure_erase_seg;
 uint32_t secure_erase_sector_alignment;
};

int main(void)
{

 printf("offsetof(capacity) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, capacity)),
 printf("offsetof(size_max) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, size_max)),
 printf("offsetof(seg_max) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, seg_max)),
 printf("offsetof(geometry) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, geometry)),
 printf("offsetof(blk_size) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, blk_size)),
 printf("offsetof(topology) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, topology)),
 printf("offsetof(writeback) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, writeback)),
 printf("offsetof(unused0) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, unused0)),
 printf("offsetof(num_queues) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, num_queues)),
 printf("offsetof(max_discard_sectors) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, max_discard_sectors)),
 printf("offsetof(max_discard_seg) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, max_discard_seg)),
 printf("offsetof(discard_sector_alignment) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, discard_sector_alignment)),
 printf("offsetof(max_write_zeroes_sectors) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, max_write_zeroes_sectors)),
 printf("offsetof(max_write_zeroes_seg) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, max_write_zeroes_seg)),
 printf("offsetof(write_zeroes_may_unmap) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, write_zeroes_may_unmap)),
 printf("offsetof(max_secure_erase_sectors) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, max_secure_erase_sectors)),
 printf("offsetof(max_secure_erase_seg) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, max_secure_erase_seg)),
 printf("offsetof(secure_erase_sector_alignment) = 0x%02X\n", (long) offsetof(struct virtio_blk_config, secure_erase_sector_alignment)),

 printf("sizeof(struct virtio_blk_config) = 0x%02X\n", (long) sizeof(struct virtio_blk_config));

 exit(EXIT_SUCCESS);
}

Volatile memory access in Rust, and spurious reads

TL;DR: Use raw pointers instead of references to memory-mapped IO regions to guarantee you won't have spurious reads. There is an excellent blog post that explains this. There is also a good forum thread explaining the dangers of this.

Interrupt/IRQ handling

How linux does things:

  • For CPU exceptions (vectors < 32), they have a hard-coded handler in the IDT
  • For external interrupts (starting at 32) Linux pre-populates a stub interrupt handler for every vector (256 - 32 of them on x86_64) that simply calls common_interrupt with the vector number.
  • Definition for common_interrupt

Other higher-level Linux resources:

Multi-tasking and context switching

Clock sources and time

Memory barriers

When performing memory-mapped device IO, it is often important to ensure that your reads and writes are performed in the order you write them in code. Using volatile reads/writes can ensure that the compiler doesn't reorder them, but the CPU may still reorder them.

Memory management

File systems

ext2:

Linux block devices:

Linux VFS:

Userspace and system calls in x86_64

Per CPU variables

ELF