zilch-lang/nstar

System calls


Depending on the target, we may want to use system calls.
However, they have to be typed, otherwise they would be unusable in N*.

We propose a syntax element syscall <N> : <type> to type a specific syscall (identified by N).

System calls are a bit harder to handle than previously thought.

Let's take the example of Linux's open(2) system call:

  • Inputs (x86-64 convention):
    • rax = 0x02
    • rdi : pointer = path of the file to open
    • rsi : s32 = flags (read, write, etc.)
    • rdx : umode_t = permission bits used if the file is created
  • Outputs:
    • rax : s32 = file descriptor

This cannot reasonably be modeled with a single type¹, and we also want the compiler to insert mv 0x02, %r0 when expanding syscall 2 (otherwise the user would have to repeat the number by hand).
In order to model the inputs and outputs, we can parse structures akin to the typechecker's contexts (in typing rules, the context is denoted Ξ; Γ; χ; σ; ε and contains all the information we need) and use those within the typechecker².
As for the instructions to generate automatically, we will support macro expansion, all non-terminal N⋆ instructions, and a special interrupt N instruction³.


As such, we propose this syntax for syscalls:

non-terminal-instructions := instruction ";" non-terminal-instructions | instruction
context := "(" XI ";" GAMMA ";" CHI ";" SIGMA ";" EPSILON ")"
syscall := "syscall" number ":" context "->" context "=" non-terminal-instructions

where:

  • XI and GAMMA are comma-separated lists of bindings label: type
  • CHI is a comma-separated list of bindings register: type
  • SIGMA is a stack type
  • EPSILON is a continuation type
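
To make the grammar above concrete, here is a minimal Python sketch of a parser for a single declaration. It is only an illustration, not the N⋆ implementation: the names `Context`, `Syscall` and `parse_syscall` are hypothetical, types and instructions are kept as opaque strings, and it assumes that context components contain no parentheses or semicolons.

```python
import re
from dataclasses import dataclass


@dataclass
class Context:
    xi: dict       # label -> type
    gamma: dict    # label -> type
    chi: dict      # register -> type
    sigma: str     # stack type, kept as opaque text
    epsilon: str   # continuation type, kept as opaque text


@dataclass
class Syscall:
    number: int
    pre: Context
    post: Context
    body: list     # non-terminal instructions, as strings


# "syscall" number ":" context "->" context "=" non-terminal-instructions
# Assumption: no nested parentheses inside a context.
_DECL = re.compile(
    r"^\s*syscall\s+(\w+)\s*:\s*\(([^()]*)\)\s*->\s*\(([^()]*)\)\s*=\s*(.+?)\s*$",
    re.DOTALL,
)


def _bindings(text):
    """Parse a comma-separated list of `name: type` bindings."""
    out = {}
    for item in filter(None, (s.strip() for s in text.split(","))):
        name, _, ty = item.partition(":")
        out[name.strip()] = ty.strip()
    return out


def _context(text):
    """Parse `XI ; GAMMA ; CHI ; SIGMA ; EPSILON` (without the parentheses)."""
    parts = [p.strip() for p in text.split(";")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 context components, got {len(parts)}")
    xi, gamma, chi, sigma, epsilon = parts
    return Context(_bindings(xi), _bindings(gamma), _bindings(chi), sigma, epsilon)


def parse_syscall(line):
    """Parse one `syscall N : pre -> post = instrs` declaration."""
    m = _DECL.match(line)
    if m is None:
        raise ValueError("not a syscall declaration")
    number, pre, post, body = m.groups()
    return Syscall(
        int(number, 0),  # base 0 accepts both 60 and 0x3c
        _context(pre),
        _context(post),
        [instr.strip() for instr in body.split(";")],
    )
```

For example, `parse_syscall("syscall 2 : (; ; r1: s32; s; e) -> (; ; r0: s32; s; e) = interrupt 0x80")` yields a `Syscall` whose precontext binds `r1` and whose postcontext binds `r0`. The register names and types in that example are made up for illustration.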

In the typechecker, each part of the context will be unified with the internal context and used as such. System calls modify the internal context as specified by context → context. The first context is the precontext: it describes all the inputs the syscall needs in order to operate correctly. The second context is the postcontext: it describes what the syscall gives back.
One may bind a register to ! in the postcontext to allow its value to be forgotten, or in the precontext to mark it as callee-saved.
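
As a rough illustration of how a precontext and postcontext could act on the register part χ of the typechecker's context, here is a small Python sketch. It makes several simplifying assumptions that are not in the proposal: types are plain strings, unification is replaced by string equality, only χ is handled, and only the postcontext use of ! is modelled. The names `apply_syscall` and `SyscallTypeError` are hypothetical.

```python
FORGET = "!"


class SyscallTypeError(Exception):
    """Raised when the current registers do not satisfy the precontext."""


def apply_syscall(chi, pre_chi, post_chi):
    """Return the register typing after the syscall.

    chi      -- current register typing (register -> type)
    pre_chi  -- registers the syscall requires, with their types
    post_chi -- registers the syscall returns; "!" forgets a register
    """
    # Check the inputs (stand-in for unification: plain equality).
    for reg, expected in pre_chi.items():
        actual = chi.get(reg)
        if actual != expected:
            raise SyscallTypeError(
                f"{reg}: expected {expected}, found {actual or 'unbound'}"
            )

    # Registers not mentioned in the postcontext keep their type.
    result = dict(chi)
    for reg, ty in post_chi.items():
        if ty == FORGET:
            result.pop(reg, None)   # the value is forgotten after the call
        else:
            result[reg] = ty        # the syscall wrote a value of this type
    return result
```

With `chi = {"r0": "u64", "r1": "s32"}`, a call such as `apply_syscall(chi, {"r1": "s32"}, {"r0": "s32", "r1": "!"})` rebinds `r0` and drops `r1` from the resulting context.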

Footnotes

  1. It is possible, and this is what is done with labels and continuations, but this is not great. System calls do not jump around (they would need to be terminal instructions, which is a bit unsatisfactory), and having continuations for system calls does not seem to quite make sense.

  2. We will need to check the consistency of the rules. A simple rule of thumb is that every variable on the left of the arrow must also appear on the right, and every variable on the right must already be bound on the left. This prevents the user from writing the rule (Ξ; Γ; χ; σ; ε) → (Ξ'; Γ'; χ'; σ'; ε') where all the primed versions are fresh identifiers not bound on the left of the arrow.

  3. The interrupt N instruction exists in order to generate the correct code for the INT (or INT-like) instructions. These codes depend on the kernel, not the CPU, so we cannot generate them when compiling N⋆ instructions to machine code. We could also have another syntactic construction to specify which INT code to use for system calls (something like syscall → X?). Further system call definitions would then be augmented by an implicit interrupt X.
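
The rule of thumb from footnote 2 could be checked along these lines. This Python sketch assumes the variables of each side have already been collected into lists of names (how they are extracted from the contexts is left out), and `check_rule` is a hypothetical helper name.

```python
def check_rule(pre_vars, post_vars):
    """Return a list of consistency errors for a rule `pre -> post`.

    pre_vars  -- names of the variables used on the left of the arrow
    post_vars -- names of the variables used on the right of the arrow
    """
    pre, post = set(pre_vars), set(post_vars)
    errors = []
    # Every variable on the left must also appear on the right.
    for v in sorted(pre - post):
        errors.append(f"variable {v!r} is bound on the left but missing on the right")
    # Every variable on the right must already be bound on the left.
    for v in sorted(post - pre):
        errors.append(f"variable {v!r} appears on the right but is not bound on the left")
    return errors
```

An empty list means the rule passes this check. The rule with only primed, fresh variables on the right is rejected because none of them is bound on the left.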

In a sense, syscalls as presented in the comment above are like (unsafely¹) typed macros.

Footnotes

  1. Unsafely because interrupt does not have a clear type, and also because these macros' types do not live in the grammar itself.

See https://syscalls32.paolostivanin.com/, https://syscalls64.paolostivanin.com/ or https://syscall.sh/ for system call codes and arguments on Linux.
Note that all system calls return an integer value in %rax (but we may choose to discard it).

The interrupt N instruction exists in order to generate the correct code for the INT (or INT-like) instructions. These codes depend on the kernel, not the CPU, so we cannot generate them when compiling N⋆ instructions to machine code. We could also have another syntactic construction to specify which INT code to use for system calls (something like syscall → X?). Further system call definitions would then be augmented by an implicit interrupt X.

Instead of this, and instead of putting syscall definitions at the top level, we can unify both in a “section” that carries the interrupt number.
Something like:

syscall 0x80 {
  # exit
  syscall 60: ... → ... = ...
}

That way, there is no problem with forgotten syscall → N declarations. We also gain that N is now scoped (although we should disallow multiple syscall sections with different Ns).

There is no need to repeat the syscall keyword inside the block (the block itself only contains syscall declarations).
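
To illustrate the section idea, here is a hedged Python sketch of how a compiler might represent such a block, reject sections with different interrupt numbers, and expand one syscall into instructions. The names `SyscallSection`, `merge_sections` and `expand` are hypothetical. The placement of the generated instructions (mv first, the implicit interrupt last) is an assumption, as the discussion does not fix it.

```python
from dataclasses import dataclass, field


@dataclass
class SyscallSection:
    interrupt: int                                 # the N of `syscall N { ... }`
    syscalls: dict = field(default_factory=dict)   # number -> list of instructions


def merge_sections(sections):
    """Merge all sections, refusing sections with different interrupt numbers.

    Returns None when there is no section at all.
    """
    merged = None
    for sec in sections:
        if merged is None:
            merged = SyscallSection(sec.interrupt, dict(sec.syscalls))
        elif sec.interrupt != merged.interrupt:
            raise ValueError(
                f"conflicting interrupt numbers: "
                f"{merged.interrupt:#04x} and {sec.interrupt:#04x}"
            )
        else:
            merged.syscalls.update(sec.syscalls)
    return merged


def expand(section, number):
    """Instructions generated for `syscall number` inside this section."""
    body = section.syscalls[number]
    return [
        f"mv {number:#04x}, %r0",                  # inserted by the compiler
        *body,                                     # instructions from the definition
        f"interrupt {section.interrupt:#04x}",     # implicit, from the section
    ]
```

Because the interrupt number lives on the section, an individual definition can never be left without one, which is the point made above.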