Redesign
npmccallum opened this issue · 2 comments
🚧🚧🚧 WORK IN PROGRESS 🚧🚧🚧
Problems with the Current Implementation
As it stands today, sallyport works but suffers from a number of shortcomings.
- Only a single syscall can be proxied to the host per guest exit. This means that every syscall proxied to the host has to pay the full exit penalty. It would be desirable to also support batching or "smuggling" (i.e. implicitly conjoining two unrelated syscalls). Batching isn't a priority currently. However, smuggling could be used to piggyback time-related syscalls on other exits in order to update an internal map between an instruction counter and the system time. This would dramatically decrease the time spent on each request from the keep for the current time.
- There is no way to tell, from the sallyport itself, whether it contains valid syscalls to execute during a keep exit. This means that the main loops have to infer by other means whether the contents of the sallyport are valid to process. This is not good for safety.
- The allowlist of syscalls for the host side is currently left up to the keep loader backends, which currently allow all syscalls. It would be better if sallyport knew all its implemented syscalls and only permitted the ones it knows about.
- Address translation is currently left up to the backend to implement. This results in strange code such as in kvm, where the guest has knowledge of the host addresses in order to translate the addresses correctly. This is terrible for security since it means that the host would have to validate that all of the addresses are safe.
Design Requirements
- Like before, sallyport needs a defined mechanism for holding both data viewable by the host as well as a facility for passing syscall requests.
- Sallyport should support multiple syscalls at once in the sallyport.
- The host should be able to inspect sallyport memory in a safe way to determine how many syscalls are being requested.
- Sallyport should encode all syscall pointers as an offset in the sallyport buffer. The guest knows its address for the sallyport, so it can convert this offset to an address. Likewise, the host knows its address for the sallyport, so it too can convert this offset to an address. This also makes pointer validation easy since pointers are only valid if they point into the sallyport.
- Sallyport should provide host and guest interfaces for interacting with the sallyport. These interfaces should be rich and have knowledge of how to serialize each syscall. This allows us to effectively share syscall proxying code between technologies.
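The offset requirement above can be illustrated with a small sketch. The helper names `checked_range` and `resolve` are hypothetical and not part of sallyport; the sketch assumes offsets and lengths are expressed in bytes.

```rust
use core::ops::Range;

/// Hypothetical helper: checks that `len` bytes starting at `offset` lie
/// entirely inside a sallyport of `block_len` bytes.
pub fn checked_range(block_len: usize, offset: usize, len: usize) -> Option<Range<usize>> {
    // `checked_add` rejects offset/length pairs that would wrap around.
    let end = offset.checked_add(len)?;
    if end <= block_len {
        Some(offset..end)
    } else {
        None
    }
}

/// Hypothetical helper: resolves an offset to bytes on whichever side
/// (guest or host) holds its own mapping of the sallyport as `block`.
pub fn resolve(block: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    let range = checked_range(block.len(), offset, len)?;
    Some(&block[range])
}
```

Because both sides only ever exchange offsets, neither side needs to know the other's address for the sallyport, and the bounds check is the whole of pointer validation.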
Data Types
This section will define the data types in the sallyport system. It will use traits or type aliases as a common way to express how the data types should look. However, this does not imply that these should actually be traits or type aliases unless otherwise noted. Often they will just be structs that have implemented methods with similar signatures to the proposed trait.
Word
A word is a pointer-sized integer. This is simply another way of expressing usize.
Block
┌──────┐
│ Size │
├──────┤
│      │
│ Data │
│ ...  │
│      │
├───┬──┤
│   │  │
│   ▼  │
│  ▲   │
│  │   │
├──┴───┤
│ Reps │
├──────┤
│      │
│ Reqs │
│      │
└──────┘
Fig. A
A sallyport block is an array of words (i.e. &[usize]). Expressing the block in this way solves both alignment and code simplicity issues. However, this is really just a region of memory. (As you will see, when we allocate space for data in the block we will do .align_to::<u8>(), so this is really just a block of bytes expressed as words.)
The full form of the block is shown in Figure A above. At the top of the block is the size word, which identifies how many request/response pairs there are at the bottom of the block. Below the size word is a region reserved for storing data. This data can be referenced in requests or responses by expressing the offset into the block for the data in question.
Note that during block content construction, the data and request/response sections grow towards each other. Care MUST be taken to ensure these sections never overlap.
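As a minimal sketch of the "bytes expressed as words" idea, the following shows one way to view a word block as bytes using the standard slice method `align_to_mut`. The function name `as_bytes_mut` is hypothetical.

```rust
/// Hypothetical helper: views a block of words as bytes without copying.
pub fn as_bytes_mut(block: &mut [usize]) -> &mut [u8] {
    // SAFETY: every bit pattern is a valid `u8`, and any byte written
    // through the returned slice leaves the underlying `usize` valid.
    let (prefix, bytes, suffix) = unsafe { block.align_to_mut::<u8>() };
    // `u8` has an alignment of 1, so nothing is expected to be left over
    // on either side. The assertion guards that expectation.
    assert!(prefix.is_empty() && suffix.is_empty());
    bytes
}
```

Keeping the backing storage as words means that requests and responses, which are word arrays, are always correctly aligned, while data can still be addressed at byte granularity.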
Request
type Request = [usize; 8];
A request is simply eight contiguous words. The first word defines the contents of the remaining 7 words.
SysCall
const SYSCALL: usize = 0;
If the first word contains a SYSCALL value, the second word contains a syscall number and the remaining six words contain the platform's syscall argument registers in the platform defined order. For example, on x86_64-unknown-linux-gnu the format of the request would be:
request[0] == SYSCALL;
request[1] == rax;
request[2] == rdi;
request[3] == rsi;
request[4] == rdx;
request[5] == r10;
request[6] == r8;
request[7] == r9;
All registers which contain pointer values MUST be expressed as an offset in the sallyport to the data rather than an absolute address.
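A short sketch of building such a request for `read(fd, buf, count)` follows. The helper names are hypothetical, and `SYS_READ = 0` assumes the x86_64 Linux syscall numbering.

```rust
pub type Request = [usize; 8];
pub const SYSCALL: usize = 0;

/// Assumption: syscall number of `read` on x86_64 Linux.
pub const SYS_READ: usize = 0;

/// Hypothetical helper: packs a syscall number and six argument
/// registers into a request.
pub fn syscall_request(nmbr: usize, args: [usize; 6]) -> Request {
    let mut request = [0usize; 8];
    request[0] = SYSCALL;
    request[1] = nmbr;
    request[2..].copy_from_slice(&args);
    request
}

/// Hypothetical helper: `read(fd, buf, count)` where `buf` is given as an
/// offset into the sallyport rather than an absolute address.
pub fn read_request(fd: usize, buf_offset: usize, count: usize) -> Request {
    syscall_request(SYS_READ, [fd, buf_offset, count, 0, 0, 0])
}
```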
Response
type Response = [usize; 2];
Each request is paired with its corresponding response. These can be correlated by their index (the first response correlates with the first request). The formatting of the response is platform and request type dependent. For example, on x86_64-unknown-linux-gnu the response words are rax and rdx, respectively, and -errno values are passed as the highest 4095 values in rax (that is, -4095 through -1 when rax is interpreted as a signed integer).
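The errno convention can be decoded as in the sketch below. The name `response_to_result` is borrowed from the `Handler` code later in this issue, but its body here is an assumption based on the Linux convention that return values from -4095 through -1 are error codes.

```rust
/// Assumption: the largest errno value Linux encodes in a return register.
const MAX_ERRNO: usize = 4095;

/// Sketch: converts the first response word (`rax`) into a `Result`.
pub fn response_to_result(rax: usize) -> Result<usize, i32> {
    // `usize::MAX` is -1 when reinterpreted as signed, so the error range
    // -MAX_ERRNO..=-1 is the top MAX_ERRNO values of the unsigned range.
    if rax > usize::MAX - MAX_ERRNO {
        // Negating the value recovers the positive errno.
        Err(rax.wrapping_neg() as i32)
    } else {
        Ok(rax)
    }
}
```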
Phases
The block has three states which correspond with three phases:
- Start - the guest begins forwarding requests to the host
- Exit - the host receives control from the guest
- Return - the guest receives control back from the host
We will define the contents of the sallyport at each state and then explain what the correlated phase of code must do to transition to the next state.
Start Phase
When the guest is ready to forward requests to the host, the sallyport contents are undefined. The guest begins by writing a 0 to the size word. The block is now empty.
For each request to be forwarded over the sallyport, the guest should:
- Allocate and write any data needed to the data section.
- Append a new request to the request section.
- Increment size.
Once all data and requests have been written, the guest exits to the host.
Care MUST be taken by the guest to ensure that the data and request/response sections never overlap. This implies that the guest MUST leave sufficient space between the data and request sections for the host to write size responses.
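The start phase can be sketched as a small builder. Everything here is hypothetical: the `Builder` type, byte-granular data offsets counted from the start of the block, and in particular the choice to reserve each response slot directly in front of its request, which is a simplification and not necessarily the layout of Figure A.

```rust
const WORD: usize = core::mem::size_of::<usize>();
const REQUEST_WORDS: usize = 8;
const RESPONSE_WORDS: usize = 2;

/// Hypothetical guest-side builder for the start phase.
pub struct Builder<'a> {
    block: &'a mut [usize],
    /// Next free byte offset in the data section (grows toward the tail).
    data_end: usize,
    /// Words reserved at the tail for requests and their responses.
    tail_words: usize,
}

impl<'a> Builder<'a> {
    /// Writes 0 to the size word, which makes the block empty.
    pub fn new(block: &'a mut [usize]) -> Option<Self> {
        *block.first_mut()? = 0;
        Some(Self { block, data_end: WORD, tail_words: 0 })
    }

    /// Bytes still unclaimed between the data section and the tail.
    fn free_bytes(&self) -> usize {
        (self.block.len() - self.tail_words) * WORD - self.data_end
    }

    /// Reserves `len` bytes of data and returns their byte offset.
    pub fn allocate(&mut self, len: usize) -> Option<usize> {
        if len > self.free_bytes() {
            return None;
        }
        let offset = self.data_end;
        self.data_end += len;
        Some(offset)
    }

    /// Appends a request, reserving room for its response, and
    /// increments the size word.
    pub fn append(&mut self, request: [usize; 8]) -> Option<()> {
        let needed = REQUEST_WORDS + RESPONSE_WORDS;
        if needed * WORD > self.free_bytes() {
            return None; // the two sections would overlap
        }
        self.tail_words += needed;
        let start = self.block.len() - self.tail_words;
        self.block[start + RESPONSE_WORDS..start + needed].copy_from_slice(&request);
        self.block[0] += 1;
        Some(())
    }
}
```

Both `allocate` and `append` check against the same `free_bytes` value, so the overlap rule is enforced in one place.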
Exit Phase
When the host receives control from the guest, the host does not have knowledge of the contents of the data section. It only knows that the bottom of the block contains size number of requests.
After the host sanity checks the size value, it should call each syscall in order and write each response to the block. The host MUST NOT call the syscalls naively. It must validate all "pointer" types in the registers (expressed as offsets in the block) to ensure they point to the data region before converting them to proper host addresses. Then the host should perform any other syscall-specific validation before calling the syscall.
If an unknown syscall is requested, the host should respond ENOSYS.
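A sketch of the host-side checks follows. It only validates and does not execute anything. The names `Call`, `size_is_sane` and `validate` are hypothetical, the numeric constants assume x86_64 Linux, and rejecting a bad offset with EFAULT is an assumption, since the text only specifies ENOSYS for unknown syscalls.

```rust
use core::ops::Range;

const SYSCALL: usize = 0;
const SYS_WRITE: usize = 1; // assumption: x86_64 Linux numbering
const EFAULT: i32 = 14; // assumption: Linux errno values
const ENOSYS: i32 = 38;

/// A request that passed validation and is safe to execute.
#[derive(Debug, PartialEq)]
pub enum Call {
    Write { fd: usize, buf: Range<usize> },
}

/// Rejects a `size` that could not possibly fit in the block.
pub fn size_is_sane(size: usize, block_words: usize) -> bool {
    // Each request needs 8 words plus 2 words for its response.
    match size.checked_mul(10) {
        Some(words) => words < block_words, // one word holds `size` itself
        None => false,
    }
}

/// Validates one request against the byte range of the data section.
pub fn validate(request: &[usize; 8], data: Range<usize>) -> Result<Call, i32> {
    if request[0] != SYSCALL {
        return Err(ENOSYS);
    }
    match request[1] {
        SYS_WRITE => {
            let (start, len) = (request[3], request[4]);
            let end = start.checked_add(len).ok_or(EFAULT)?;
            if start < data.start || end > data.end {
                return Err(EFAULT);
            }
            Ok(Call::Write { fd: request[2], buf: start..end })
        }
        // Anything not explicitly listed is refused, which doubles as
        // the allowlist.
        _ => Err(ENOSYS),
    }
}
```

Only after `validate` succeeds would the host convert the offsets in `buf` to host addresses and issue the real syscall.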
Return Phase
When the host returns control to the guest, the guest MUST NOT presume that any sallyport values are valid. Therefore, it MUST sanity check all the input values. For example, the guest should first validate that the size value is unmodified. Likewise, it should validate that the requests section is unmodified.
After validation, the guest should iterate through request/response pairs, continuing to validate all nested input data. For example, if a "pointer" type isn't expressed as a valid offset in the sallyport, the guest should immediately stop all further execution.
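One way to perform the "unmodified" checks is for the guest to keep a private copy of what it wrote and compare it on return. This is a sketch under that assumption; `Tampered`, `verify_return` and `check_read_len` are hypothetical names.

```rust
/// Marker error: the host changed something it must not change.
#[derive(Debug, PartialEq)]
pub struct Tampered;

/// Compares the guest's private copy of `size` and the requests against
/// what is in the sallyport after the host returns.
pub fn verify_return(
    saved_size: usize,
    saved_requests: &[[usize; 8]],
    size_now: usize,
    requests_now: &[[usize; 8]],
) -> Result<(), Tampered> {
    if size_now != saved_size || requests_now != saved_requests {
        return Err(Tampered);
    }
    Ok(())
}

/// Example of nested validation: a successful `read` can never report
/// more bytes than were requested.
pub fn check_read_len(returned: usize, requested: usize) -> Result<usize, Tampered> {
    if returned > requested {
        Err(Tampered)
    } else {
        Ok(returned)
    }
}
```

On `Err(Tampered)` the guest would stop all further execution, as described above.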
Guest Side
Platform
use libc::c_int;

trait Platform {
    /// Suspend guest execution and pass control to host.
    /// This function will return when the host passes control back to the guest.
    fn sally(&mut self) -> Result<(), c_int>;

    /// Validates that a region of memory is valid.
    /// Returns a pointer if valid, otherwise `EINVAL`.
    fn validate<T: Copy>(&self, ptr: usize, len: usize) -> Result<*const T, c_int>;
}
This is an actual trait. This is what we need each technology (i.e. kvm, sgx) to implement.
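For unit testing the guest side without a real backend, a test double along these lines might be useful. `MockPlatform` is hypothetical, `EINVAL = 22` assumes Linux, and the sketch assumes `len` counts elements of `T` rather than bytes, which the trait above leaves open.

```rust
use core::ffi::c_int;

const EINVAL: c_int = 22; // assumption: Linux errno value

pub trait Platform {
    fn sally(&mut self) -> Result<(), c_int>;
    fn validate<T: Copy>(&self, ptr: usize, len: usize) -> Result<*const T, c_int>;
}

/// Hypothetical test double: treats `base..base + size` as the only
/// valid memory and counts how often the guest exits.
pub struct MockPlatform {
    pub base: usize,
    pub size: usize,
    pub sallies: usize,
}

impl Platform for MockPlatform {
    fn sally(&mut self) -> Result<(), c_int> {
        self.sallies += 1;
        Ok(())
    }

    fn validate<T: Copy>(&self, ptr: usize, len: usize) -> Result<*const T, c_int> {
        // Assumption: `len` is a number of `T` elements.
        let bytes = len.checked_mul(core::mem::size_of::<T>()).ok_or(EINVAL)?;
        let end = ptr.checked_add(bytes).ok_or(EINVAL)?;
        let limit = self.base.checked_add(self.size).ok_or(EINVAL)?;
        let aligned = ptr % core::mem::align_of::<T>() == 0;
        if aligned && ptr >= self.base && end <= limit {
            Ok(ptr as *const T)
        } else {
            Err(EINVAL)
        }
    }
}
```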
Handler
struct Handler(...);

impl Handler {
    /// Create a new `Handler`.
    pub fn new(block: &[usize], platform: impl Platform) -> Self;

    /// Called when the host has returned invalid data. Never returns.
    pub fn attacked(&mut self) -> ! {
        // Loop in case the host tries to reenter
        loop {
            // Try to exit...
            self.exit(1);
        }
    }

    /// Takes in the syscall registers, constructs the relevant
    /// data types from them and calls the correct method below.
    ///
    /// # Safety
    ///
    /// This method is unsafe because it interprets registers to
    /// the correct data types. However, in actual implementation
    /// it might be safe if we can validate the inputs.
    unsafe fn syscall(&mut self, registers: [usize; 7]) -> Result<[usize; 2], c_int> {
        match registers[0] as libc::c_long {
            libc::SYS_read => {
                let fd = registers[1] as _;
                let ptr = self.platform.validate::<u8>(registers[2], registers[3])?;
                let buffer = from_raw_parts_mut(ptr as *mut u8, registers[3]);
                Ok([self.read(fd, buffer)?, 0])
            }
            ...
        }
    }

    /// Execute a read syscall...
    pub fn read(&mut self, fd: c_int, buffer: &mut [u8]) -> Result<usize, c_int> {
        // Allocate buffer.len() bytes in the data section.
        let offset_in_block = self.allocate(buffer.len());

        // Append request
        self.append(&[SYSCALL, libc::SYS_read as usize, fd as usize, offset_in_block, buffer.len()]);
        self.leave();

        // Validate return value: decode -errno first, then check that the
        // host did not report more bytes than were requested.
        let responses: Vec<_> = self.responses().collect();
        let count = response_to_result(responses[0])?;
        if count > buffer.len() {
            self.attacked()
        }
        // (Copying the data from the block into `buffer` is omitted here.)
        Ok(count)
    }

    /// Other syscall methods...
    pub fn write(&mut self, fd: c_int, buffer: &[u8]) -> Result<usize, c_int>;
    ...
}
The Handler instance is the guest's interface with sallyport. The guest can execute a syscall directly. Or it can use the convenience method syscall() to pass raw registers in from a syscall.
This is the redesign of the sallyport block we finalized today. It is only minor changes from Roman's newest PR.
===============================================
The sallyport block is a region of memory containing zero or more items. All items contain the following header:
- size: usize
- kind: usize
The size parameter includes the full length of the item except the header value. The contents of the item are defined by the value of the kind parameter. An item with an unknown kind can be skipped since the length of the item is known from the size field. The recipient of an item with an unknown kind MUST NOT try to interpret or modify the contents of the item in any way.
Kinds
- END: 0
- SYSCALL: 1
- ...
End
An END item MUST have a size of 0. It has no contents and simply marks the end of items in the block. This communicates the end of the items list to the host. However, the guest MUST NOT rely on the presence of a terminator upon return to the guest.
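A host-side walk over the item list might look like the sketch below. The function `items` is hypothetical, and it assumes that size is expressed in bytes and is a multiple of the word size; the description above does not fix the unit. Because it requires an END item, it models the host's view only.

```rust
const WORD: usize = core::mem::size_of::<usize>();
pub const END: usize = 0;

/// Sketch: returns `(kind, contents)` for every item before END, or
/// `None` if the block is malformed or has no terminator.
pub fn items(block: &[usize]) -> Option<Vec<(usize, &[usize])>> {
    let mut out = Vec::new();
    let mut rest = block;
    loop {
        // Every item starts with the two header words.
        let (&size, tail) = rest.split_first()?;
        let (&kind, tail) = tail.split_first()?;
        if kind == END {
            // An END item MUST have a size of 0.
            return if size == 0 { Some(out) } else { None };
        }
        if size % WORD != 0 {
            return None;
        }
        let words = size / WORD;
        if words > tail.len() {
            return None; // the item claims more space than the block has
        }
        // Unknown kinds are carried along untouched; the caller decides
        // which kinds it understands and skips the rest.
        let (contents, tail) = tail.split_at(words);
        out.push((kind, contents));
        rest = tail;
    }
}
```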
Syscall
A SYSCALL item has the following contents:
- nmbr: usize - the syscall number
- arg0: usize - the first argument
- arg1: usize - the second argument
- arg2: usize - the third argument
- arg3: usize - the fourth argument
- arg4: usize - the fifth argument
- arg5: usize - the sixth argument
- ret0: usize - the first return value
- ret1: usize - the second return value
- data: ... - data that can be referenced (optional)
The argument values may contain numeric values. However, all pointers MUST be translated to an offset from the beginning of the data section.
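Serializing a SYSCALL item can be sketched as follows. `push_syscall` is a hypothetical helper. It assumes that size is in bytes, that data is padded to a whole number of words, and that the return values are zero until the host fills them in.

```rust
const WORD: usize = core::mem::size_of::<usize>();
pub const SYSCALL: usize = 1;

/// Sketch: appends one SYSCALL item (header, nmbr, arg0..arg5, ret0,
/// ret1, data) to `out`. Pointer arguments in `args` must already be
/// offsets from the beginning of this item's data section.
pub fn push_syscall(out: &mut Vec<usize>, nmbr: usize, args: [usize; 6], data: &[u8]) {
    let data_words = (data.len() + WORD - 1) / WORD;
    // Header: size excludes the header itself (assumed to be in bytes).
    out.push((9 + data_words) * WORD);
    out.push(SYSCALL);
    // Contents: number, six arguments, two return value slots.
    out.push(nmbr);
    out.extend_from_slice(&args);
    out.extend_from_slice(&[0, 0]);
    // Data section, zero-padded to a whole number of words.
    for chunk in data.chunks(WORD) {
        let mut bytes = [0u8; WORD];
        bytes[..chunk.len()].copy_from_slice(chunk);
        out.push(usize::from_ne_bytes(bytes));
    }
}
```

For example, `write(1, "hello", 5)` on x86_64 Linux would be `push_syscall(&mut block, 1, [1, 0, 5, 0, 0, 0], b"hello")`, where the second argument `0` is the offset of the string within the data section.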
> It is only minor changes from Roman's newest PR.
It is a minor change in terms of the protocol, but it raises the complexity of the implementation quite a bit on both sides, so it requires more work (or unsafe workarounds).