genodelabs/genode

x86: generalized virtualization interface

Closed this issue ยท 59 comments

Currently, on x86 we support hardware assisted virtualization for Genode/hw running as subject on the Muen hypervisor and for Genode/NOVA as used for operating Sculpt OS.

Last year (2018) we had the rough idea, as stated on the roadmap, to extend our own -hw- kernel also with x86 virtualization features. Beside the required extension of the -hw- kernel, on top we need a VMM to actually handle/operate the virtualization events/exits. Either we have to develop a own VMM for -hw- or re-use existing VMMs.

Up to now, on Genode/NOVA we support 3 VMMs - namely Seoul, Virtualbox 4 and Virtualbox 5. On Genode/hw/Muen solely Virtualbox 4 is supported. Unfortunately, the VMMs are currently tight hard to the specific platform - so just reusing the VMMs for -hw- is not possible at the moment.

Though, as kind of preparation step to empower the -hw- kernel with virtualization extension, a generalized Genode virtualization interface is desirable with the goal to re-use our existing VMMs on x86.

In the course of this feature topic also other supported x86 kernels, like seL4 and Fiasco.OC, may be considered - depending on the effort encountered.

I pushed a first draft of the commits. Please have a look/review at the first commits. The kernel specific commits of nova/foc/sel4 are not ready - I just pushed them that you get an idea of how the actual implementation of the interface looks like. Depending on changes of the vm_session/base commits they are subject to change.

@nfeske: i reworked the extension commit of the vm_session interface, please review da06421. As soon as it get acknowledged and gets to staging I may continue with the fellow up commits.

da06421 looks very good. Thanks for the careful rework. I merged the commit to staging, with the muen part shaved off (as it was just removed with VBox4).

Thanks for taking care, I wanted to clean up the commit today ;-)

I rebased and re-tested on hardware the nova part (vmm_x86.run) on the last master branch. Can you please review the 4 commits of https://github.com/alex-ab/genode/commits/issue_3111_19_02, so that I may continue and merge your change requests.

Questions after looking at the first commit:

What's the rationale behind the split between Vm_state_raw and Vm_state? Please document. If the split is needed, why have you used inheritance (complicated), not aggregation (simple)?

Please try replacing the preprocessor macros vm_register, vm_segment, vm_range by structs, possibly defined within the scope of Vm_state (Vm_state::Reg, Vm_state::Segment, Vm_state::Range). Macros should be avoided whenever possible. The only acceptable use case for them is reflection (#name to obtain a string literal of a macro argument), which I don't see used here.

When using plain structs w/o constructor, one can work with the values like tuples, which is less prone to wrong argument ordering. Just sketching:

/* within the definition of Vm_state */
struct Reg { bool used, updated; };
...
Reg ax;
...
...
/* in the code */
Vm_state vm_state { };
vm_state.ax = { .used = false, .updated = false };

When removing the split between Vm_state_raw and Vm_state, Reg could be a template taking the register value type as argument. I would like this.

BTW, the first letter of struct/class names should be upper case (Genode coding style).

Don't use C bitfields where bit-shaving is not needed, use bool values.

What's the rationale behind the split between Vm_state_raw and Vm_state? Please document. If the split is needed, why have you used inheritance (complicated), not aggregation (simple)?

I want to enforce that C++ accessors are used for all members. That way it can be tracked easily whether the underlying kernel actually supports the required specific register (which is not the case in general by all the use kernels).

Therefore, using a struct for plain access I wanted to deny.

Are you fine with transforming the struct Vm_state in a class with private members then ?

That way it can be tracked easily whether the underlying kernel actually supports the required specific register

The VMM is expected to evaluate the used flag per register to probe for the degree of kernel support? Wouldn't a couple of (a few) explicit boolean members (e.g., in the form of tsc_offset_supported) be more straight forward and more expressive at the same time? This way, a VMM could reasonably know what varieties are to be expected.

Regarding the use of private members, I'd keep such bureaucracy at this low level as small as possible and therefore avoid it.

You left my question about the split unanswered. Do you agree to merge the Vm_state_raw with Vm_state?

The VMM is expected to evaluate the used flag per register to probe for the degree of kernel support?

It may so to make sure not to corrupt the guest state (by not having the actual register values).

On the way back from VMM to kernel, by evaluating the used value also selective guest state update are possible. Especially seL4 would cause quiet some overhead due to extra syscalls per Guest state register. So it is also a performance issue which can be avoided easily.

Wouldn't a couple of (a few) explicit boolean members (e.g., in the form of tsc_offset_supported) be more straight forward and more expressive at the same time? This way, a VMM could reasonably know what varieties are to be expected.

I fear it does not cover all cases. E.g. depending on the guest actually not all registers are needed, e.g. r8-r15. If you just deny to run a guest just because the underlying kernel says so in the beginning, you will not even get up your 32bit guest (e.g. seL4)

You left my question about the split unanswered.

Sorry, no - I answered it by the suggestion to put everything in a Vm_state class with accessors - which means: yes.

Thanks for the description. So if I understand correctly, there are some registers that are optional where others are not.

For the optional registers you want to distinguish the case where its value has a meaning from the case where it is not supported or not set (so the value remains undefined). Now I understand better.

What do you think about modelling such a register as follows? (valid is just a place holder. It could be updated or similar):

template <T>
class Optional_register
{
  private:

    bool _valid { false };
    T    _value { 0 };

  public:

    Optional_register() : _valid(false), _value(0) { }

    Optional_register(T value) : _valid(true), _value(value) { }

    T    value()  const { return _value; }
    void value(T value) { _value = value; _valid = true; }

    bool valid() const { return _valid; }
}

This would be in the spirit of your original macros but would use C++ only. The Vm_state would stay free of private members, and the Optional_register could be used solely where the register is actually optional.

Note that the implementation above still has the risk that T value() would return a value even if _valid is false. To make this waterproof, it may be sensible to replace it with a method like this (but I don't know how this fits with the VMM code):

template <typename FN>
void with_value(FN const &fn) const
{
  if (_valid)
    fn(_value);
}

The functor argument is called only if the register has a valid value. This is similar to how the Input::Event interface works (https://github.com/genodelabs/genode/blob/master/repos/os/include/input/event.h).

Thanks, i will come up with a proposal for the implementation.

@nfeske: please review cb83680

That looks excellent!

Ok, then i will continue to adjust the other commits.

I adjusted the nova implementation to the changed Vm_state class and tested it on hardware as also Qemu. Please review the commits on https://github.com/alex-ab/genode/commits/issue_3111.

Regarding the implementation of Vm_state::reset, wouldn't the following do the trick? (If Register had a default constructor that sets _valid to false, of course)

*this = Vm_state { };

Since this is quite superficial, the reset method could be dropped, actually.

Commit 0911a49 looks very nice. I like your attention to detail with describing the intention of the code (comments, naming). Even though the topic is complex, I'm not feeling lost.

Commit cb7f768 looks tidy and neat. Please look out for the inline remarks. Of course, my review remains shallow at this point.

What is the plan about the commented out code portions? (#if 0)

Regarding the implementation of Vm_state::reset, wouldn't the following do the trick? (If Register had a default constructor that sets _valid to false, of course)

*this = Vm_state { };
Since this is quite superficial, the reset method could be dropped, actually.

Will try to do so.

What is the plan about the commented out code portions? (#if 0)

Will be just removed.

@nfeske: I incorporated your comments so far and added a working version for Fiasco.OC ( https://github.com/alex-ab/genode/commits/issue_3111). Will continue to re-enable the sel4 version.

The commits look very good now. Thanks for incorporating my remarks. I'd be ready to merge the commits tomorrow, if you agree.

@alex-ab Should I go forward with merging the current commit series to staging?

@nfeske: My feeling is yes. I will then just come up with fixup commits if necessary.

Done. Crossing fingers about what the the autopilot will tell us tomorrow. ;-)

Added a working version for seL4 by d14f568 and 285d017. Admittedly, i will have to come up for foc and sel4 with some fixups regarding the left XXX

Commit alex-ab@d14f568 is an impressive piece of work. And it reeks of anecdotes. ;-)

I have nothing substantial to add. It's good to see that you adapted the use of lambdas as scope-local functions, e.g., in your rework of the Vm_space code. In my opinion, it helps to digest the complexity.

It's cool that the back ends for the different kernels are covered by nicely separated commits. Given the fact that the seL4 virtualization support is less complete than NOVA's, the difference between the nova and sel4 commits - especially the Vcpu implementation - is quite telling.

I merged both commits to staging. I'm a little curios if the virtual-memory-management changes affect the existing (non-VT) tests.

Admittedly, i will have to come up for foc and sel4 with some fixups regarding the left XXX

Don't worry. That's what we have staging for.

a9991fe and 5d21f7b are the proposal to implement the dataspace tracking and free-up in core via the vm_session interface. At the momemnt solely nova and hw are part of the commit (sel4 and foc is WIP).
@nfeske: may you please have a look whether the general direction is ok
@skalk: please check the hw part

@alex-ab Yes, a9991fe looks very good to me. It reflects perfectly our offline discussion we had yesterday.

The branch https://github.com/alex-ab/genode/commits/issue_3111_dataspace contains 4 commits which implement the tracking and freeing of dataspaces used via the vm_session for foc/sel4/hw/nova.

skalk commented

@alex-ab I've tried to follow the changes as far as possible. Apart from the one question I've asked inline, I have no objections.

Please review the commits on https://github.com/alex-ab/genode/commits/issue_3111_dataspace and consider for staging if applicable.

Regardless of the minor inline remark, the commit series looks very good to me. Thanks @alex-ab for improving the allocator-avl as a side effect of your work.

I'd be fine with both merging the series as is and attempt the code-deduplication at a subsequent step, or you refining the commit 58cee21 now. Alex, please decide.

I am fine to take the time now to remove some of the code-duplication.

@nfeske: I adjusted the commit message and added 4d7d012 which factors out the common part. Please re-check, thanks.

Excellent! Thanks for the super quick response.

Should 4d7d012 be squashed to 58cee21 to eliminate the intermediate state from the history?

You may try, yes. It could be that you get merge conflicts with the intermediate commits, possibly.

There are no merge conflicts. :-)

I merged the series with the squashed commits to staging now.

The commits on https://github.com/alex-ab/genode/commits/issue_3111_seoul_staging are ready for a review and staging if appropriate.

Please find my comments annotated at the commits. The last three commits could be squashed into one, in my opinion.

@nfeske: please review the adjusted branch considering your comments, https://github.com/alex-ab/genode/commits/issue_3111_seoul_staging

Thanks for reworking the commits so quickly.

I find the seoul commit delightful to look at. A lot of fairly obscure code (like the utcb locking, or the constraints of the local virtual address space) is replaced by code that is easy on the eyes. Congratulations for this very fine line of work!

Now that you dropped the NOVA-specific optimizations, do you observe a (subjective) performance difference, e.g., when running the tinycore-browser VM?

Up to now I could only test a quite outdated test VM with browser (seoul-fancy) and there it seems to be fine.

ad5053f fixes the seoul package build

@alex-ab I merged your commits to staging but now when seeing the commits below your last message, I wonder if I picked the right branch. I used the commits from your issue_3111_seoul_staging branch. Is this fine? If not, can you point me to the most current branch?

All fine, you took the right branch.

Please consider 16add85 and a02c048 for staging

Good progress this week, on Genode@NOVA our Ubuntu 32/64bit VMs and Windows 7 32bit VM with multiple vCPUs are functional via the new Genode VM session interface

With the commits on https://github.com/alex-ab/genode/commits/merge_to_staging the support of vbox5 with the new vm_session on nova is functional. The sel4 and foc version I don't think to get ready until the release. As discussed during one of the staff meetings, I added the vbox5 vm_session version as separate target, so that it does not interfere with the current vbox5-nova only version.

@alex-ab I merged your commit series to staging.

A bit too late for the release, but: Genode/Fiasco.OC now runs Ubuntu (32bit, 1 vCPU) and Windows 7 (32bit, 1 vCPU) with kernel-agnostic VBox5 on a devel branch

Wow!

๐Ÿ‘ Windows 7 on Fiasco.OC ;-)

skalk commented

Unfortunately, there are no sounds on Github, otherwise I'd like to play the "impressive" sound from quake for you.

@chelmuth, @nfeske: please consider 2be2285 ebae769 f6afc98 for staging (issue_3111_staging branch)

Merged the series.

@chelmuth: please add cfe5b0d 91bc64e f38fa0a to staging

The VMMs are kernel independent now, so I'm closing the issue. For enablement/improvements of specific base-<kernel> platforms we should open separate issues.