aptos-foundation/AIPs

[AIP-X] Compiler release levels for optimal inlining

Closed this issue · 2 comments

@wrwg @davidiw @junkil-park @runtian-zhou @brmataptos

Overview

This AIP proposes compiler release levels, akin to those of cargo, so that
Move code can be tuned for minimal execution cost in production and for fast
compilation during testing.

Generally, there is a tradeoff between the size of a package and how expensive
it is to execute: more repeated code and manually-inlined functions cost less
gas but produce more bytecode. Hence a developer who practices function
abstraction during development may later be forced to de-abstract their code
into long, repetitive blocks in order to save gas, simply because abstracted
code incurs post-compilation runtime inefficiencies.

Generally, this problem could be solved via two release flags:

  1. --release: inlines as much as possible within or even across packages,
    eliminating duplicate operations and consolidating logic so that fewer
    function calls land on the stack.
  2. --debug: compiles code as quickly as possible, without any optimizations,
    in the interest of quick testing, evaluating coverage, etc.
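
For reference, this mirrors cargo's existing split between its default (dev)
profile and its release profile. The aptos move compile flags shown below are
only the proposal of this AIP and do not exist today:

# cargo today: the dev profile favors fast compilation with no
# optimization, while --release favors runtime performance.
cargo build            # dev profile: opt-level = 0
cargo build --release  # release profile: opt-level = 3

# Proposed analogue for Move (hypothetical flags):
aptos move compile --debug    # fast builds for testing and coverage
aptos move compile --release  # fully optimized bytecode for production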

Example

Sample module

Consider the below sample module:

module template::main {

    struct Value has key { inner: u8 }

    public entry fun add_then_multiply(val_address: address, add_amount: u8, multiply_amount: u8)
    acquires Value {
        add(val_address, add_amount);
        multiply(val_address, multiply_amount);
    }

    public entry fun add(val_address: address, amount: u8) acquires Value {
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut + amount;
    }

    public entry fun multiply(val_address: address, amount: u8) acquires Value {
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut * amount;
    }
}

Effective compilation, no inlining

In the current paradigm, add_then_multiply effectively compiles as if the
original source were:

module template::main {

    struct Value has key { inner: u8 }

    public entry fun add_then_multiply(val_address: address, add_amount: u8, multiply_amount: u8)
    acquires Value {
        //
        // Function call on the stack.
        //
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut + add_amount;
        //
        // Function call on the stack.
        //
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut * multiply_amount;
    }
}

Inlining complications

The gas costs associated with function calls can be eliminated by inlining, but
the implementation is complicated by the fact that public entry functions must
still wrap the inlined inner functions in order to preserve the ABI:

module template::main {

    struct Value has key { inner: u8 }

    public entry fun add_then_multiply(val_address: address, add_amount: u8, multiply_amount: u8)
    acquires Value {
        add_inner(val_address, add_amount);
        multiply_inner(val_address, multiply_amount);
    }

    public entry fun add(val_address: address, amount: u8) acquires Value {
        add_inner(val_address, amount);
    }

    inline fun add_inner(val_address: address, amount: u8) acquires Value {
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut + amount;
    }

    public entry fun multiply(val_address: address, amount: u8) acquires Value {
        multiply_inner(val_address, amount);
    }

    inline fun multiply_inner(val_address: address, amount: u8) acquires Value {
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut * amount;
    }
}

However, add_then_multiply is still not optimized because the inlined code
results in repeated operations, effectively:

module template::main {

    struct Value has key { inner: u8 }

    public entry fun add_then_multiply(val_address: address, add_amount: u8, multiply_amount: u8)
    acquires Value {
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut + add_amount;
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut * multiply_amount;
    }
}

Optimal compilation

Ideally, for --release, the compiler would optimize out the double
existence/borrow calls, without requiring the developer to create inline
functions, as if the original source code were effectively written:

module template::main {

    struct Value has key { inner: u8 }

    public entry fun add_then_multiply(val_address: address, add_amount: u8, multiply_amount: u8)
    acquires Value {
        // After release optimizations.
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut + add_amount;
        *val_ref_mut = *val_ref_mut * multiply_amount;
        // As written:
        // add(val_address, add_amount);
        // multiply(val_address, multiply_amount);
    }

    public entry fun add(val_address: address, amount: u8) acquires Value {
        // As written.
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut + amount;
    }

    public entry fun multiply(val_address: address, amount: u8) acquires Value {
        // As written.
        assert!(exists<Value>(val_address), 0);
        let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
        *val_ref_mut = *val_ref_mut * amount;
    }
}

Logical branching

Another related optimization is the consolidation of logical branching:

module template::main {

    struct Value has key { inner: u8 }

    public entry fun add_then_multiply(
        val_address: address,
        add_amount: u8,
        multiply_amount: u8,
        do_it: bool,
    ) acquires Value {
        // After release optimizations.
        if (do_it) {
            assert!(exists<Value>(val_address), 0);
            let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
            *val_ref_mut = *val_ref_mut + add_amount;
            *val_ref_mut = *val_ref_mut * multiply_amount;
        }
        // As written:
        // add(val_address, add_amount, do_it);
        // multiply(val_address, multiply_amount, do_it);
    }

    public entry fun add(val_address: address, amount: u8, do_it: bool) acquires Value {
        // As written.
        if (do_it) {
            assert!(exists<Value>(val_address), 0);
            let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
            *val_ref_mut = *val_ref_mut + amount;
        }
    }

    public entry fun multiply(val_address: address, amount: u8, do_it: bool) acquires Value {
        // As written.
        if (do_it) {
            assert!(exists<Value>(val_address), 0);
            let val_ref_mut = &mut borrow_global_mut<Value>(val_address).inner;
            *val_ref_mut = *val_ref_mut * amount;
        }
    }
}

I have to say, the current compiler is unable to unpack a vector and then pack
its contents into another user-defined struct without overhead, which I think
should be possible with zero overhead in Move assembly.

struct ABC has copy, drop {
    a1: u8,
    a2: u8,
    a3: u8,
    a4: u8,
}

public entry fun test(s: &signer, data: vector<u8>)

This always produces overhead, because each element has to be read out of the
vector individually:

let a = *vector::borrow(&data, 0);
let b = *vector::borrow(&data, 1);
let c = *vector::borrow(&data, 2);
let d = *vector::borrow(&data, 3);
let f = ABC { a1: a, a2: b, a3: c, a4: d };

But what if we could write something like the following, with the Move VM
managing the stack data directly, for zero code overhead:

unpack<vector<u8>>(data);
let f = pack<ABC>();

Here, the variable "data" must have a length of exactly 4, as defined by
struct ABC; we cannot check the length of a vector<T> at compile time.

With this, we could unpack the data once and then pack it multiple times into
different structs, manipulate the elements one by one, or pass them directly to
any function; any return values would go onto the Move VM stack, so that
chained function calls touch only Move VM stack data.
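
A sketch of that repacking workflow, assuming the hypothetical unpack/pack
intrinsics proposed in this comment (the struct XYZ is made up for
illustration; none of this is valid Move today):

// Hypothetical intrinsics: not valid Move today.
unpack<vector<u8>>(data);  // push the four elements onto the VM stack
let f = pack<ABC>();       // consume four u8 stack values into a struct
unpack<ABC>(f);            // push the fields back onto the stack
let g = pack<XYZ>();       // repack the same values into a different struct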

Closed due to inactivity