An idea - a `Runner` that owns the input and output arrays and freezes shapes and names
aldanor opened this issue · 5 comments
Bottom line - there's tons of overhead in `run()` currently:
- It allocates something like 15 `Vec` instances and a bunch of strings; there's tons of allocations all over the place (so for small inputs and graphs this is noticeable)
- For big inputs, you are currently required to copy the data in
- There's a lot of overhead like building name vecs (should be done upon model load?) and shapes (if there are no dynamic axes, no need to do that repeatedly)
- There are allocations for outputs as well
Here's one idea - what if you could do something like this? (I think this way you could bring the overhead down to almost zero.)

```rust
// maybe I've missed something, would like to hear your thoughts, @nbigaouette :)
// note that this is all simplified, as it may require e.g. Pin<> in a few places
struct Runner {
    session: Session,
    inputs: Vec<Array<...>>,
    // owned preallocated outputs as well?
    input_names: Vec<CString>,
    output_names: Vec<CString>,
}

impl Runner {
    fn from_session(session: Session) -> Self { ... }
    pub fn execute(&mut self) -> Result<()> { ... }
    pub fn outputs(&self) -> &[Array<...>] { ... }
    pub fn inputs(&mut self) -> &mut [Array<...>] { ... }
}
```
```rust
let mut session: Session = ...;
let input_arrays: Vec<...> = ...;

// this executes most of what `run()` currently does, all the way up to the actual .Run() call
let mut runner = session.into_runner(input_arrays);
runner.execute()?; // this just calls Run() and converts the status

// if outputs are preallocated, no extra allocations here either
for out in runner.outputs() {
    dbg!(out);
}

// no allocations, no boilerplate, we're just updating the inputs
runner.inputs()[0].fill(42.0);
// no allocations, no boilerplate, just a .Run() call
runner.execute()?;
```
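Just to make the "frozen pointers" part concrete, here's a minimal self-contained sketch (illustrative names only - not the crate's actual internals) of how the name-pointer arrays that the raw `.Run()` call wants could be built once and then reused on every `execute()`:

```rust
use std::ffi::CString;
use std::os::raw::c_char;

// Illustrative sketch: once the CString names are owned by the runner, the
// `*const c_char` arrays the C API expects can be built a single time.
fn freeze_name_ptrs(names: &[CString]) -> Vec<*const c_char> {
    names.iter().map(|name| name.as_ptr()).collect()
}

fn main() {
    // These would be cached on the session/runner, not rebuilt per call.
    let input_names = vec![CString::new("data").unwrap()];
    let output_names = vec![CString::new("output").unwrap()];

    // The pointer vecs must live alongside the CStrings that back them.
    let input_name_ptrs = freeze_name_ptrs(&input_names);
    let output_name_ptrs = freeze_name_ptrs(&output_names);

    assert_eq!(input_name_ptrs.len(), input_names.len());
    assert_eq!(output_name_ptrs.len(), output_names.len());
}
```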
In fact, I think the current `run()` can probably even be expressed in terms of the above. This may require it to hold a mutable reference to the session though, i.e. `Runner<'a> { session: &'a mut Session, ... }`.
So to retain the current API you could have

```rust
impl Session {
    pub fn run(&mut self, inputs: Inputs) -> Result<Outputs> {
        let mut runner = Runner::new(self, inputs)?;
        runner.execute()?;
        Ok(runner.into_outputs())
    }
}
```
(As noted in #39 though, things like caching names should probably be done outside of all of this anyway, upon model loading; also precaching shapes when there are no dynamic axes, etc.)
Note: the above will probably not compile because of potential multiple mutable borrows etc., but those are technical details - it can be made to work with a bit of munging and shuffling; I just tried to make the general idea clear.
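On the shape-precaching point: a small self-contained sketch (hypothetical helper, not part of the current API) of deciding at model-load time whether a shape can be frozen; dynamic/symbolic axes are assumed to show up as non-positive dimensions here:

```rust
// Hypothetical helper: cache a concrete shape at load time only when every
// dimension is statically known (dynamic axes assumed to be <= 0 here).
fn precache_shape(dims: &[i64]) -> Option<Vec<usize>> {
    dims.iter()
        .map(|&d| if d > 0 { Some(d as usize) } else { None })
        .collect()
}

fn main() {
    // fully static: can be cached once when the model is loaded
    assert_eq!(precache_shape(&[1, 3, 224, 224]), Some(vec![1, 3, 224, 224]));
    // dynamic batch dimension: has to be resolved per call instead
    assert_eq!(precache_shape(&[-1, 3, 224, 224]), None);
}
```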
In fact, thinking about it further, a `Runner` is almost like a `Session` where the input shape is known. Then basically we can preallocate everything, including inputs and outputs, and all pointers can be frozen. Also, you don't even need to pass input arrays to create a runner - it can just zero-initialise some inputs; it just needs the shape(s).
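A minimal sketch of that construction path, assuming `ndarray` buffers and a hypothetical `preallocate` helper (a real runner would of course also need to wire these buffers up for the `.Run()` call):

```rust
use ndarray::{ArrayD, IxDyn};

// Hypothetical helper: with the shapes known up front, a runner can
// zero-initialise its own input/output buffers - no initial data needed.
fn preallocate(shapes: &[Vec<usize>]) -> Vec<ArrayD<f32>> {
    shapes
        .iter()
        .map(|shape| ArrayD::<f32>::zeros(IxDyn(shape)))
        .collect()
}

fn main() {
    // e.g. one NCHW image input and one logits output
    let inputs = preallocate(&[vec![1, 3, 224, 224]]);
    let outputs = preallocate(&[vec![1, 1000]]);
    assert_eq!(inputs[0].shape(), &[1, 3, 224, 224]);
    assert_eq!(outputs[0].shape(), &[1, 1000]);
}
```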
I think, in most practical cases where the speed of execution would be critical (realtime apps), input shape would almost always be frozen and known in advance, so all dimensions would be fully known and the only thing that would change would be the inputs themselves (e.g. receiving frames from a camera, etc).
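For illustration, the hot path for something like a camera feed could then look roughly like this; it builds on the hypothetical `Runner` API sketched above (so it won't compile on its own) and assumes the input tensor is contiguous and each frame has exactly the input's length:

```rust
// Illustrative only - uses the Runner methods proposed above plus ndarray's
// as_slice_mut(); every frame reuses the same preallocated buffers.
fn process_frames(runner: &mut Runner, frames: &[Vec<f32>]) -> Result<()> {
    for frame in frames {
        // overwrite the preallocated input in place (no new allocation);
        // panics if the buffer isn't contiguous or the lengths don't match
        runner.inputs()[0]
            .as_slice_mut()
            .expect("contiguous input buffer")
            .copy_from_slice(frame);

        // just the .Run() call
        runner.execute()?;

        // borrow the preallocated output, again without copying
        let prediction = &runner.outputs()[0];
        // ... do something with `prediction` ...
        let _ = prediction;
    }
    Ok(())
}
```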
I have a prototype that I can try to push tonight. In brief, it reduces execution time of a tiny graph with a few nodes from 15us to 8us (almost a 2x speedup), plus there are no more extractors, no allocations, and no copies or clones (as suggested above).
This seems like a good fit for my use case: a service that loads precisely one .onnx file, and then feeds data from each request through the resulting session.
@marshallpierce See #41 for a preliminary working implementation