WICG/turtledove

Bidding worklet performance limitations

barteklos opened this issue · 12 comments

Hi,

We have started experimenting with the current FLEDGE implementation in Chromium. As part of this, we have prepared end-to-end functional and performance tests.

In this issue we would like to discuss the bidding worklet's performance limitations in the context of realistic bidding logic. To give an example, our production generateBid() implementation could evaluate a feed-forward neural network with 3-4 layers (repeated for 5 different ML models), in which case it would look like this:

function generateBid(interestGroup, auctionSignals, perBuyerSignals, trustedBiddingSignals, browserSignals) {

   const nn_model_1_weights = [
       [[1.23, 3.14, 2.7...], [100.1, 100.2,...], ...], // 200x200 matrix
       [...], // 200x100 matrix
       [...], // 100x50 matrix
       [...], // 50x1 matrix
   ]; // hard-coded weights for the 1st model (e.g. CTR, CR, CV)

   const nn_model_2_weights = [...]; // hard-coded weights for the 2nd model

   const nn_model_3_weights = [...]; // hard-coded weights for the 3rd model

   const nn_model_4_weights = [...]; // hard-coded weights for the 4th model

   const nn_model_5_weights = [...]; // hard-coded weights for the 5th model

   let input = extractFeatures(interestGroup, auctionSignals, perBuyerSignals,
                               trustedBiddingSignals, browserSignals); // vector of 200 floats

   let bid = nn_forward(input, nn_model_1_weights) * nn_forward(input, nn_model_2_weights)
                * nn_forward(input, nn_model_3_weights) * nn_forward(input, nn_model_4_weights)
                * nn_forward(input, nn_model_5_weights);

   let ad = ... 

   let renderUrl = ...

   return {'ad': ad, 'bid': bid, 'render': renderUrl};
}

where extractFeatures() extracts a vector of 200 features (from the signals and the interest group's data) and nn_forward() is:

function nn_forward(input, nn_model_weights) {
    let X = input; // vector of 200 floats
    X = relu(multiply(nn_model_weights[0], X)); // nn_model_weights[0] - 200x200 matrix
    X = relu(multiply(nn_model_weights[1], X)); // nn_model_weights[1] - 200x100 matrix
    X = relu(multiply(nn_model_weights[2], X)); // nn_model_weights[2] - 100x50 matrix
    X = relu(multiply(nn_model_weights[3], X)); // nn_model_weights[3] - 50x1 matrix
    return X[0];
}
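
For completeness, minimal implementations of the relu() and multiply() helpers assumed above could look as follows (a sketch only; the [inputDim][outputDim] weight layout is our assumption, chosen to match the matrix dimensions in the comments):

function multiply(matrix, x) { // matrix: [inputDim][outputDim], x: vector of inputDim floats
    const outputDim = matrix[0].length;
    const result = new Array(outputDim).fill(0);
    for (let i = 0; i < matrix.length; i++) {
        for (let j = 0; j < outputDim; j++) {
            result[j] += matrix[i][j] * x[i];
        }
    }
    return result; // vector of outputDim floats
}

function relu(x) { // element-wise rectified linear unit
    return x.map((v) => Math.max(0, v));
}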

This is an extremely simplified version of generateBid() which focuses on multiplying the input values by the hard-coded model weights. We can expect a lot of additional boilerplate code around this (choosing the best ad, model feature extraction, capping & targeting logic, brand safety etc.), but even such a simple example is enough to illustrate the performance limitations of the current implementation.

We have results from benchmarks for two different environments running the same generateBid() function:

no. | test environment                                     | code run as               | time spent on generateBid()
1   | V8 engine with JIT                                   | tight loop with a warm-up | 1.12 ms
2   | bidding worklet (with its limitations: jitless etc.) | buyer's js                | 55.68 ms

In conclusion, we can see a significant performance drop (almost 50x) for a bidding worklet compared to an optimal environment. What is more, this use case can easily exceed the worklet's timeout (which is 50 ms).

Do you have any thoughts on how to optimize generateBid() code in such an execution environment? Are there any plans to provide a more effective bidding worklet?

Best regards,
Bartosz

We would like to follow up on this issue and the discussion we had on the same topic.

As suggested:

  • we put our patch into the Chromium repository,
  • we replaced node.js with V8 in benchmark 1 to make it suitable for comparison.

We have added some additional functional and performance tests to our framework. In particular, we have run benchmarks in V8 which execute generateBid() and compare its performance with and without WebAssembly:

no. | test environment       | code run as                           | time spent on generateBid()
3   | V8 engine without wasm | buyer's js                            | 54.12 ms
4   | V8 engine with wasm    | buyer's js with wasm binary hardcoded | 4.93 ms

The performance seems to be good enough in the mentioned scenario and we believe that similar results could be achieved in a bidding worklet.

In the WebAssembly benchmark we hardcoded the wasm binary and instantiated it in generateBid(), which means the result could be improved further by:

  • preloading binary (wasm) resources (A),
  • caching compiled webassembly modules (B)

which would save an additional 1.35 ms in that case. (A sketch of the current, hardcoded approach follows.)
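
For reference, a minimal sketch of what benchmark 4 does (names and the base64 encoding are illustrative, not our exact code): the wasm binary is hardcoded in the buyer's js and compiled and instantiated synchronously inside generateBid(), which is exactly the work that (A) and (B) would remove:

const WASM_BASE64 = '...'; // placeholder for the encoded wasm binary

function instantiateModels() {
    // Assumes an atob-like base64 decoder is available in the worklet.
    const bytes = Uint8Array.from(atob(WASM_BASE64), (c) => c.charCodeAt(0));
    // Synchronous compile + instantiate (the module is assumed to have no imports);
    // this is the ~1.35 ms that preloading (A) or caching (B) would save per call.
    const module = new WebAssembly.Module(bytes);
    return new WebAssembly.Instance(module).exports;
}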

All in all, would it be an option to provide a bidding worklet implementation with support for WebAssembly? If so, would it be an option to provide some API extensions to achieve (A) and (B)?

We can see some additional benefits of such support (better performance on less powerful hardware, reduced script initialization and model-weight parsing time, additional obfuscation, potentially easier migration of our current code, availability of SIMD operations etc.).

@JensenPaul, a friendly ping for visibility.

  • I can overall confirm the qualitative execution time findings; our current suspicion is that the isolation between executions means some things are not cached across them.
  • There is a further problem of expensive parsing time of those large scripts; this in particular hurts the otherwise attractive WASM version, since while its execution time is good, the parsing time isn't. V8 does seem to cache the parse within the process (but this is not persistent).
  • I am planning to see whether compilation caching, similar to what Blink does, will help; but please note there is a lot of thinking to be done on whether this can meet the isolation properties we want, and it of course doesn't help the first time something is run. The first step is a non-persistent toy prototype just to see if it helps or not.

Thank you @morlovich for looking into the issue.

For the record, we have provided another patch which turns on WebAssembly in Chromium, so we were able to run a similar benchmark in a bidding worklet:

no. | test environment                    | code run as                           | time spent on generateBid()
5   | bidding worklet (with wasm support) | buyer's js with wasm binary hardcoded | 6.07 ms

In this benchmark, the bidding worklet spends time on:

  • initializing js: 3.81 ms,
  • calling js: 2.25 ms (which includes compiling wasm: 2.04 ms and calling it: 0.21 ms).

Just in case you are not aware of this, we would like to share our findings:

  • Our first attempt to run a benchmark with WebAssembly in Chromium was not successful. There was a significant difference between V8 and the bidding worklet, mainly in the case of wasm, and our test case took over 26 ms. This was because we were compiling Chromium with the default flags, which add debug asserts. The solution was to build Chromium with dcheck_always_on=false. The official raw build of Chromium seems to have the same overhead, but the Chrome release uses is_official_build=true, which also turns off these debug asserts.

  • Chrome supports caching compiled WebAssembly modules, and we were wondering if a similar mechanism could be used in the case of the bidding worklet (reference: this blog post). It requires using the WebAssembly.compile and WebAssembly.instantiate APIs (which are async) and storing the compiled wasm modules in a database. Do you have a similar approach in mind? A sketch of that pattern is below.
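
For illustration, the pattern from that blog post looks roughly like this (a sketch only; idbGet()/idbPut() stand for small hypothetical wrappers around an IndexedDB object store, and whether a worklet could be given access to such storage is exactly the open question):

async function getCompiledModule(db, url) {
    const cached = await idbGet(db, url); // a previously stored WebAssembly.Module, or undefined
    if (cached) return cached;
    const bytes = await (await fetch(url)).arrayBuffer();
    const module = await WebAssembly.compile(bytes); // async compile
    await idbPut(db, url, module); // structured-clones the compiled module into the database
    return module;
}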

So you're measuring something very different from me for parsing numbers; are you including the time in AuctionV8Helper::Compile?

Right, AuctionV8Helper::RunScript runs:

  • v8::Script::Run (which is js initialization, not js parsing!) and it takes 3.81 ms,
  • v8::Function::Call (which is js call) and it takes 2.25 ms.

I have edited a previous comment to avoid confusion.

AuctionV8Helper::Compile takes 292.59 ms in this scenario. I did not take this into account mainly because the bidding worklet's timeout does not include the time for js loading and js compiling. However, I must admit that in the case of a huge js script (with model weights or a wasm binary hardcoded) it could have some impact on overall performance, especially since AuctionV8Helper::Compile is called twice per auction, in the contexts of generateBid and reportWin. The script itself could potentially be cached by the network layer, but the compiled js would not be cached in the current implementation.

The table below shows adjusted results for benchmark 2 and benchmark 5 (run with a new Chromium build without debug asserts):

no. | time spent on AuctionV8Helper::Compile (not included in timeout) | time spent on AuctionV8Helper::RunScript (included in timeout)
2   | 86.88 ms                                                         | 53.56 ms
5   | 292.59 ms                                                        | 6.07 ms

Thanks for confirming this. My current plans are largely based around the ::Compile column, since I feel like WASM execution time is probably good enough, at least initially. There are two current directions:

  1. The sort of caching I mentioned a few comments back. At that point I thought, incorrectly, that it would help RunScript time, but it doesn't. It helps with the ::Compile time only, though cache entries for such scripts are also somewhat unwieldy, and I don't have a good idea of their costs.
  2. Providing a better way to encode WASM. The option I am trying right now is a base64 encoding that you just pass to WebAssembly.compileStreaming (roughly sketched below). I don't know how well it actually runs, since I also need to figure out how to deal with async functions correctly. The compileStreaming API has its own caching interface too, and that one caches compilation and not just parsing, which may also be worth trying, though things may be good enough without it.
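
For reference, a rough sketch of the decode-and-compile path using only standard APIs (WebAssembly.compileStreaming() itself takes a Response, so this sketch decodes to bytes and uses WebAssembly.compile() instead; the base64-to-compileStreaming route above would be a Chromium-side extension):

async function compileFromBase64(b64) {
    const bytes = Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
    return WebAssembly.compile(bytes); // resolves to a WebAssembly.Module
}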

P.S. The things in model_weights.h could probably use some 'const's, though I don't know if it makes any difference for WASM.

So to update you, 99.0.4781.0 canary has JIT on.
I also have a lot of the pieces of better WASM support in place, and some more in progress, though it's not hooked up to the actual API yet. (I tested it on your benchmark by editing the InterestGroups database with the sqlite command line, with https://chromium-review.googlesource.com/c/chromium/src/+/3349585 patched in.)

The overall shape is basically that when registering an interest group one can (if desired) provide a URL to a binary wasm file (like functions.wasm in your benchmark), and the JavaScript gets a WebAssembly.Module passed in. That avoids the bad parse times, since the binary WASM format is far better than JavaScript source at storing large amounts of data such as matrices, and the performance is OK.

https://chromium-review.googlesource.com/c/chromium/src/+/3353189/ also adds a preliminary API: pass in biddingWasmHelperUrl when joining the interest group, and get a WebAssembly.Module of it as browserSignals.wasmHelper in your generateBid, as sketched below. (Names all might change; I haven't done the proper proposal process for it yet.)
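
A minimal sketch of how that API would be used (the join-time fields follow the FLEDGE explainer; computeBid() is a hypothetical export of functions.wasm):

// At join time, point the interest group at the wasm helper:
navigator.joinAdInterestGroup({
    owner: 'https://buyer.example',
    name: 'ml-models',
    biddingLogicUrl: 'https://buyer.example/bid.js',
    biddingWasmHelperUrl: 'https://buyer.example/functions.wasm',
    ads: [{renderUrl: 'https://buyer.example/ad.html'}],
}, 30 * 24 * 60 * 60); // lifetime in seconds

// In generateBid(), the pre-compiled module arrives in browserSignals:
function generateBid(interestGroup, auctionSignals, perBuyerSignals,
                     trustedBiddingSignals, browserSignals) {
    const instance = new WebAssembly.Instance(browserSignals.wasmHelper);
    const bid = instance.exports.computeBid(/* feature vector */);
    return {ad: {}, bid: bid, render: interestGroup.ads[0].renderUrl};
}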

FWIW: #250

Thank you @morlovich for your update and efforts to provide WASM support in the bidding worklet.

The mentioned API draft suggests that you are considering the most optimal scenario. Indeed, removing the hardcoded WASM binary from JS and passing the compiled WASM module directly into generateBid should reduce the time taken to initialize and call JS, according to our benchmarks.

Thanks. It's in 99.0.4834.0

Closing as JIT and WASM support were added. Feel free to reopen if you have more questions.