cgo-batching

CGO call batching benchmark

This is an experiment to measure the overhead of CGO calls. In particular, it measures the performance difference between standard CGO calls and batched ones (e.g. making one CGO call that executes five C functions).

See discussion at azul3d-legacy/issues#17
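
To make the idea concrete, here is a minimal sketch of what a batched CGO call can look like, assuming a hypothetical opcode-dispatch scheme (the names opA, opB and runBatch are illustrative stand-ins, not the benchmark's actual code):

```go
package main

/*
// Hypothetical C functions standing in for real library calls
// (e.g. OpenGL entry points).
static void opA(void) {}
static void opB(void) {}

// A single CGO entry point that dispatches a whole batch of C
// calls, paying the Go<->C transition cost only once.
static void runBatch(unsigned char *ops, int n) {
	int i;
	for (i = 0; i < n; i++) {
		switch (ops[i]) {
		case 0: opA(); break;
		case 1: opB(); break;
		}
	}
}
*/
import "C"
import "unsafe"

func main() {
	// Unbatched: five CGO calls, five Go<->C transitions.
	for i := 0; i < 5; i++ {
		C.opA()
	}

	// Batched: record five opcodes, then one CGO call runs them all.
	ops := []C.uchar{0, 1, 0, 1, 0}
	C.runBatch((*C.uchar)(unsafe.Pointer(&ops[0])), C.int(len(ops)))
}
```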

Arguments

First, we can measure the cost of pushing arguments onto the batch's argument stack, which incurs some copy overhead. Note that in this test the plain CGO calls still pass zero arguments (only the batched version actually passes arguments), so the with-CGO timings are not meaningful here:

Batch Size   Calls    Stack Arguments   With CGO      With Batching
5            350000   0                 45.688143ms   17.415343ms
5            350000   5                 53.212283ms   26.057713ms
5            350000   10                46.108098ms   36.445277ms
5            350000   15                46.636866ms   44.606929ms
5            350000   20                46.52952ms    51.624302ms
5            350000   25                46.161806ms   59.803066ms

From this we can determine that:

59.803066ms - 17.415343ms ≈ 42.39ms
42.39ms / 25 arguments ≈ 1.69ms per argument

That is, over this run of 350,000 calls, each additional stack argument adds roughly 1.69ms of overhead due to the cost of copying it onto the argument stack.
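
As a sketch of where that copy comes from, here is one hypothetical way batched calls can queue their arguments: each Go-side call appends its arguments to a flat slice (the copy being measured), and the single CGO call reads them back off in C. All names here are illustrative, not the benchmark's actual code:

```go
package main

/*
static void opWithArgs(int a, int b) { (void)a; (void)b; }

// Runs n queued calls, reading two int arguments for each one
// from a flat argument stack.
static void runBatch(int *args, int n) {
	int i;
	for (i = 0; i < n; i++) {
		opWithArgs(args[2*i], args[2*i+1]);
	}
}
*/
import "C"
import "unsafe"

// batch accumulates arguments for deferred C calls. Every value
// appended here is one extra copy, which is the per-argument
// overhead measured above.
type batch struct {
	args  []C.int
	calls int
}

func (b *batch) opWithArgs(x, y int) {
	// Push both arguments onto the argument stack.
	b.args = append(b.args, C.int(x), C.int(y))
	b.calls++
}

func (b *batch) flush() {
	if b.calls == 0 {
		return
	}
	// One CGO call executes every queued C call.
	C.runBatch((*C.int)(unsafe.Pointer(&b.args[0])), C.int(b.calls))
	b.args = b.args[:0]
	b.calls = 0
}

func main() {
	var b batch
	for i := 0; i < 5; i++ {
		b.opWithArgs(i, i*2)
	}
	b.flush()
}
```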

Calls

Now we can look at the number of calls. A small OpenGL application might use ~2,000 C calls, a large AAA game might use ~100,000. We can see here that as the number of calls increases, the total CGO overhead grows with it, making batching more significant:

Batch Size   Calls     Stack Arguments   With CGO       With Batching
5            1000      5                 278.877µs      151.485µs
5            5000      5                 1.31546ms      743.46µs
5            10000     5                 1.298139ms     786.134µs
5            100000    5                 15.337915ms    7.587906ms
5            1000000   5                 132.128883ms   74.779551ms

This data shows that even with a small number of C calls (1,000), batching only five C calls at a time can lower the CGO overhead by roughly 50%.
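
For reference, timings like the ones in these tables could be gathered with a simple manual timing loop along the following lines. This is only a sketch of the measurement approach, not the benchmark's actual source (op and runBatch are stand-ins); the printed values are time.Duration strings, matching the formatting in the tables above:

```go
package main

/*
static void op(void) {}

// One CGO call that runs n C calls.
static void runBatch(int n) {
	int i;
	for (i = 0; i < n; i++) {
		op();
	}
}
*/
import "C"

import (
	"fmt"
	"time"
)

const (
	calls     = 100000
	batchSize = 5
)

func main() {
	// One Go<->C transition per C call.
	start := time.Now()
	for i := 0; i < calls; i++ {
		C.op()
	}
	fmt.Println("With CGO:     ", time.Since(start))

	// One Go<->C transition per batchSize C calls.
	start = time.Now()
	for i := 0; i < calls; i += batchSize {
		C.runBatch(C.int(batchSize))
	}
	fmt.Println("With Batching:", time.Since(start))
}
```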

Batch Size

One last test: what if we run the same benchmark as above, but increase or decrease the batch size (i.e. instead of making one CGO call per five C calls, as we did above)?

Batch Size   Calls    Stack Arguments   With CGO      With Batching
1            100000   5                 16.047711ms   21.62803ms
5            100000   5                 14.696841ms   7.61242ms
10           100000   5                 13.089094ms   6.379302ms
15           100000   5                 16.557063ms   5.781183ms
20           100000   5                 16.550219ms   5.548122ms
25           100000   5                 14.878079ms   5.41668ms
30           100000   5                 16.301934ms   5.53583ms

From this we can gather that with a batch size of one (which is identical to not using batching at all, plus the batching bookkeeping) we actually lose some performance, as is to be expected. The performance gain is also not linear: at around a batch size of 15 it caps out, and larger batches stop yielding meaningful improvements.
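
A practical consequence is that a batching layer can simply flush itself once a fixed batch size is reached. Here is a hypothetical sketch of that design (batcher, op and runBatch are illustrative names; the size of 15 just reflects the plateau observed above):

```go
package main

/*
static void op(void) {}

// One CGO call that runs n queued C calls.
static void runBatch(int n) {
	int i;
	for (i = 0; i < n; i++) {
		op();
	}
}
*/
import "C"

// batcher queues C calls and flushes automatically once the
// configured batch size is reached, so a caller never pays more
// than one Go<->C transition per `size` C calls.
type batcher struct {
	size    int
	pending int
}

func (b *batcher) op() {
	b.pending++
	if b.pending == b.size {
		b.flush()
	}
}

func (b *batcher) flush() {
	if b.pending > 0 {
		C.runBatch(C.int(b.pending))
		b.pending = 0
	}
}

func main() {
	// Per the results above, sizes beyond ~15 yield diminishing returns.
	b := &batcher{size: 15}
	for i := 0; i < 100000; i++ {
		b.op()
	}
	b.flush() // run any leftover calls
}
```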