mesonbuild/meson

Speed up compiler checks by parallelizing where possible

nirbheek opened this issue · 53 comments

Some compiler checks can be parallelized, for example:

  • get_supported_arguments()
  • get_supported_link_arguments()
  • cross_compute_int(): there are two branches at each guess, so we can run both at the same time, which will improve the worst-case time.

We should also add a map-like API that batch-processes several has_function() and has_header() checks at once and sets values in a configuration_data() object. These checks are usually of the form:

cdata = configuration_data()
if cc.has_function(func_name, prefix: includes)
   cdata.set(func_name.to_upper().underscorify(), 1)
endif

Normally this would require lambda support (or similar), but since the function in this case is almost always the same, we can hard-code it in Meson. It would be a good default in case we want to extend it later anyway.

One simple way to parallelize has_header() and has_function() is to add get_supported_headers() and get_supported_functions(). That would definitely speed up that kind of foreach loop: https://gitlab.gnome.org/GNOME/glib/blob/master/meson.build#L260.

I have copied the list of 47 headers from @xclaesse's link into an empty project, hacked in parallelism using concurrent.futures (a sketch follows the results), and timed meson setup. The results I get (laptop with 2 cores, minimum of multiple runs) are:

  • Same thread: 0m0.904s.
  • ThreadPoolExecutor(1): 0m0.932s.
  • ThreadPoolExecutor(2): 0m0.687s.
  • ThreadPoolExecutor(3) and up: same as 2 threads.
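
For illustration, a minimal sketch of that kind of concurrent.futures hack (the compiler invocation and the header list here are stand-ins, not the actual patch):

from concurrent.futures import ThreadPoolExecutor
import subprocess

HEADERS = ['stdio.h', 'stdlib.h', 'dlfcn.h']  # the real list had 47 entries

def has_header(header):
    # Each check is one preprocessor-only compiler invocation.
    src = '#include <{}>\n'.format(header).encode()
    proc = subprocess.run(['cc', '-E', '-x', 'c', '-', '-o', '/dev/null'],
                          input=src, stderr=subprocess.DEVNULL)
    return header, proc.returncode == 0

with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(has_header, HEADERS))
print(results)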

Could you run that same project but with cc.compiles('#include <@0@>'.format(header)) instead of cc.has_header(), please? That could give some useful data for issue #2246.

  • Same thread: 0m1.401s.
  • ThreadPoolExecutor(1): 0m1.403s.
  • ThreadPoolExecutor(2): 0m1.008s.
  • ThreadPoolExecutor(3): 0m0.926s.

The problem here is that the result of most compiler checks is used immediately afterwards, so there are not many pending checks in flight at the same time. Parallelizing the potential new functions described by xclaesse seems easier (though you probably can't tell for sure without measuring).

For reference, what I measured is an implementation of get_supported_headers.

Regarding whether parallelization is worth it, I am not sure. Since the checks are performed in an imperative fashion, there do not tend to be many checks available to run in parallel as a batch.

What is a good example of a meson project where the configure phase takes a long time? Maybe glib? I'd be interested to get a sense of how long meson setup of such a project takes.

tp-m commented

gst-build with --wrap-mode=forcefallback perhaps, but glib is also good on its own probably.

hacked in parallelism using concurrent.futures
...
Regarding whether parallelization is worth it, I am not sure. Since the checks are performed in an imperative fashion, there do not tend to be many checks available to run in parallel as a batch.

concurrent.futures + ThreadPoolExecutor is going to have a huge overhead, so you will see some disappointing numbers.

The last time I investigated this, await + async subprocess seemed like it would yield much better outcomes because it's single-threaded in the interpreter, and allows us to spawn several processes at once and aggregate their results before returning from get_supported_headers().
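
To make that concrete, here is a minimal sketch of what an await + async subprocess get_supported_headers() could look like (the compiler command is an assumption, asyncio.run needs Python 3.7+, and a real implementation would cap the number of processes in flight):

import asyncio

async def check_header(compiler, header):
    # One compiler process per header; returncode 0 means the header exists.
    proc = await asyncio.create_subprocess_exec(
        compiler, '-E', '-x', 'c', '-', '-o', '/dev/null',
        stdin=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL)
    await proc.communicate('#include <{}>\n'.format(header).encode())
    return header, proc.returncode == 0

async def get_supported_headers(compiler, headers):
    # All checks are spawned before any result is awaited.
    results = await asyncio.gather(*(check_header(compiler, h) for h in headers))
    return [h for h, ok in results if ok]

print(asyncio.run(get_supported_headers('cc', ['stdio.h', 'dlfcn.h', 'nope.h'])))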

I would recommend making an artificial test case that runs has_header() on 100 unique headers from /usr/include sequentially, and in parallel.

There is overhead. Perhaps not significant. Result of this gist:

Popen(1): 2.9786059856414795
Popen(2): 2.530482292175293
Popen(3): 2.346256732940674
Thred(1): 4.479013442993164
Thred(2): 3.3195695877075195
Thred(3): 2.823971748352051

Might be significantly larger on slow systems like a Raspberry Pi. I think async/await will also look more elegant when implemented, but that remains to be seen. :)

Result of this gist:

Actually, looking at the gist again, you don't use async await, you use a synchronous wait, so the Popen in the next loop won't be called till the previous ones finish. That's not representative of what I was suggesting because we lose some parallelism there. Not sure how much.

You are right of course. I've hacked the gist to do it correctly (I think). The results:

Popen(1): 2.926682710647583
Popen(2): 2.033327579498291
Popen(3): 2.030749797821045
Thred(1): 4.3277623653411865
Thred(2): 3.157244920730591
Thred(3): 2.8426740169525146

I was actually thinking of the async version of subprocess which returns a coroutine, and then you can await on N of them, where N = num_cpus + 1.

It does something very similar to what you implemented, except that the Python overhead should be lower and you don't need to manually manage things.

But I think your numbers already show that this method should be faster. What do you think?

I've added an asyncio version to the gist.

The asyncio version seems to interfere with the others, so I split each of the versions and run them separately. Maybe it's an fd leak or some SIGCHLD problem.

Also, the results differ between Python 3.6 and Python master, so I include results for both.

Python 3.6:

Popen(1): 2.4133994579315186
Popen(2): 1.6318774223327637
Popen(3): 1.592148780822754
Thred(1): 4.405722618103027
Thred(2): 3.226512908935547
Thred(3): 2.80383038520813
Async(1): 4.152508497238159
Async(2): 4.115453243255615
Async(3): 3.9369168281555176

Python master (05f1c8902c78dce66aed067444e2b973221bae2b):

Popen(1): 2.4290401935577393
Popen(2): 1.6083247661590576
Popen(3): 1.5807161331176758
Thred(1): 3.6656012535095215
Thred(2): 2.5956883430480957
Thred(3): 2.36556339263916
Async(1): 3.67629075050354
Async(2): 3.5747973918914795
Async(3): 3.4514098167419434

Note: I did make sure that concurrency works in the asyncio version (by replacing /bin/true with /bin/sleep 0.1).

These are very interesting results, thanks for taking the time to do all that :)

I guess Popen.wait() is our best option!
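
For reference, keeping a fixed number of Popen checks in flight could look roughly like this (a sketch; the command lists are placeholders for real compiler invocations):

import subprocess
import time

def run_checks(commands, max_jobs=3):
    # Keep up to max_jobs processes in flight and reap them as they finish.
    pending = list(enumerate(commands))
    running = {}
    results = {}
    while pending or running:
        while pending and len(running) < max_jobs:
            i, cmd = pending.pop(0)
            running[i] = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                                          stderr=subprocess.DEVNULL)
        done = [i for i, p in running.items() if p.poll() is not None]
        for i in done:
            results[i] = running.pop(i).returncode
        if not done:
            time.sleep(0.005)  # don't spin while everything is still running
    return results

print(run_checks([['true'], ['false'], ['true']]))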

Scanning through the meson.build for mesa, I see a pile of checks of the form

if cc.compiles('int foo(void) __attribute__((__noreturn__));',
               name : '__attribute__((__noreturn__))')
  pre_args += '-DHAVE_FUNC_ATTRIBUTE_NORETURN'
endif

None of the cc.compiles() calls are dependent on each other; they all have to execute. If meson could do some sort of lazy evaluation thing where the effects of the if are delayed, allowing a bunch of the checks to run in parallel, that would seem to do the trick.

The problem there is that the result of the check is needed immediately after it is started. The interpreter can not proceed until the test is finished, thus the tests can not be interleaved.

We could add compiler-esque reordering of code, but that's way too much overengineering. It's simpler to add new API to do the same job better.

It's simpler to add new API to do the same job better.

Unfortunately, for large projects like mesa that need to be able to build on old distros, requiring new syntax for this sort of optimization means that we either need to bump our base meson version (not that easy to do) or we have to wait, possibly several years, before we can use it.

The problem there is that the result of the check is needed immediately after it is started. The interpreter can not proceed until the test is finished, thus the tests can not be interleaved.

Yes, that's a problem that needs solving.

We could add compiler-esque reordering of code, but that's way too much overengineering.

I don't think that compiling it into an IR and re-ordering instructions is really needed. I gave this a bit of thought and I think there's a far simpler solution that still allows for an in-order interpreter. The idea I had was as follows:

  1. Choose some subset of things that can be handled lazily: string append, list append, and set add would make for a good start
  2. Make the compiler check functions return some sort of token (I'll call it a LazyConditional for ease of terminology) instead of an actual boolean. The moment token.__bool__() is called, it stalls for the result and gives you the actual bool.
  3. Add some sort of LazyConditionalExpression class which takes a LazyConditional and expresses any one of the expressions in 1. If any method other than a few specific to LazyConditionalExpression is called, it stalls, evaluates, and then behaves exactly like the thing it evaluated to. (Yay, python and duck typing let us do this pretty easily.)
  4. When you see an if whose condition is a LazyExpression, you speculatively enter the if. If you ever encounter anything inside the if which cannot be lazily evaluated, you stall for the actual expression and bail out if it's false. If you encounter something which can be lazily evaluated, you create a LazyConditionalExpression in place of the actual expression and continue on.

With the above scheme, most of mesa's configuration as well as the header example from glib would just automatically be parallelized without any change to the meson syntax. Without knowing anything about the meson code-base, it also doesn't seem like it'd be all that much code. Just a thought, take it or leave it. 😄
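
As a rough illustration of step 2 above, a LazyConditional could look something like this in Python, assuming the check is submitted to a thread pool the moment it is encountered (the compiler command is a stand-in):

from concurrent.futures import ThreadPoolExecutor
import subprocess

class LazyConditional:
    # Behaves like a bool, but only stalls for the check result when the
    # interpreter actually calls __bool__() on it.
    def __init__(self, future):
        self._future = future

    def __bool__(self):
        return self._future.result()

pool = ThreadPoolExecutor()

def lazy_has_header(header):
    def check():
        proc = subprocess.run(['cc', '-E', '-x', 'c', '-', '-o', '/dev/null'],
                              input='#include <{}>\n'.format(header).encode(),
                              stderr=subprocess.DEVNULL)
        return proc.returncode == 0
    return LazyConditional(pool.submit(check))

# Both checks are already in flight before either `if` forces a result.
c1 = lazy_has_header('stdio.h')
c2 = lazy_has_header('no_such_header.h')
if c1:
    print('have stdio.h')
if c2:
    print('have no_such_header.h')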

I was wondering if we could have threads that jump straight to any compile/link/run checks, regardless of any conditionals, and if all args are literals, execute them to populate the cache we already have. Then when the real interpreter arrives at that place, the result will already be in the cache.

It could do some simple variable handling; it does not have to be perfect. In the worst case it does useless compilations and the good check gets a cache miss.
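
A rough sketch of that prefetching idea, assuming a shared Future-based cache keyed on the literal check arguments (all names here are hypothetical):

from concurrent.futures import ThreadPoolExecutor
import subprocess

pool = ThreadPoolExecutor()
check_cache = {}  # (kind, args) -> Future, shared by prefetcher and interpreter

def run_check(key):
    kind, header = key
    proc = subprocess.run(['cc', '-E', '-x', 'c', '-', '-o', '/dev/null'],
                          input='#include <{}>\n'.format(header).encode(),
                          stderr=subprocess.DEVNULL)
    return proc.returncode == 0

def prefetch(keys):
    # The scanner: launch every literal-argument check it finds, up front.
    for key in keys:
        if key not in check_cache:
            check_cache[key] = pool.submit(run_check, key)

def has_header(header):
    # The real interpreter: a cache hit if the prefetcher got here first.
    key = ('has_header', header)
    if key not in check_cache:
        check_cache[key] = pool.submit(run_check, key)  # cache miss: run it now
    return check_cache[key].result()

prefetch([('has_header', 'stdio.h'), ('has_header', 'dlfcn.h')])
print(has_header('stdio.h'))  # usually already in flight by now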

@jekstrand's proposal sounds remarkably like asyncio.Future (or coroutines, if you prefer async/await).

I think some of mesa's slowness could be solved via me finishing #2393, and then maybe we could add caching for them, since they're purely compiler dependent.

We benchmarked asyncio/coroutines above and found that the overhead is far too high. subprocess.Popen(), which runs the process asynchronously, had the lowest overhead.

Coroutines work better the more of them you use, by their nature. Setting off 4 coroutines and then waiting on them is going to be slow; that's what thread pools are for.

The most expensive things I see in mesa are:

  1. checking for builtin functions
  2. checking for gcc attributes

I think we can solve some of those with helper functions in meson, which could be cached. The same build of GCC will have the same set of builtin functions and __attribute__ support every time.

I see that cc.get_supported_arguments() checks whether the given arguments are supported one by one. But it seems like gcc and clang both emit warnings for all unsupported arguments given.

So running clang/gcc twice (or thrice?) in total is enough to find all supported arguments, and the error/warnings can be parsed to get the unsupported arguments.

E.g.:

gcc hello.c -Wcast-align -Wdeclaration-after-statement -Werror=address\
-Werror=array-bounds -Werror=empty-body -Werror=parenthesis -Wformat-nonliteral -mwindows

prints

gcc: error: unrecognized command line option ‘-mwindows’; did you mean ‘-m3dnow’?

Now, running gcc again, after removing the unsupported argument:

gcc hello.c -Wcast-align -Wdeclaration-after-statement -Werror=address\
-Werror=array-bounds -Werror=empty-body -Werror=parenthesis -Wformat-nonliteral

This gives:

cc1: error: -Werror=parenthesis: no option -Wparenthesis
cc1: warning: -Wformat-nonliteral ignored without -Wformat [-Wformat-nonliteral]

That's it. No matter how many arguments are to be tested, 2-3 invocations are enough to find the [un]supported arguments.

I don't know if there is a reason for not using this; maybe there is something I don't know. But anyway, this seems to work for clang and gcc, at least for my cases.
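
A minimal sketch of that approach, assuming gcc's "unrecognized command line option" wording (clang's "unknown argument" diagnostic would need its own pattern, and a real version would also have to parse secondary forms like the "no option -Wparenthesis" error from the second run above):

import re
import subprocess

UNRECOGNIZED = re.compile(
    "unrecognized command[- ]line option [`'\u2018]([^'\u2019]+)['\u2019]")

def get_supported_arguments(candidates, compiler='gcc'):
    # Compile with all remaining candidates, drop whatever the compiler
    # reports as unrecognized, and repeat until the diagnostics are clean.
    remaining = list(candidates)
    while True:
        proc = subprocess.run(
            [compiler, '-c', '-x', 'c', '/dev/null', '-o', '/dev/null'] + remaining,
            capture_output=True, text=True)
        bad = set(UNRECOGNIZED.findall(proc.stderr)) & set(remaining)
        if not bad:
            return remaining
        remaining = [a for a in remaining if a not in bad]

print(get_supported_arguments(['-Wcast-align', '-mwindows']))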

Interesting that I found this thread.

We use meson now in many projects. EFL now has a meson build. On a very fast desktop it takes 9.7 seconds, and that's after the patches from Marcel that speed up the end phase (a result of him doing the hard yards of getting EFL a meson build). Meson still spends an eternity at the end thinking about stuff, though.

Parallelization would really be helpful. Almost all of our checks could be parallelized, as they just check whether some lib or dependency or header exists and then set a variable to true/false (or append to a string).

The same meson configure stage above on a Rpi3 takes a LOT longer: 224.4 sec. I also have a high-core-count ARM workstation (Thunder X2 - 256 cores) which actually struggles to keep up with the i7-7700k that does the above 9.7 seconds because it gets hung in single threaded parts of meson for far far far longer. The workstation catches up to the i7 quickly when it gets to compiling, but is held back by meson being single threaded.

So we'd absolutely love some parallelization, even if it is new features we have to use and thus requires a new meson version. Perhaps something as simple as tagged blocks that you can wait on later, just conceptually:

cpu_mmx = false
bob = async
  if compiler.check_header('immintrin.h') == true
    cpu_mmx = true
  endif
endasync
...
bob.wait()
if cpu_mmx == true
  config_h.set10('BUILD_MMX', true)
endif

The more async blocks we create, the more meson can queue in parallel in a work queue + thread/pipe+child exe pool, and then only the waits cause stalls.

Just an idea. Perhaps there are other ways. But just saying that this kind of thing would be massively useful and considerably improve build times. A cached build of EFL on the Rpi3 is spending ~22% of its time just in the meson configure stage right now (out of the whole configure, build and install). That could easily be halved if not cut down to 1/3 the time, I think, so it would cut total build time by 15% ... and that wouldn't be bad. On the ARM workstation I imagine the win would be even more dramatic, possibly cutting down 20% of build time.

Whatever approach ends up taken, changing the Meson language to expose asyncness or something similar to the end user will not be accepted. The syntax must be as declarative as possible.

As an alternative approach we could have something like a batch lookup function:

results = cc.has_headers('foo.h', 'bar.h', 'baz.h', ...)
if results['foo.h']
   do_something()
endif
# etc

+1 for a batch API that returns a dict. One easy case is parallelizing get_supported_arguments().

Could be get_supported_headers() maybe.

I would say that a wait blah is about as declarative as if and for. It's saying that whatever is inside the wait is dependent on blah, thus indicating a dependency. Explicitly creating N threads where each is procedural would not be declarative.

tp-m commented

Arguably this kind of thing could (and should) be done in Meson itself by building a dependency tree or such in the parser/interpreter instead of processing things on a line-by-line basis, rather than adding something to the language syntax?

If meson did that then we'd be super happy, as it'd be zero changes to our build files... but I got the gist that that was going to be far too much involved work.

@tp-m I agree, but that sounds extremely difficult to do, no?

@xclaesse Perhaps, but I don't think it's intractable. You don't need a full compiler IR with code re-ordering; you just need a good async framework which kicks off the task immediately but doesn't wait for it until the result is actually needed. As long as such a framework is capable of handling the common cases of looking for a dozen headers or checking a bunch of flags, it'd already be a significant improvement.

@jekstrand sadly that would be too easy. Take for example this code in glib:

foreach h : headers
  if cc.has_header(h)
    define = 'HAVE_' + h.underscorify().to_upper()
    glib_conf.set(define, 1)
    glib_conf_prefix = glib_conf_prefix + '#define @0@ 1\n'.format(define)
  endif
endforeach

it won't be able to parallelize that code, because it must have the result for each iteration.

IMHO, by far the easiest option is to rewrite that kind of code to

foreach h : cc.get_supported_headers(headers)
    define = 'HAVE_' + h.underscorify().to_upper()
    glib_conf.set(define, 1)
    glib_conf_prefix = glib_conf_prefix + '#define @0@ 1\n'.format(define)
endforeach

If the glib_conf set could handle a lazy conditional add (see description above) and only resolve the condition later when the actual set is needed, there wouldn't be a problem. (See my comment on Sep 3.) Admittedly, it's probably more complicated than I made it sound but I don't think it's intractable.

Yeah, so it quickly becomes much more complicated. Don't get me wrong, I agree that would be the ideal solution, but since that requires much more work, I fear it won't happen. I think a few batch APIs is easier, and could solve most of the performance issues. It is also nicer syntax in my example above.

A low-hanging fruit would be to parallelize get_supported_arguments(), which already exists as a batch API.

tp-m commented

It wouldn't be able to parallelise that loop, but it might be able to parallelise other things from after the loop already in that case.

Edit: actually, it would be able to parallelise the checks here I think, just not the variable updates.

How about:

bob = async
  something async...
endasync
...
wait bob
  do something that relies on bob
  bob2 = async
    something else...
  endasync
endwait

wait bob2
  do something that relies on results of bob2
endwait

It's pretty declarative and clear what depends on what. It expresses dependencies: logic block B depends on the work done in logic block A. The dependency tree can be resolved and put on work queues. It allows for any arbitrary amount of things to be done in parallel. Long term it would end up less work than the long tail of trying to make batch APIs for more and more things.

I'm really nervous about adding async semantics to the meson build files, particularly because at that point we're going to allow users to write race conditions in their code; one of the best things about meson IMHO is that it forces you to write correct code due to its simplicity. Beyond that, writing any kind of async is hard; humans are really, really bad at doing work asynchronously. Personally, I'd much rather scope and consider implementing async internally than rely on users doing it right. This fits with meson's philosophy of "do it right once in core meson."

I'm also concerned about the long-term maintenance burden of having both a serial and a batch interface for everything (and we would end up needing one for everything, since some project would need it).

While writing and maintaining internally asynchronous calls might be more complicated than batch APIs, it would have several advantages:

  • It would be an implementation detail
  • It would make every meson build system faster
  • It wouldn't make the API more complicated

I think those 3 alone are reasons to consider what it would take to use some kind of async internally.

The problem with trying to do the async completely hidden is the one mentioned above: the default, natural way of writing Meson build definitions is to first check and immediately use the result. In order to write fast checks you need to "know" not to do that, and instead store the intermediate results and only use any of them after everything has been submitted. A batch API is nicer in that it is immediately readable, and it might even be less code than an incremental implementation of the same thing in some cases.

A low-hanging fruit would be to parallelize get_supported_arguments(), which already exists as a batch API.

Yes. Any volunteers? 😁

The problem with trying to do the async completely hidden is the one mentioned above: the default, natural way of writing Meson build definitions is to first check and immediately use the result.

The LazyConditional approach I suggested on this thread back in September would, I believe, solve this problem. It's entirely possible I'm wrong and things are far more complicated than I think and it's intractable in reality. However, I have yet to see a single comment that actually tells me why I'm wrong.

However, I have yet to see a single comment that actually tells me why I'm wrong.

Let me quote myself from the previous comment:

the default, natural way of writing Meson build definitions is to first check and immediately use the result

That means that there is at most one background check running and the interpreter stalls almost immediately for its result since it is used almost immediately after submission. We would want to have many checks ongoing at the same time to get good parallelism.

That being said if we did transparent async sort of stuff, something like that is how it would be implemented.

Add usual caveat here about making performance assumptions without having personally measured it.

I mean, the most common case I can think of for checks is:

if cc.has_header('foo')
  add_project_arguments('-DHAVE_FOO_H', language : ['c'])
endif

or

if cc.has_header('foo')
  conf.set('HAVE_FOO_H', 1)
endif

You could easily parallelize either of these using a Promise/Future interface; each caller just has to know that they're getting a Promise and wait.

Since I'm convinced this will work, and no one else seems to be, I'll hack something up so we can evaluate.

How would the second example benefit from async? You can't do anything in the if statement check until the result is actually there. That might even be slower than not doing async, because of the setup needed for the async work.

I mean if you do this, then the async thing would work and improve things:

c1 = cc.has_header('c1.h')
c2 = cc.has_header('c2.h')
c3 = cc.has_header('c3.h')
c4 = cc.has_header('c4.h')

if c1
  # set something
endif
if c2
  # set something
endif
if c3
  # set something
endif
if c4
  # set something
endif

But this one wouldn't:

if cc.has_header('c1.h')
  # set something
endif
if cc.has_header('c2.h')
  # set something
endif
if cc.has_header('c3.h')
  # set something
endif
if cc.has_header('c4.h')
  # set something
endif

the default, natural way of writing Meson build definitions is to first check and immediately use the result

That means that there is at most one background check running and the interpreter stalls almost immediately for its result since it is used almost immediately after submission. We would want to have many checks ongoing at the same time to get good parallelism.

Which is why I said

  4. When you see an if whose condition is a LazyExpression, you speculatively enter the if. If you ever encounter anything inside the if which cannot be lazily evaluated, you stall for the actual expression and bail out if it's false. If you encounter something which can be lazily evaluated, you create a LazyConditionalExpression in place of the actual expression and continue on.

A single level of speculative execution should be sufficient to gobble up 95% of the loops that people are likely to write.

I have a commitment coming up shortly, so I'll be offline for a bit, but I'm hacking something up currently.

That massively complicates the implementation of an interpreter.

It's entirely possible that the interpreter changes would be too deep. As someone who's never worked on the meson interpreter, I'm not really in a position to know what the changes would look like. I was mostly trying to suggest one potential solution (in theory) for how it could be done without changing meson syntax.

The problem comes from programs like these, which we should be able to handle:

if cc.has_header('foo.h')
  has = 1
else
  has = 0
endif

#do something with has

When you get to the statement following the if block you need to either:

  • track all possible variables that could have changed in any block, see if any of them are used in the current statement and stall for the results
  • unconditionally wait for all pending async operations to finish and get their results

The former is hideously complicated to implement reliably, while the latter means that you cannot launch more than one async operation per if block (and you usually need the result of the check inside the block anyway).

Right, I wasn't really considering trying to accelerate if statements with a non-trivial else. You would either have to speculatively execute both or have an IR and do transformations. Since you don't want to add an IR (I don't blame you), we can throw that one out. In the example you gave above, any speculative execution scheme is likely going to stall the moment has is used, unless the interpreter goes to extreme heroics with conditional values. That said, how common is that case? I genuinely don't know. Most of the examples that have been given, and most of what I've seen in mesa, only have the if.

On the one hand, there's the pragmatic "accelerate what we can" which says that if 95% of if statements are one-level with no else, then we've accelerated 95% of if statements. On the other hand, no one likes a performance cliff, and someone is going to get grumpy if changing the way they write their meson a small amount makes giant differences in whether or not it can be parallelized.

For #5418 I made a demo using asyncio-subprocess, actually running the compiler.
https://github.com/scivision/asyncio-subprocess-examples

I haven't yet reread the excellent discussion above on this issue to make the benchmark more realistic, but in contrast to the gist, this test actually runs the compiler.

I've updated the example to compile 15 tiny test files -- like what get_supported_arguments() and similar would do internally in Meson.

https://github.com/scivision/asyncio-subprocess-examples?tab=readme-ov-file#benchmarks

The example compares asyncio, ThreadPoolExecutor, and ProcessPoolExecutor, with asyncio giving a ~20x speedup on a 32-core Linux workstation and a ~4x speedup on an 8-core Apple Silicon laptop.
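
For context, the bounded-concurrency pattern such a benchmark exercises looks roughly like this (a sketch, not the repo's actual code; the compiler name and source are placeholders):

import asyncio
import os

async def compile_one(sem, compiler, source):
    # The semaphore caps how many compiler processes run at once.
    async with sem:
        proc = await asyncio.create_subprocess_exec(
            compiler, '-c', '-x', 'c', '-', '-o', os.devnull,
            stdin=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.DEVNULL)
        await proc.communicate(source.encode())
        return proc.returncode == 0

async def compile_all(compiler, sources, jobs=None):
    sem = asyncio.Semaphore(jobs or os.cpu_count())
    return await asyncio.gather(*(compile_one(sem, compiler, s) for s in sources))

print(asyncio.run(compile_all('cc', ['int main(void) { return 0; }'] * 15)))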