Each function to have its own file in order to speed up compilation
Right now, especially for large projects (I tested with teal, which is currently 11012 lines), the generated C source files are massive (the source file for teal was nearly 600k lines long), so compilation is very sluggish (see figure 1). Build systems such as Make can compile files in parallel, so if each function got its own file (possibly behind a flag), that parallelism could be utilised.
real 28.69
user 24.49
sys 2.06
959619072 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
509884 page reclaims
3427 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
120 voluntary context switches
63026 involuntary context switches
693110188 instructions retired
377931574 cycles elapsed
6746112 peak memory footprint
Figure 1: the time it takes to compile teal using clang on an x86_64 macOS system with an Intel i7-7920HQ 8 core CPU at 3.10GHz
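To make the proposal concrete, here is a minimal sketch of how per-function C files could be handed to Make for a parallel build. It is not how LuaAOT works today: the file names (`tl_f0001.c` and so on), the `render_makefile` helper, and the default flags are all assumptions for illustration, and the link step is left out on purpose. The only Make features used are variables and a GNU Make pattern rule.

```python
from pathlib import Path


def render_makefile(c_files, cc="cc", cflags="-O0"):
    """Return Makefile text with one object target per generated C file.

    The file names and flags are hypothetical; nothing here is part of LuaAOT.
    """
    objs = [str(Path(f).with_suffix(".o")) for f in c_files]
    lines = [
        f"CC = {cc}",
        f"CFLAGS = {cflags}",
        "OBJS = " + " ".join(objs),
        "",
        ".PHONY: all",
        "all: $(OBJS)",
        "",
        "# Each object depends only on its own C file, so the objects are",
        "# independent jobs that `make -j` may build concurrently.",
        "%.o: %.c",
        "\t$(CC) $(CFLAGS) -c $< -o $@",  # recipe lines must start with a tab
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical output of a "one file per function" mode.
    print(render_makefile(["tl_f0001.c", "tl_f0002.c", "tl_main.c"]))
```

Saving the output as `Makefile` and running `make -j8` would let Make schedule up to eight compiler processes at once. Whether that actually shortens the wall-clock time depends on the compiler and the flags.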
Hmm, that's a cool idea I hadn't thought about. (Although I suppose it'll only help in recompilations)
@hugomg I implemented a WIP version on my fork of the project. The benefits are immense: compilation speeds up, especially when using Make's jobs. If you think this is a good feature, I will create an implementation from scratch for Lua-aot.
Interesting! So if I understand correctly this is mostly to exploit more parallelism, right? That is, this isn't about speeding up recompilations by skipping over functions that did not change?
Not only that, but it's also extremely useful for initial compilations, because those can be parallelised.
How does it work during recompilations? If I were to edit a single function in the Lua source file, how would LuaAOT know that it should not recreate the C files for the other functions?
It does recreate them, but Make can tell that the files are the same, so it only recompiles the file that changed (and also the main generated file, which contains the source array).
Oh, how? I assumed it would only look at the timestamps.
Oh, you are correct. I have only been testing first compiles (as this was my main use case, to speed up this type of compile). That is a good feature to implement; I'll look into it further.
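Since Make decides what to rebuild from timestamps, one common way to get the behaviour discussed here is for the generator to leave a file alone when its new content is byte-for-byte identical to what is already on disk. The sketch below shows that idea; `write_if_changed` is a hypothetical helper, not an existing LuaAOT function, and it assumes the generated sources fit comfortably in memory.

```python
import os
from pathlib import Path


def write_if_changed(path, content):
    """Write `content` to `path` only if it differs from the file on disk.

    Returns True if the file was (re)written, False if it was left untouched.
    Leaving it untouched keeps the old modification time, so a timestamp-based
    build tool such as Make sees no reason to recompile it.
    """
    target = Path(path)
    data = content.encode("utf-8")
    try:
        if target.read_bytes() == data:
            return False  # identical content: keep the existing mtime
    except FileNotFoundError:
        pass  # first generation: fall through and create the file
    # Write to a sibling temp file and rename it into place, so an
    # interrupted run never leaves a half-written C file behind.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, target)
    return True
```

With this in place, editing one Lua function would only touch the C file for that function (plus any file whose content really changed), and `make` would recompile just those.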
Nevertheless, just the parallelization would be super cool. How much faster did it get when compiling Teal?
Let me get the actual numbers, but it is much, much faster...
...with the downside that, for some reason, teal, alone among the Lua modules I have tried, fails with a "syntax error in module"
This might be a lua-aot issue? I just tried it with a commit from master and it also failed.
Parallelised (this is tl.lua + argparse.lua + tl (the teal CLI thing)):
real 35.88
user 181.62
sys 27.32
360148992 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
8005054 page reclaims
17099 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
381 signals received
792 voluntary context switches
267447 involuntary context switches
38190257 instructions retired
40020550 cycles elapsed
593920 peak memory footprint
and then standard:
real 26.24
user 30.35
sys 1.97
977633280 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
628716 page reclaims
4390 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
9 signals received
67 voluntary context switches
36856 involuntary context switches
36556234 instructions retired
38082170 cycles elapsed
569344 peak memory footprint
That was with -O0, but with -Os and -flto:
Parallelised:
real 113.44
user 391.85
sys 36.62
973549568 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
10396353 page reclaims
7057 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
402 signals received
728 voluntary context switches
418147 involuntary context switches
36537154 instructions retired
43363504 cycles elapsed
557056 peak memory footprint
Standard:
real 184.18
user 187.13
sys 7.34
1475518464 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1498323 page reclaims
2391 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
9 signals received
277 voluntary context switches
373720 involuntary context switches
41433463 instructions retired
45485784 cycles elapsed
565248 peak memory footprint
This is with Clang. I will test with GCC soon.
GCC -O0:
Parallelised:
real 91.97
user 420.87
sys 89.62
834879488 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
17945692 page reclaims
25523 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
466 signals received
4898 voluntary context switches
894652 involuntary context switches
39600775 instructions retired
49697973 cycles elapsed
614400 peak memory footprint
Standard:
real 78.96
user 81.82
sys 7.20
1830170624 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
2132645 page reclaims
14957 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
9 signals received
480 voluntary context switches
189375 involuntary context switches
39587389 instructions retired
44594929 cycles elapsed
548864 peak memory footprint
GCC -Os -flto:
Parallelised:
real 112.70
user 417.72
sys 81.44
1438064640 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
17689835 page reclaims
8252 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
468 signals received
5174 voluntary context switches
739662 involuntary context switches
38432377 instructions retired
42785145 cycles elapsed
548864 peak memory footprint
Standard:
real 225.54
user 215.98
sys 19.25
1434435584 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
5233250 page reclaims
6558 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
9 signals received
5133 voluntary context switches
511935 involuntary context switches
39483306 instructions retired
40934079 cycles elapsed
581632 peak memory footprint
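To compare the runs at a glance, the wall-clock ("real") times above can be turned into speedup ratios. The short script below is only a reading aid: the numbers are copied from the timings in this thread, the second Clang -Os -flto run is taken to be the standard (single-file) build, and the `speedup` and `summarize` names are made up for this example.

```python
# Wall-clock ("real") seconds copied from the timings posted in this thread.
TIMINGS = {
    "clang -O0": {"parallel": 35.88, "standard": 26.24},
    # Assumption: the second clang -Os -flto run is the standard build.
    "clang -Os -flto": {"parallel": 113.44, "standard": 184.18},
    "gcc -O0": {"parallel": 91.97, "standard": 78.96},
    "gcc -Os -flto": {"parallel": 112.70, "standard": 225.54},
}


def speedup(standard, parallel):
    """Ratio above 1 means the parallel build finished sooner."""
    return standard / parallel


def summarize(timings):
    """Map each configuration name to its wall-clock speedup ratio."""
    return {
        name: speedup(t["standard"], t["parallel"])
        for name, t in timings.items()
    }


if __name__ == "__main__":
    for name, ratio in summarize(TIMINGS).items():
        print(f"{name:16s} {ratio:.2f}x")
```

With these figures the ratio comes out below 1 for both -O0 runs (the parallel build was slower in wall-clock terms) and above 1 for both -Os -flto runs, so the benefit seems to depend heavily on how much work the compiler does per file.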