hugomg/lua-aot-5.4

Each function to have its own file in order to speed up compilation

Right now, especially for large projects (I tested with Teal, which is currently 11012 lines), the generated C source files are massive (the source file for Teal was nearly 600k lines long), so compilation is very sluggish (see Figure 1). Build systems such as Make can compile files in parallel, so if each function got its own file (possibly behind a flag), that parallelism could be utilised.

real 28.69
user 24.49
sys 2.06
          959619072  maximum resident set size
                  0  average shared memory size
                  0  average unshared data size
                  0  average unshared stack size
             509884  page reclaims
               3427  page faults
                  0  swaps
                  0  block input operations
                  0  block output operations
                  0  messages sent
                  0  messages received
                  0  signals received
                120  voluntary context switches
              63026  involuntary context switches
          693110188  instructions retired
          377931574  cycles elapsed
            6746112  peak memory footprint

Figure 1: the time it takes to compile Teal using clang on an x86_64 macOS system with an Intel i7-7920HQ 8 core CPU at 3.10GHz
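
To make the proposal concrete, here is a minimal Python sketch of what a parallel build over per-function files amounts to (in practice `make -j` would do this job). The file names, the default `cc` compiler name, and the `-O0` flag are illustrative assumptions, not anything lua-aot currently emits or requires.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def compile_command(source, cc="cc", flags=("-O0",)):
    """Build the argv that compiles one C file to an object file next to it."""
    source = Path(source)
    return [cc, *flags, "-c", str(source), "-o", str(source.with_suffix(".o"))]


def compile_all(sources, jobs=8, cc="cc", flags=("-O0",)):
    """Compile every source file, running at most `jobs` compilers at once."""
    commands = [compile_command(s, cc, flags) for s in sources]
    # Threads are enough here: each worker only waits on a compiler process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() forces every result, so a failed compile raises here.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return [cmd[-1] for cmd in commands]  # paths of the object files
```

With one large generated file there is only one command to run, so `jobs` has nothing to spread across cores; with one file per function, up to `jobs` compiler processes can run at the same time.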

Hmm, that's a cool idea I hadn't thought about. (Although I suppose it'll only help in recompilations)

@hugomg I implemented a WIP version on my fork of the project. The benefits are immense: compilation is much faster, especially when using Make's jobs. If you think this is a good feature, I will create an implementation from scratch for lua-aot.

Interesting! So if I understand correctly this is mostly to exploit more parallelism, right? That is, this isn't about speeding up recompilations by skipping over functions that did not change?

Not only that, but it's also extremely useful for initial compilations, because they can be parallelised.

How does it work during recompilations? If I were to edit a single function in the Lua source file, how would LuaAOT know that it should not recreate the C files for the other functions?

It does recreate them, but Make can tell that the files are the same, so it only recompiles the file that changed (and also the main generated file, which holds the source array).

Oh, how? I assumed it would only look at the timestamps.

Oh, you are correct. I have only been testing first compiles (that was my main use case: speeding up this type of compile). That is a good feature to implement; I'll look into it further.
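
Since Make compares timestamps, one common way to get the incremental behaviour discussed above is for the generator to leave a file alone when its new contents are identical to what is already on disk. The sketch below shows that idea in Python; `write_if_changed` is a hypothetical helper name, not something lua-aot provides.

```python
from pathlib import Path


def write_if_changed(path, content):
    """Write `content` to `path` only when it differs from what is on disk.

    Returns True if the file was (re)written, False if it was left untouched.
    """
    path = Path(path)
    data = content.encode("utf-8")
    if path.exists() and path.read_bytes() == data:
        return False  # identical: keep the old mtime so Make skips this file
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

If every per-function C file were written through a helper like this, editing one Lua function would only bump the timestamp of the files whose generated text actually changed, and a timestamp-based build would recompile just those.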

Nevertheless, just the parallelization would be super cool. How much faster did it get when compiling Teal?

Let me get the actual numbers, but it is much, much faster...

...with the downside that, for some reason, Teal, alone among the Lua modules I have tried, fails with "syntax error in module".

This might be a lua-aot issue? I just tried it with a commit from master and it also failed.

Parallelised (this is tl.lua + argparse.lua + tl, the Teal CLI script):

real 35.88
user 181.62
sys 27.32
           360148992  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             8005054  page reclaims
               17099  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                 381  signals received
                 792  voluntary context switches
              267447  involuntary context switches
            38190257  instructions retired
            40020550  cycles elapsed
              593920  peak memory footprint

And then standard:

real 26.24
user 30.35
sys 1.97
           977633280  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              628716  page reclaims
                4390  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   9  signals received
                  67  voluntary context switches
               36856  involuntary context switches
            36556234  instructions retired
            38082170  cycles elapsed
              569344  peak memory footprint

That was with -O0, but with -Os and -flto:

Parallelised:

real 113.44
user 391.85
sys 36.62
           973549568  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            10396353  page reclaims
                7057  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                 402  signals received
                 728  voluntary context switches
              418147  involuntary context switches
            36537154  instructions retired
            43363504  cycles elapsed
              557056  peak memory footprint

Standard:

real 184.18
user 187.13
sys 7.34
          1475518464  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1498323  page reclaims
                2391  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   9  signals received
                 277  voluntary context switches
              373720  involuntary context switches
            41433463  instructions retired
            45485784  cycles elapsed
              565248  peak memory footprint

This is with Clang; I will test with GCC soon.

GCC -O0:

Parallelised:

real 91.97
user 420.87
sys 89.62
           834879488  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            17945692  page reclaims
               25523  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                 466  signals received
                4898  voluntary context switches
              894652  involuntary context switches
            39600775  instructions retired
            49697973  cycles elapsed
              614400  peak memory footprint

Standard:

real 78.96
user 81.82
sys 7.20
          1830170624  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             2132645  page reclaims
               14957  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   9  signals received
                 480  voluntary context switches
              189375  involuntary context switches
            39587389  instructions retired
            44594929  cycles elapsed
              548864  peak memory footprint

GCC -Os -flto:

Parallelised:

real 112.70
user 417.72
sys 81.44
          1438064640  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
            17689835  page reclaims
                8252  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                 468  signals received
                5174  voluntary context switches
              739662  involuntary context switches
            38432377  instructions retired
            42785145  cycles elapsed
              548864  peak memory footprint

Standard:

real 225.54
user 215.98
sys 19.25
          1434435584  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             5233250  page reclaims
                6558  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   9  signals received
                5133  voluntary context switches
              511935  involuntary context switches
            39483306  instructions retired
            40934079  cycles elapsed
              581632  peak memory footprint
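
To compare the runs at a glance, the wall-clock ("real") times reported above can be reduced to speedup ratios. The short Python sketch below only copies those numbers; it assumes that the second Clang -Os -flto block (real 184.18) is the standard, non-parallelised build, which matches the pattern of the other measurements.

```python
# Wall-clock ("real") seconds copied from the runs reported above.
# Each entry is (parallelised, standard).
REAL_SECONDS = {
    "Clang -O0": (35.88, 26.24),
    "Clang -Os -flto": (113.44, 184.18),  # 184.18 assumed to be the standard run
    "GCC -O0": (91.97, 78.96),
    "GCC -Os -flto": (112.70, 225.54),
}


def speedup(parallel, standard):
    """Ratio > 1 means the per-function (parallelised) build finished sooner."""
    return standard / parallel


for config, (parallel, standard) in REAL_SECONDS.items():
    print(f"{config:<16} {speedup(parallel, standard):.2f}x")
```

On these figures the split build is slower at -O0 with both compilers (about 0.73x with Clang and 0.86x with GCC) and faster with -Os -flto (about 1.62x with Clang and 2.00x with GCC).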