shader-slang/slang

Speed up core library loading, bring compilation time of small files down to <10ms mark

Opened this issue · 9 comments

Description:

The core module takes a long time to load, which in turn slows down every CLI operation. This should be improved.
Compilation of hello.slang with slangc should be around the 1ms mark.

This will require a number of changes in the codebase.


Old description:

> time slangc --help
slangc --help  1.72s user 0.11s system 98% cpu 1.855 total

A simple call takes almost two seconds! Even Chromium starts faster on my machine.
For comparison, glslang:

> time glslang --help
glslang --help  0.00s user 0.00s system 95% cpu 0.006 total

ltrace shows that it is stuck in slang_createGlobalSession.

strace shows that it calls newfstatat on libslang.so, then openat on slang-core-module.bin, and then there is a very long chain of brk calls, which seems to be where it gets stuck.

It looks like it tries to allocate a lot of memory during startup? Using brk?
Just to open libslang.so?
That shouldn't happen.

slang rev: dbc28b4
os: NixOS 24.11
custom build of slang, if that is relevant.

Thanks for raising this issue. Is it possible to try the prebuilt releases here to see if you can repro the issue? On my machine it takes 0.2 seconds for slangc --help. One difference with the prebuilt binaries is that the core module is embedded into libslang.so itself, so there won't be any filesystem calls.

I'm using the prebuilt binaries. For me slangc takes 1.33s, whereas glslang similarly takes 0.00s. My OS is Arch Linux.

Edit:

The prebuilt binaries are slow, but when I compiled it myself the time dropped to 0.08 seconds, so I'm not sure what is up with the prebuilt binaries. All I did to build was cmake .. followed by make; I didn't change any settings.

Thanks for raising this issue. Is it possible to try the prebuilt releases here to see if you can repro the issue? On my machine it takes 0.2 seconds for slangc --help. One difference with the prebuilt binaries is that the core module is embedded into libslang.so itself, so there won't be any filesystem calls.

Prebuilt is indeed faster.
./slangc --help 0.12s user 0.06s system 99% cpu 0.173 total
But system calls more or less look the same.

So, I found the issue: it seems slangc tries to write lib/slang-core-module.bin during the first launch, which is why that launch takes longer. On my system that write is impossible because the artifacts directory is read-only, so every launch takes a long time.
Why does it work this way?

I could reproduce the issue like this:

› time ./bin/slangc
./bin/slangc  1.90s user 0.13s system 99% cpu 2.045 total
› time ./bin/slangc
./bin/slangc  0.12s user 0.05s system 99% cpu 0.171 total
› rm lib/slang-core-module.bin
removed 'lib/slang-core-module.bin'
› chmod -w lib                
› time ./bin/slangc           
./bin/slangc  1.90s user 0.12s system 99% cpu 2.028 total
› time ./bin/slangc
./bin/slangc  1.90s user 0.13s system 99% cpu 2.034 total
rm lib/slang-core-module.bin

But still, even 100ms is a long time for a shader compiler and it should be improved, especially when you consider that there is no compilation happening.

This would explain it. When I used the prebuilt, I used a folder that I do not have write permissions to, but when I compiled myself, I observed a slow first run but the next runs ran fine.
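
The behavior described above (a slow first run, fast later runs, and a slow run every time when the directory is not writable) can be illustrated with a small sketch. This is not Slang's actual code: `build_core_module`, `load_core_module`, and the `stats` dictionary are hypothetical names, and a path inside a directory that does not exist stands in for the read-only `lib` directory so the demo behaves the same for any user.

```python
import os
import tempfile


def build_core_module() -> bytes:
    """Hypothetical stand-in for the slow step (compiling the core module)."""
    return b"compiled-core-module"


def load_core_module(cache_path: str, stats: dict) -> bytes:
    """Load the cached blob if present; otherwise build it and try to cache it."""
    try:
        with open(cache_path, "rb") as f:
            stats["cache_hits"] += 1
            return f.read()
    except OSError:
        pass  # no usable cache file, fall through to the slow path

    data = build_core_module()
    stats["builds"] += 1
    try:
        with open(cache_path, "wb") as f:
            f.write(data)
    except OSError:
        # The cache cannot be written, so the next launch will rebuild again.
        stats["write_failures"] += 1
    return data


stats = {"cache_hits": 0, "builds": 0, "write_failures": 0}
with tempfile.TemporaryDirectory() as d:
    writable = os.path.join(d, "slang-core-module.bin")
    load_core_module(writable, stats)    # first run: builds and writes the cache
    load_core_module(writable, stats)    # second run: reads the cache

    unwritable = os.path.join(d, "missing", "slang-core-module.bin")
    load_core_module(unwritable, stats)  # builds, but the write fails
    load_core_module(unwritable, stats)  # builds again on every later run

print(stats)
```

With a writable location only the first call pays the build cost; with an unwritable one every call does, which matches the timings reported in this thread.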

I should probably document the behavior around slang-core-module.bin.

When building slang for use from read-only locations or for any production use, we recommend turning on the SLANG_EMBED_CORE_MODULE setting in CMake. This will precompile the core module and embed its serialized binary into the slang library, which avoids building the core module during the first run and writing it out to a file.

The 100ms is the time needed to deserialize the core module. Slang has a much richer and more complex core library than GLSL, so it takes longer to deserialize into memory. Further improving this performance is possible and necessary in the near future since the core module is only going to get bigger. But this requires a deep infrastructural change to support on-demand lazy deserialization, so we probably won't be able to get to it in the very short term. In practice, most engines would want to integrate their shader building workflow in a custom multithreaded build system that calls slang API directly, in which case this initialization cost is amortized over many compiles so it is not a huge issue in those cases.

When building slang for use from read-only locations or for any production use, we recommend turning on the SLANG_EMBED_CORE_MODULE setting in CMake.

Does it make sense to enable it in CMake by default? Contrary to what was said, the prebuilt releases don't have that option enabled; I guess it may be enabled for wasm only. I will open a PR if that is OK.

Further improving this performance is possible and necessary in the near future since the core module is only going to get bigger. But this requires a deep infrastructural change to support on-demand lazy deserialization, so we probably won't be able to get to it in the very short term.

It would be nice to remove the deserialization step entirely, either by parsing and compiling the stdlib on the fly, or by compiling the standard library into the compiler binary so that no runtime parsing is needed. I'm not sure how doable that is. The first approach may need lazy/incremental compilation too. For the second approach, the stdlib could just be implemented in C++ code or something similar. For the first approach I also suggest looking at Naga, as it is one of the fastest shader compilers out there.
With this I think you could drop the lz4 dependency too.

In practice, most engines would want to integrate their shader building workflow in a custom multithreaded build system that calls slang API directly, in which case this initialization cost is amortized over many compiles so it is not a huge issue in those cases.

Oh, that's good to hear. With that, the initial slowdown indeed doesn't matter much.

The binary blob we encode into the slang library is already the parsed and checked AST and pregenerated IR of the core module, so it is already post-parsing. But it is still quite large and takes time to be fully deserialized. Compiling from source will take much longer: almost 2s, like you are experiencing on the first runs.

And hard-coding the generation of AST nodes and IR of core module declarations won't be much faster than just deserializing the binary blob either. I think the right way to make things faster is to make the serialized binary randomly accessible, so we only deserialize an AST or IR node when it is actually used by the shader we are compiling.
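
As a rough illustration of the random-access idea, here is a minimal sketch. It is not Slang's serialization format: the blob layout (a length-prefixed JSON index followed by per-declaration payloads) and the names `serialize_module`, `LazyModule`, and `decoded_count` are all assumptions made for the example. The point is only that the index is read up front while each declaration is decoded the first time it is looked up.

```python
import json
import struct


def serialize_module(decls: dict) -> bytes:
    """Pack declarations as [index length][index JSON][payloads...]."""
    payloads = []
    index = {}
    offset = 0
    for name, decl in decls.items():
        data = json.dumps(decl).encode("utf-8")
        index[name] = (offset, len(data))  # where this payload lives
        payloads.append(data)
        offset += len(data)
    index_bytes = json.dumps(index).encode("utf-8")
    return struct.pack("<I", len(index_bytes)) + index_bytes + b"".join(payloads)


class LazyModule:
    """Reads only the index eagerly; decodes a declaration on first lookup."""

    def __init__(self, blob: bytes):
        (index_len,) = struct.unpack_from("<I", blob, 0)
        self._index = json.loads(blob[4:4 + index_len])
        self._payload = memoryview(blob)[4 + index_len:]
        self._cache = {}
        self.decoded_count = 0  # how many declarations were actually decoded

    def lookup(self, name: str):
        if name not in self._cache:
            offset, length = self._index[name]
            raw = bytes(self._payload[offset:offset + length])
            self._cache[name] = json.loads(raw)
            self.decoded_count += 1
        return self._cache[name]


# A made-up "core module" with 1000 declarations.
core = {f"func{i}": {"kind": "function", "params": i} for i in range(1000)}
module = LazyModule(serialize_module(core))

module.lookup("func42")
module.lookup("func42")  # second lookup is served from the cache
print(module.decoded_count)  # 1
```

In this sketch the index is still parsed in full at load time, so a real design would also need the index itself to be cheap to read (for example, a fixed-layout table that can be searched in place).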

If the prebuilt binary is not embedding the core module, that is a regression and we need to fix it. It was likely introduced when we refactored our CMake scripts.