smol-rs/async-task

Implicit boxing of large futures causes excessive monomorphization


async-task contains an optimisation for handling large futures in the definition of spawn_unchecked. This leads to excessive IR size, as one branch instantiates RawTask with Fut and the other does so with Pin<Box<Fut>>. This probably gets eliminated within LLVM (as the branch itself is trivial), but it's still a bummer that this cannot be truly determined at compile time. I took several stabs recently at getting rid of the unnecessary instantiation, without luck. I do understand why we need the boxing, but it'd be nice to not spend time on generating code we're gonna throw away in LLVM anyways.
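For context, the shape of the code is roughly this (a simplified sketch of the pattern, not the actual spawn_unchecked source; the RawTask stand-in and the 2048-byte threshold are only illustrative):

```rust
use std::future::Future;

// Illustrative stand-in for async-task's RawTask; the real type carries more
// parameters (output, schedule function, metadata).
struct RawTask<Fut>(#[allow(dead_code)] Fut);

impl<Fut: Future> RawTask<Fut> {
    fn allocate(future: Fut) -> Self {
        RawTask(future)
    }
}

fn spawn_sketch<Fut: Future>(future: Fut) {
    // The condition is a compile-time constant, but both arms still get
    // monomorphized: one RawTask<Fut> and one RawTask<Pin<Box<Fut>>>.
    // LLVM only drops the dead arm after the IR has already been emitted.
    if std::mem::size_of::<Fut>() >= 2048 {
        let _boxed = RawTask::allocate(Box::pin(future));
    } else {
        let _inline = RawTask::allocate(future);
    }
}

fn main() {
    spawn_sketch(async {});
}
```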
Getting rid of large-future-boxing reduces the LLVM IR size for my medium-size (~1.5M LLVM IR lines in debug) crate by 7%. This is also replicated in examples from this crate:

| Example name    | LLVM IR lines before | LLVM IR lines after | % of original |
| --------------- | -------------------- | ------------------- | ------------- |
| spawn           | 18276                | 15631               | 85.52%        |
| spawn-local     | 39801                | 34537               | 86.77%        |
| spawn-on-thread | 18667                | 16031               | 85.87%        |
| with-metadata   | 32834                | 24887               | 75.79%        |

Related: rust-lang/rust#85836

Does this affect release builds too? I would be okay with doing something about this if it affects compile times. The crate itself compiles in 0.22s on my machine, but if the issue is monomorphisation in downstream crates, I can see why it would matter.

Ideally, any compile-time improvements would be tested on a larger-scale crate that depends on async-task, such as tide's examples. Before we commit to any new patterns we should benchmark against that.

One possible approach would be to disable the boxing branch under debug_assertions, or behind a specific feature flag. Again, I would like to see benchmarks before I commit to this.
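For concreteness, the gating could look roughly like this, continuing the simplified sketch from above (the feature name "boxed-large-futures" is hypothetical, and whether it should be on by default is exactly what the benchmarks should decide):

```rust
use std::future::Future;

struct RawTask<Fut>(#[allow(dead_code)] Fut);

impl<Fut: Future> RawTask<Fut> {
    fn allocate(future: Fut) -> Self {
        RawTask(future)
    }
}

fn spawn_sketch<Fut: Future>(future: Fut) {
    // With the (hypothetical) feature disabled, #[cfg] removes this arm before
    // type-checking, so RawTask<Pin<Box<Fut>>> is never instantiated at all.
    #[cfg(feature = "boxed-large-futures")]
    {
        if std::mem::size_of::<Fut>() >= 2048 {
            let _boxed = RawTask::allocate(Box::pin(future));
            return;
        }
    }

    let _inline = RawTask::allocate(future);
}

fn main() {
    spawn_sketch(async {});
}
```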

Yeah, this does affect release builds as well. I wrote up a trivial benchmark:

|                                | Debug  | Release |
| ------------------------------ | ------ | ------- |
| Compile time, baseline         | 0.9s   | 2s      |
| Compile time, patched          | 0.65s  | 1.35s   |
| Compile time, % of baseline    | 72%    | 67%     |
| LLVM lines, baseline           | 170310 | 144200  |
| LLVM lines, patched            | 85023  | 71760   |
| LLVM lines, % of baseline      | 50%    | 50%     |

You can increase the number of spawn call sites in the benchmark and the compile time scales linearly, as each line essentially instantiates a new scheduler and future. If the user were to box the scheduler, erasing its type, we would instead scale with the number of distinct return types.
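For illustration, boxing the schedule closure looks roughly like this (a sketch against the plain spawn(future, schedule) API; the placeholder scheduler just runs the task on the spot, which a real executor would not do):

```rust
use async_task::Runnable;

// One concrete schedule type shared by every call site. Without the Box,
// each distinct closure type drags its own copy of async-task's internals
// through monomorphization.
fn boxed_schedule() -> Box<dyn Fn(Runnable) + Send + Sync + 'static> {
    // Placeholder: a real executor would push the runnable onto a run queue.
    Box::new(|runnable: Runnable| {
        runnable.run();
    })
}

fn main() {
    // Two futures with different types, but a single schedule type between them.
    let (runnable_a, task_a) = async_task::spawn(async { 1u32 }, boxed_schedule());
    let (runnable_b, task_b) = async_task::spawn(async { "two" }, boxed_schedule());

    // Hand the runnables to the boxed scheduler; both futures complete on the
    // first poll, so the tasks finish immediately.
    runnable_a.schedule();
    runnable_b.schedule();

    task_a.detach();
    task_b.detach();
}
```

With the erased Box<dyn Fn(Runnable)> schedule, every spawn call site shares one scheduler type, so instantiations scale with the distinct future types only.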
I'm not sure tide is a great benchmark here, as async-task is instantiated only a handful of times in its examples, so the change is unlikely to show up in the timings. I think we could start with an isolated benchmark and go from there.

I would err on the side of disabling the boxing branch with a specific feature flag (or rather, enabling the boxing with a feature flag). That feature could be enabled by default.
Even if the user always boxes their futures, they have to pay the compile-time cost (as then the "boxing branch" tries to instantiate Pin<Box<Pin<Box<F>>>>). In that case being able to disable the implicit boxing altogether would be great for them.
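To make the double boxing concrete (again just a sketch, with a made-up already-boxed future type):

```rust
use std::future::Future;
use std::pin::Pin;

// A user's already type-erased future: two words wide, far below any
// plausible boxing threshold.
type BoxedFut = Pin<Box<dyn Future<Output = u32> + Send + 'static>>;

fn user_future() -> BoxedFut {
    Box::pin(async { 1u32 })
}

fn main() {
    // spawn already sees a Pin<Box<...>>, yet the boxing arm is still
    // monomorphized for Pin<Box<BoxedFut>>, i.e. Pin<Box<Pin<Box<dyn Future>>>>,
    // even though it can never be taken at runtime.
    let schedule = |runnable: async_task::Runnable| {
        runnable.run();
    };
    let (runnable, task) = async_task::spawn(user_future(), schedule);
    runnable.schedule();
    task.detach();
}
```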
I'm pretty positive there are other avenues we could explore to improve compile times further for downstream users. I'm glad that you're open to that. :)