smol-rs/async-task

Implicit boxing of large futures causes excessive monomorphization


async-task contains an optimisation for handling large futures in the definition of spawn_unchecked. This leads to excessive IR size, as one branch instantiates RawTask with Fut and the other does so with Pin<Box<Fut>>. This probably gets eliminated within LLVM (as the branch itself is trivial), but it's still a bummer that this cannot be truly determined at compile time. I took several stabs recently at getting rid of the unnecessary instantiation, without luck. I do understand why we need the boxing, but it'd be nice to not spend time on generating code we're gonna throw away in LLVM anyways.
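For context, the shape of the code is roughly this (a simplified sketch of the pattern, not the actual spawn_unchecked source; the RawTask stand-in and the 2048-byte threshold are only illustrative):

```rust
use std::future::Future;

// Illustrative stand-in for async-task's RawTask; the real type carries more
// parameters (output, schedule function, metadata).
struct RawTask<Fut>(#[allow(dead_code)] Fut);

impl<Fut: Future> RawTask<Fut> {
    fn allocate(future: Fut) -> Self {
        RawTask(future)
    }
}

fn spawn_sketch<Fut: Future>(future: Fut) {
    // The condition is a compile-time constant, but both arms still get
    // monomorphized: one RawTask<Fut> and one RawTask<Pin<Box<Fut>>>.
    // LLVM only drops the dead arm after the IR has already been emitted.
    if std::mem::size_of::<Fut>() >= 2048 {
        let _boxed = RawTask::allocate(Box::pin(future));
    } else {
        let _inline = RawTask::allocate(future);
    }
}

fn main() {
    spawn_sketch(async {});
}
```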
Getting rid of large-future-boxing reduces the LLVM IR size for my medium-size (~1.5M LLVM IR lines in debug) crate by 7%. This is also replicated in examples from this crate:

| Example name    | LLVM IR lines before | LLVM IR lines after | % of original |
| --------------- | -------------------- | ------------------- | ------------- |
| spawn           | 18276                | 15631               | 85.52%        |
| spawn-local     | 39801                | 34537               | 86.77%        |
| spawn-on-thread | 18667                | 16031               | 85.87%        |
| with-metadata   | 32834                | 24887               | 75.79%        |

Related: rust-lang/rust#85836

Does this affect release builds too? I would be okay with doing something about this if it affects compile times. The crate itself compiles in 0.22s on my machine, but if the issue is monomorphisation in downstream crates, I can see why it would matter.

Ideally, any compile-time improvements would be tested on a larger-scale crate that depends on async-task, such as tide's examples. Before we commit to any new patterns we should benchmark against that.

One possible approach would be to disable the boxing branch under debug_assertions, or behind a specific feature flag. Again, I would like to see benchmarks before I commit to this.
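For concreteness, the gating could look roughly like this, continuing the simplified sketch from above (the feature name "boxed-large-futures" is hypothetical, and whether it should be on by default is exactly what the benchmarks should decide):

```rust
use std::future::Future;

struct RawTask<Fut>(#[allow(dead_code)] Fut);

impl<Fut: Future> RawTask<Fut> {
    fn allocate(future: Fut) -> Self {
        RawTask(future)
    }
}

fn spawn_sketch<Fut: Future>(future: Fut) {
    // With the (hypothetical) feature disabled, #[cfg] removes this arm before
    // type-checking, so RawTask<Pin<Box<Fut>>> is never instantiated at all.
    #[cfg(feature = "boxed-large-futures")]
    {
        if std::mem::size_of::<Fut>() >= 2048 {
            let _boxed = RawTask::allocate(Box::pin(future));
            return;
        }
    }

    let _inline = RawTask::allocate(future);
}

fn main() {
    spawn_sketch(async {});
}
```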

Yeah, this does affect release builds as well. I wrote up a trivial benchmark:

|                                | Debug  | Release |
| ------------------------------ | ------ | ------- |
| Compile time, baseline         | 0.9s   | 2s      |
| Compile time, patched          | 0.65s  | 1.35s   |
| Compile time, % of baseline    | 72%    | 67%     |
| LLVM lines, baseline           | 170310 | 144200  |
| LLVM lines, patched            | 85023  | 71760   |
| LLVM lines, % of baseline      | 50%    | 50%     |

You can increase the number of spawn call sites in the benchmark and the compile time scales linearly, as each line essentially instantiates a new scheduler and future. If the user were to box the scheduler, erasing its type, we would instead scale with the number of distinct return types.
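For illustration, boxing the schedule closure looks roughly like this (a sketch against the plain spawn(future, schedule) API; the placeholder scheduler just runs the task on the spot, which a real executor would not do):

```rust
use async_task::Runnable;

// One concrete schedule type shared by every call site. Without the Box,
// each distinct closure type drags its own copy of async-task's internals
// through monomorphization.
fn boxed_schedule() -> Box<dyn Fn(Runnable) + Send + Sync + 'static> {
    // Placeholder: a real executor would push the runnable onto a run queue.
    Box::new(|runnable: Runnable| {
        runnable.run();
    })
}

fn main() {
    // Two futures with different types, but a single schedule type between them.
    let (runnable_a, task_a) = async_task::spawn(async { 1u32 }, boxed_schedule());
    let (runnable_b, task_b) = async_task::spawn(async { "two" }, boxed_schedule());

    // Hand the runnables to the boxed scheduler; both futures complete on the
    // first poll, so the tasks finish immediately.
    runnable_a.schedule();
    runnable_b.schedule();

    task_a.detach();
    task_b.detach();
}
```

With the erased Box<dyn Fn(Runnable)> schedule, every spawn call site shares one scheduler type, so instantiations scale with the distinct future types only.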
I'm not sure tide is a great benchmark here, as async-task is instantiated only a handful of times in its examples, so the change is unlikely to show up in the timings. I think we could start with an isolated benchmark and go from there.

I would err on the side of disabling the boxing branch with a specific feature flag (or rather, enabling the boxing with a feature flag). That feature could be enabled by default.
Even if the user always boxes their futures, they have to pay the compile-time cost (as then the "boxing branch" tries to instantiate Pin<Box<Pin<Box<F>>>>). In that case being able to disable the implicit boxing altogether would be great for them.
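To make the double boxing concrete (again just a sketch, with a made-up already-boxed future type):

```rust
use std::future::Future;
use std::pin::Pin;

// A user's already type-erased future: two words wide, far below any
// plausible boxing threshold.
type BoxedFut = Pin<Box<dyn Future<Output = u32> + Send + 'static>>;

fn user_future() -> BoxedFut {
    Box::pin(async { 1u32 })
}

fn main() {
    // spawn already sees a Pin<Box<...>>, yet the boxing arm is still
    // monomorphized for Pin<Box<BoxedFut>>, i.e. Pin<Box<Pin<Box<dyn Future>>>>,
    // even though it can never be taken at runtime.
    let schedule = |runnable: async_task::Runnable| {
        runnable.run();
    };
    let (runnable, task) = async_task::spawn(user_future(), schedule);
    runnable.schedule();
    task.detach();
}
```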
I'm pretty positive there are other avenues we could explore to improve compile times further for downstream users. I'm glad that you're open to that. :)