nats-io/nats.net.v2

Potential issue with codegen for struct types

Opened this issue · 5 comments

mtmk commented

Have mixed feelings about this, would need to check what else has changed.

Main reason being, it looks like we've lowered a generic call further down (i.e. this is now <T> rather than ReadOnlySequence<byte>) and that may have implications around codegen for struct types.

Originally posted by @to11mtm in #303 (comment)

I have been browsing issues and found this one. What kind of codegen implications are you worried about regarding structs? Perhaps I can help.

Usually, the corresponding method bodies get their own monomorphized instantiations with optimal code and all generic-related checks get optimized away - CoreLib relies on this mechanic quite heavily to write zero-cost abstraction code which looks suspiciously like C++ templates.
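
To make that concrete, here is a minimal sketch (hypothetical types, not from this codebase) of what monomorphization buys for struct type arguments:

```csharp
// Hypothetical illustration: for a struct type argument, the JIT emits a
// separate, fully specialized native body in which the interface call
// below devirtualizes and can be inlined.
public interface ICounter { int Next(); }

public struct FastCounter : ICounter
{
    private int _i;
    public int Next() => ++_i; // no boxing, no virtual dispatch for struct T
}

public static class Runner
{
    // Step<FastCounter> gets its own monomorphized body; for reference
    // types, a single shared body with virtual dispatch is used instead.
    public static int Step<T>(ref T counter) where T : ICounter
        => counter.Next();
}
```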

> I have been browsing issues and found this one. What kind of codegen implications are you worried about regarding structs? Perhaps I can help.
>
> Usually, the corresponding method bodies get their own monomorphized instantiations with optimal code and all generic-related checks get optimized away - CoreLib relies on this mechanic quite heavily to write zero-cost abstraction code which looks suspiciously like C++ templates.

The concern here is that the lower down the call stack the monomorphized calls live, the more unique code bodies exist. The lesser impact is that type args may have to get passed around more; the more important concern is that in scenarios where consumers are primarily using a variety of different struct types in their app, the number of monomorphized instantiations may be detrimental to code caches, etc.

This may be a scenario worth its own benchmarks (it could very well be not worth worrying about). That would give us more data and potentially show whether there are other optimizations to be had in the call. One option that sometimes helps is taking the 'non-generic' sections of a method and moving them into separate calls, so the monomorphized code is smaller and the common bits stay in cache.
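
To make that outlining idea concrete, a hypothetical sketch (the split and the names are mine, not the actual NATS protocol writer):

```csharp
// Keep the generic part of the method tiny and push the type-independent
// work into a shared non-generic helper, so each struct instantiation
// duplicates only a few bytes of code.
using System;
using System.Buffers;

public static class ProtocolWriter
{
    // Small generic shim: one copy per struct T, but it only serializes.
    public static void WritePublish<T>(IBufferWriter<byte> output, T value,
                                       Action<IBufferWriter<byte>, T> serialize)
    {
        serialize(output, value); // monomorphized, type-specific
        WriteFraming(output);     // shared, non-generic
    }

    // Non-generic core: a single native body shared by every instantiation.
    private static void WriteFraming(IBufferWriter<byte> output)
    {
        var span = output.GetSpan(2);
        span[0] = (byte)'\r';
        span[1] = (byte)'\n';
        output.Advance(2);
    }
}
```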

It seems there is quite a bit going on so let me address these one by one 😄

  • struct generic instantiations do not create more branches (unless you mean manually written code which does unbox + switch on type)
  • the number of generic instantiations is something that even CoreLib does not care about in 95% of situations
  • CPU cache does not care about the number of generic instantiations either; it is populated along the execution path guided by the branch predictor, and is in any case large enough that this is not a concern in .NET
  • optimizations in dotnet/runtime care about code size specifically because smaller size usually means less work done, which is exactly what struct generics achieve since they eliminate the generic dispatch cost that exists for objects (when methods are called on them)

What actually does matter, however, is large method bodies that may cause the JIT to regress by running into either its inliner or locals-tracked budget, which causes codegen quality to degrade significantly. For these, it is optimal to simply have high-level throughput/latency benchmarks with spot checks of codegen against hot paths detected by a profiler (which you probably know better than I do :) )
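
For example, a minimal BenchmarkDotNet harness along those lines might look like this (illustrative names; the publish path is stubbed out, so treat it as a template rather than a real measurement):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public struct PayloadA { public int Value; }
public struct PayloadB { public int Value; }

[DisassemblyDiagnoser] // dumps the JIT'd asm for spot-checking codegen
public class PublishBenchmarks
{
    [Benchmark]
    public int StructA() => Write(new PayloadA { Value = 1 });

    [Benchmark]
    public int StructB() => Write(new PayloadB { Value = 2 });

    // Stand-in for the real publish path; each struct T gets its own body.
    private static int Write<T>(T payload) where T : struct
        => payload.GetHashCode();

    public static void Main() => BenchmarkRunner.Run<PublishBenchmarks>();
}
```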

> struct generic instantiations do not create more branches (unless you mean manually written code which does unbox + switch on type)

Poor wording on my part. What I mean is that, to the best of my understanding (I could be wrong here), `.WritePublish<StructSizeOfInt32One>()` and `.WritePublish<StructSizeOfInt32Two>()` would each get their own monomorphized implementation. My understanding (again, I could be wrong) is that the old code wrote to a buffer, such that each serialized write to the command queue was monomorphized, but the actual protocol writer was *not*; there was only the implementation for the enclosed `ReadOnlyMemory` (or other wrapping construct).
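
A rough sketch of the two shapes being compared, with made-up names standing in for the real writer:

```csharp
// Illustrative only, not the actual nats.net.v2 code.
using System;
using System.Buffers;

public static class OldShape
{
    // Only the serialization step was generic; the protocol writer itself
    // saw plain bytes, so there was one shared body for it.
    public static void Publish<T>(T value, Func<T, ReadOnlySequence<byte>> serialize)
        => WritePublish(serialize(value));

    private static void WritePublish(ReadOnlySequence<byte> payload)
    {
        // large protocol-writing body, compiled once and shared
    }
}

public static class NewShape
{
    // The whole protocol writer is generic: for struct type arguments,
    // the JIT emits a separate copy of this body per T.
    public static void WritePublish<T>(T value)
    {
        // large protocol-writing body, duplicated per struct instantiation
    }
}
```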

> What actually does matter, however, is large method bodies that may cause the JIT to regress by running into either its inliner or locals-tracked budget, which causes codegen quality to degrade significantly. For these, it is optimal to simply have high-level throughput/latency benchmarks

Per the above, this is where my concern lies and where benchmarks would help. As it stands, the method in question is ~100 lines (including spacing), so there is the question of whether anything can be done to help.

> with spot checks of codegen against hot paths detected by a profiler (which you probably know better than I do :) )

More than happy to defer to your judgement if this is a nothingburger. I'll happily admit that when I can't compare the ASM in SharpLab (and, shamefully, my dotTrace workflow is atrocious), I just try to write some fair benchmarks and draw the line there.

> Poor wording on my part. What I mean is that, to the best of my understanding (I could be wrong here), `.WritePublish<StructSizeOfInt32One>()` and `.WritePublish<StructSizeOfInt32Two>()` would each get their own monomorphized implementation. My understanding (again, I could be wrong) is that the old code wrote to a buffer, such that each serialized write to the command queue was monomorphized, but the actual protocol writer was *not*; there was only the implementation for the enclosed `ReadOnlyMemory` (or other wrapping construct).

If the access is synchronized, it's no different from two separate methods calling it. Perhaps another mental model you could apply to this scenario: struct generics work the same way as object generics, except they are a zero-cost abstraction. They do not introduce behavioral differences in the program.
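
To illustrate (hypothetical code): two struct instantiations calling into shared state behave exactly like two hand-written methods calling into it.

```csharp
using System.Threading;

public static class SharedWriter
{
    private static int _counter; // shared state, same synchronization story either way

    // FromGeneric<int>() and FromGeneric<long>() touch _counter the same
    // way HandWrittenA() and HandWrittenB() do; only codegen differs.
    public static void FromGeneric<T>() where T : struct
        => Interlocked.Increment(ref _counter);

    public static void HandWrittenA() => Interlocked.Increment(ref _counter);
    public static void HandWrittenB() => Interlocked.Increment(ref _counter);
}
```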