Proposal: simplified arguments passing and first-class support of in parameters
sakno opened this issue ยท 27 comments
Hi Lucas!
I'm using InlineIL widely in my project .NEXT. Your add-in helping me a lot to write performance-critical code. However, I feel some inconvenience with it.
The first one is an absence of support in-modifiers. Let's take a look at this code:
public static void Bitcast<T, TResult>(in T input, out TResult output)
where T : unmanaged
where TResult : unmanaged
{
const string slowPath = "slow";
Ldarg(nameof(output));
Sizeof(typeof(T));
Sizeof(typeof(TResult));
Blt_Un(slowPath);
//copy from input into output as-is
Ldarg(nameof(input));
Ldobj(typeof(TResult));
Stobj(typeof(TResult));
Ret();
MarkLabel(slowPath);
Dup();
Initobj(typeof(TResult));
Ldarg(nameof(input));
Ldobj(typeof(T));
Stobj(typeof(T));
Ret();
throw Unreachable(); //output must be defined within scope
}
I have to use Ldarg(nameof(input))
and Ldarg(nameof(output))
because there is no overloads with out and in parameters. This cause unwanted warnings from Roslyn about unused parameters.
The second issue is a method call. It's verbose. I like Push
helper that allows to push result of parameterless method or property on the evaluation stack. However, this approach is not working for call sites when arguments are not locals or parameters. In this case I have to use Call
with call site descriptor object. I would like to propose T Arg<T>()
and ref T RefArg<T>()
helper which allow to load argument from the evaluation stack:
Ldc_I4(10);
Ldc_I4(2);
Push(Math.Max(Arg<int>(), Arg<int>());
return Return<int>();
//equivalent to
ldc.i4 10
ldc.i4 2
call Math.Max(int, int)
ret
Additionally, it would be great if InlineIL will be able to recognize params correctly:
Ldstr("Hello, {0}!");
Ldstr("Albert Wesker");
Push(Console.WriteLine(Arg<string>(), Arg<string>()); //automatically creates object[] array in resulted IL code
Ret();
Hello!
About the in
modifier:
I won't change Ldarg
as I want the opcode methods to be mapped "cleanly" to what they do in IL, but I can enhance the helper methods in the IL
class. IL.Push
already has a ref
overload, but changing that to an in
would be a breaking change. I suppose I could add additional IL.PushIn<T>(in T value)
and IL.PushOut<T>(out T value)
helpers for this, so you could use them instead of Ldarg
.
By the way, I'm not getting any warnings on your Bitcast
example, are you talking about actual compiler warnings, or are you getting analyzer warnings in the IDE?
About method calls:
I actually wanted to have something like your proposed T Arg<T>()
method, so I implemented a T IL.Pop<T>()
method at some point, but I quickly realized that it didn't play well with the code that Roslyn generates. It was very unreliable, as I couldn't really depend on the position in the evaluation stack where the call gets inserted, especially when you began to call it inside larger expressions, so it could easily retrieve an unexpected value.
I replaced that with void IL.Pop<T>(out T value)
(and some other void
methods), which is much more reliable as it is necessarily in a single statement, so I could work with what Roslyn generates. But it requires a local variable or argument to output its value to.
I added a couple things to reduce the verbosity: if your method has no overloads, you don't need to specify any parameter type, and the single method will get picked up. Besides, I don't think the following is that bad:
Ldc_I4(10);
Ldc_I4(2);
Call(new MethodRef(typeof(Math), nameof(Math.Max), typeof(int), typeof(int)));
return Return<int>();
Can you point me to some code where this is a burden? Maybe I could figure something out.
By the way, I'm not getting any warnings on your Bitcast example, are you talking about actual compiler warnings, or are you getting analyzer warnings in the IDE?
It depends on the FxCop rule set. I'm using custom rule set which warns about unused parameters
I suppose I could add additional
IL.PushIn<T>(in T value)
andIL.PushOut<T>(out T value)
I'm ok with this.
so I implemented a
T IL.Pop<T>()
Yes, it's unreliable. Anyway, when you using pure IL code you're on dangerous path by default. IMO, Arg<T>()
just can be replaced with literally nothing when translating to IL because this instruction is just a placeholder. Therefore, translation and analysis are very simple. Another drawback of Call
method is a loose of compile-time check for the method presence.
I replaced that with
void IL.Pop<T>(out T value)
This way produces unnecessary local variable.
Of course, the original proposal can be modified. For instance, Arg
placeholder can be marked with EditorBrowsable
attribute to avoid highlights from IDE. Or you can name it as ArgUnsafe
to highlight unsafe and unreliable nature of this method. Or just place it to another class named as UnsafeIL
or even to different inner class or namespace with Pro, Unsafe words in the name.
One more thing: what about to add generic versions of ldobj, stobj, sizeof etc? It will be less verbose than using typeof
keyword. Of course, overloaded methods with Type
parameter should still exist to cover ref value types.
Oh, ok so that's FxCop. I suppose IL.PushIn
and IL.PushOut
would help then. I just added an implementation (it's not released yet).
IMO,
Arg<T>()
just can be replaced with literally nothing when translating to IL because this instruction is just a placeholder.
That's exactly the implementation of IL.Push
, and the former implementation of IL.Pop
๐
Another drawback of
Call
method is a loose of compile-time check for the method presence.
You don't lose it, InlineIL checks for the method presence and fails the build if it's not there.
Of course, the original proposal can be modified. For instance,
Arg
placeholder can be marked withEditorBrowsable
attribute to avoid highlights from IDE. Or you can name it asArgUnsafe
to highlight unsafe and unreliable nature of this method. Or just place it to another class named asUnsafeIL
or even to different inner class or namespace with Pro, Unsafe words in the name.
Yes, that would be a very unsafe method indeed. It would only work if you call a method where all the parameters are Arg<T>()
.
Honestly I wouldn't be comfortable using it, ever. I'm not even comfortable providing something like that in the API. Unless maybe if I add a check for correct usage (which would make sure you use it for all of the parameters of a method call).
One more thing: what about to add generic versions of ldobj, stobj, sizeof etc? It will be less verbose than using
typeof
keyword.
I could add that. I'm not sure I'll like the syntax though, it may not feel very consistent with the existing methods.
For Arg<T>
I can show you actual example. Here is my code. A little bit verbose, isn't? Arg<T>
is perfect in this case because arguments are on the evaluation stack, not in parameters or locals. Moreover, I need to reinterpret managed pointer type (which is allowed by ECMA). Here I'm fully understand what is going on and safety is my problem. Additionally, I'm always inspecting generated IL code with ikdasm.
Also, I see obvious usefulness for ldftn
, ldvirtftn
, ldtoken
. With Arg<T>
it's possible to reference the method without MethodRef
.
Ok, here are some thoughts about your Compare<G>
method, and how to make it shorter:
- You can replace
new TR(typeof(byte)).MakeByRefType()
withtypeof(byte).MakeByRefType()
. You're not using anyTypeRef
feature here, so you don't need it.
(As an aside: I kind of regret introducing constructors for TypeRef
, MethodRef
etc, as I don't like having that new
keyword everywhere, but now we're stuck with it. I may add more static factories on these classes later on.)
- To me it looks like you've unnecessarily written the following part in IL:
Sizeof(typeof(T));
Pop(out uint size);
Push(size);
Sizeof(typeof(G));
Ceq();
Dup();
Brfalse(methodExit);
Pop();
This could have been replaced with C# code like:
if (Unsafe.SizeOf<T>() == Unsafe.SizeOf<G>())
The JIT will always optimize a call to Unsafe.SizeOf<T>()
to the equivalent sizeof
instruction (especially since the method is flagged for aggressive inlining), and it will evaluate the condition as a constant at JIT time and eliminate one of the branches. You can of course embed the SizeOf
method in your assembly. Anyway the JIT will produce the same output.
-
sizeof
is basically free - it's always constant, so there's no need to store it in a local variable. -
Similarly, with a custom helper method, you could write something like:
return Intrinsics.Compare(
ref Unsafe.AsInRef<T, byte>(first),
ref Unsafe.AsInRef<G, byte>(second),
Unsafe.SizeOf<T>()
);
You'd need to write an equivalent to Unsafe.As
which takes an in
parameter instead of ref
, something like:
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ref TTo AsInRef<TFrom, TTo>(in TFrom source)
{
Ldarg(nameof(source));
return ref IL.ReturnRef<TTo>();
}
I'd argue this would be more readable than:
Ldarg(nameof(first));
Ldarg(nameof(second));
Sizeof(typeof(T));
Conv_I8();
Push(Intrinsics.Compare(ref ArgRef<byte>(), ref ArgRef<byte>(), Arg<long>())
Also, I see obvious usefulness for
ldftn
,ldvirtftn
,ldtoken
. WithArg<T>
it's possible to reference the method withoutMethodRef
.
I'm not sure I understand what you mean here.
One more thing: what about to add generic versions of ldobj, stobj, sizeof etc? It will be less verbose than using
typeof
keyword.
I added that. I actually like this syntax. ๐
and it will evaluate the condition as a constant at JIT time and eliminate one of the branches.
Generally, it's not always working in this way. Compare
cannot be inlined by itself. As a result, sizeof
will ask runtime to return the size of the generic argument. Generics are actually passed as hidden arguments as described in .NET ABI. So it's constant value at runtime, but not in assembly code generally. To solve this issue I have to wait for this.
I'm not sure I understand what you mean here.
I meant this:
Ldtoken(Console.WriteLine(Arg<string>(), Arg<object[]>());
But I forgot that many methods have void return type.
I agree that my code can be rewritten with pure C#. However, I saw somewhere in the open issues for CLR that AggressiveInlining
may be ignored in some circumstances (even if the size of IL code is less than 36 bytes). This is like inline in C which recommends the compiler to inline the method body at call site but not enforce this. That's why GCC has extension in the form of always_inline attribute.
The issue you linked does not apply to this case: the T
and G
type parameters of Compare
are constrained to struct types, which means there won't be native code sharing: the JIT will generate a different native method for each combination of type arguments. You're in the best possible scenario for the JIT to optimize out all branches except one (at least if you had written switch (Unsafe.SizeOf<T>())
, I'm not sure if the JIT will be able to optimize out switch (size)
because of the variable).
Even if the Compare
method could handle reference types, and therefore the JIT emitted shareable native code, sizeof
would still yield a constant, as it returns the pointer size when asked for the size of any reference type, and struct types get their own native code anyway.
Also, I wouldn't worry too much about AggressiveInlining
being ignored in this case: they made a ton of optimizations in .NET Core 3.0 by using Unsafe
in many hot paths. These methods are very simple, and none of them gets inlining blocked, it would be a major regression if that happened. Of course, if you're still unsure, you can take a look at the native code emitted by the JIT.
Let's recap our discussion:
PushIn
andPushOut
are mostly expected methodsArg<T>
is nice to have. Some API design work required.
Yes, PushIn
and PushOut
are implemented. But I think I'll rename them to PushInRef
and PushOutRef
as the current names seem confusing.
Overloads like Stobj<T>()
instead of Stobj(TypeRef)
are also implemented.
As for Arg<T>
... I don't think that's a good idea. ๐
You decide, you're an author of this add-in ๐
Interesting fact: I tested Compare
using my benchmarks when it's fully re-implemented without InlineIL using SizeOf<T>
and can tell that it's a slower than roundtrip with local variable. I think that JIT unable to apply constant propagation in combination with switch table.
Yes, that's interesting... I wrote a quick SharpLab test which seems to show that Roslyn always goes through a local variable when using the switch
IL instruction, and maybe the JIT optimizes for that. The native code is always optimized to a single branch in that case.
But that means your rewritten code still should use a local variable for the switch... Could you send me that benchmark please, if you still have it? I'd be interested in playing a bit with it.
Yup, benchmark result for the current implementation: https://sakno.github.io/dotNext/benchmarks.html
Benchmark code: https://github.com/sakno/dotNext/blob/master/src/DotNext.Benchmarks/BitwiseEqualityBenchmark.cs
You can run benchmark project using dotnet run -c Bench
. Bench configuration is the same as Release but doesn't publish packages for NuGet.
The main loss is in GuidBitwiseEqualsMethod
bench method. The current implementation can save 4-5 nanos (which is x2) in comparison with the implementation using Unsafe.SizeOf<T>()
and other helper methods you mentioned above.
Another proof that constant propagation will not lead to inlining of Equals
or Compare
is EnumEqualsUsingBitwiseComparer
bench. In case of constant propagation, the method should be optimized to a few assembly instructions and have the same performance as EqualityComparer.Default
but it's not.
Thanks, I'll take a look at this (maybe tomorrow).
Just a quick note about EqualityComparer<T>.Default
: don't expect to be able to beat it for things like enums. It's a devirtualized JIT intrinsic. ๐
Compare this demo in .NET Core and .NET Framework targets and see the difference.
I took a quick look and I actually got better results with my implementation:
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.720 (1909/November2018Update/19H2)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-BSBVUK : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT DEBUG
BuildConfiguration=Bench LaunchCount=1 RunStrategy=Throughput
Method | Mean | Error | StdDev | Code Size |
---|---|---|---|---|
GuidEqualsMethod | 0.9211 ns | 0.0263 ns | 0.0220 ns | 140 B |
GuidBitwiseEqualsAltMethod | 1.9558 ns | 0.0449 ns | 0.0375 ns | 83 B |
GuidBitwiseEqualsMethod | 2.3840 ns | 0.0428 ns | 0.0400 ns | 83 B |
Here's my EqualsAlt
method:
public static bool EqualsAlt<G>(in T first, in G second)
where G : struct
{
if (UnsafeTools.SizeOf<T>() != UnsafeTools.SizeOf<G>())
return false;
switch (UnsafeTools.SizeOf<T>())
{
case 0:
return true;
case 1:
return UnsafeTools.AsInRef<T, byte>(first) == UnsafeTools.AsInRef<G, byte>(second);
case 2:
return UnsafeTools.AsInRef<T, short>(first) == UnsafeTools.AsInRef<G, short>(second);
case 4:
return UnsafeTools.AsInRef<T, int>(first) == UnsafeTools.AsInRef<G, int>(second);
case 8:
return UnsafeTools.AsInRef<T, long>(first) == UnsafeTools.AsInRef<G, long>(second);
default:
return Intrinsics.EqualsAligned(
ref UnsafeTools.AsInRef<T, byte>(first),
ref UnsafeTools.AsInRef<G, byte>(second),
UnsafeTools.SizeOf<T>()
);
}
}
Here's UnsafeTools
:
internal static class UnsafeTools
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int SizeOf<T>()
{
Sizeof(typeof(T));
return Return<int>();
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ref TTo AsInRef<TFrom, TTo>(in TFrom source)
{
Ldarg(nameof(source));
return ref ReturnRef<TTo>();
}
}
I've changed the benchmarks so they return the values, and also updated to BenchmarkDotNet 0.12.1 in order to get a better disassembly with [DisassemblyDiagnoser]
:
[Benchmark]
public bool GuidEqualsMethod()
=> NonEmptyGuid.Equals(default);
[Benchmark]
public bool GuidBitwiseEqualsMethod()
=> BitwiseComparer<Guid>.Equals<Guid>(NonEmptyGuid, default);
[Benchmark]
public bool GuidBitwiseEqualsAltMethod()
=> BitwiseComparer<Guid>.EqualsAlt<Guid>(NonEmptyGuid, default);
The difference comes from the fact that the JIT is able to emit a tail call to Intrinsics.EqualsAligned
in my version:
; DotNext.BitwiseComparer`1[[System.Guid, System.Private.CoreLib]].EqualsAlt[[System.Guid, System.Private.CoreLib]](System.Guid ByRef, System.Guid ByRef)
mov r8d,10
mov rax,offset DotNext.Runtime.Intrinsics.EqualsAligned(Byte ByRef, Byte ByRef, Int64)
jmp rax
movzx eax,al
ret
; Total bytes of code 23
Whereas it emits a normal call in yours:
; DotNext.BitwiseComparer`1[[System.Guid, System.Private.CoreLib]].Equals[[System.Guid, System.Private.CoreLib]](System.Guid ByRef, System.Guid ByRef)
sub rsp,28
mov r8d,10
call DotNext.Runtime.Intrinsics.EqualsAligned(Byte ByRef, Byte ByRef, Int64)
movzx eax,al
add rsp,28
ret
; Total bytes of code 23
I'm pretty sure you'd get the same results if you tweaked the IL a bit, but if the more readable version has the same perf, there's no real reason not to take it. ๐
Many thanks for such investigation!! I owe you a beer ๐บ I didn't use multiple returns due to absence of this optimization. Now I see that it's not a problem because just one branch survive after optimization.
You're welcome! ๐ I didn't do much BTW, I just stopped after the first iteration. ๐
Actually, I thought about it a bit more and noticed it was strange that there was no inlining happening. Then, I noticed that your benchmark code actually runs in Debug mode! ๐ฑ
See, you added WithCustomBuildConfiguration("Bench")
to the benchmark configuration, but BenchmarkDotNet builds its own assembly from a generated project, which does not know about the Bench
configuration, so you end up with the default Debug mode.
You can add WithArguments(new[] { new MsBuildArgument("/p:Optimize=true") })
to force it to optimize, or specify a condition for the Bench
configuration in a Directory.Build.props
file.
I'm actually getting worse results with that change, but I didn't look into it further just yet.
Hmm, WithCustomBuildConfiguration("Bench")
doing exactly what I expected. Otherwise, I would see warnings from DotNetBenchmark. Optimization flag for Bench
build config is described in csproj file.
Disassembly diagnoser is a luxury for me because I'm sitting on Linux and it is not supported for .NET Core, only for Mono.
Yes, I didn't notice it at first because I expected BDN to complain, but I guess it doesn't since you're basically doing a custom build. It writes X64 RyuJIT DEBUG
in the output though..
The disassembly diagnoser has been massively improved and is now cross-platform in BDN 0.12.1. ๐
Also I found that switch expression produces worse result rather than switch statement. In IL I found that Roslyn generates redundant temporary variables.
dotnet crashed with SIGABRT Unfortunately disassembler is not working, at least as easy as on Windows ๐ข
I guess you should report an issue then, as it was supposed to work. Maybe some configuration in your benchmarks is preventing it from working.
Yes, sure. I need to inspect crash dump before this. It takes some time.
Also, you was right about configuration. I fixed everything and found the explanation why optimized build demonstrates worse result. It was because of my unawareness about how to correctly write benchmarks. If benchmark checks something that has return value then benchmark method should have return type as well.
@ltrzesniewski , thanks a lot!