ltrzesniewski/InlineIL.Fody

Proposal: simplified arguments passing and first-class support of in parameters

sakno opened this issue ยท 27 comments

sakno commented

Hi Lucas!

I'm using InlineIL widely in my project .NEXT. Your add-in helping me a lot to write performance-critical code. However, I feel some inconvenience with it.
The first one is an absence of support in-modifiers. Let's take a look at this code:

 public static void Bitcast<T, TResult>(in T input, out TResult output)
            where T : unmanaged
            where TResult : unmanaged
        {
            const string slowPath = "slow";
            Ldarg(nameof(output));
            Sizeof(typeof(T));
            Sizeof(typeof(TResult));
            Blt_Un(slowPath);
            //copy from input into output as-is
            Ldarg(nameof(input));
            Ldobj(typeof(TResult));
            Stobj(typeof(TResult));
            Ret();

            MarkLabel(slowPath);
            Dup();
            Initobj(typeof(TResult));
            Ldarg(nameof(input));
            Ldobj(typeof(T));
            Stobj(typeof(T));
            Ret();
            throw Unreachable();    //output must be defined within scope
        }

I have to use Ldarg(nameof(input)) and Ldarg(nameof(output)) because there is no overloads with out and in parameters. This cause unwanted warnings from Roslyn about unused parameters.

The second issue is a method call. It's verbose. I like Push helper that allows to push result of parameterless method or property on the evaluation stack. However, this approach is not working for call sites when arguments are not locals or parameters. In this case I have to use Call with call site descriptor object. I would like to propose T Arg<T>() and ref T RefArg<T>() helper which allow to load argument from the evaluation stack:

Ldc_I4(10);
Ldc_I4(2);
Push(Math.Max(Arg<int>(), Arg<int>());
return Return<int>();

//equivalent to
ldc.i4 10
ldc.i4 2
call Math.Max(int, int)
ret

Additionally, it would be great if InlineIL will be able to recognize params correctly:

Ldstr("Hello, {0}!");
Ldstr("Albert Wesker");
Push(Console.WriteLine(Arg<string>(), Arg<string>()); //automatically creates object[] array in resulted IL code
Ret();

Hello!

About the in modifier:

I won't change Ldarg as I want the opcode methods to be mapped "cleanly" to what they do in IL, but I can enhance the helper methods in the IL class. IL.Push already has a ref overload, but changing that to an in would be a breaking change. I suppose I could add additional IL.PushIn<T>(in T value) and IL.PushOut<T>(out T value) helpers for this, so you could use them instead of Ldarg.

By the way, I'm not getting any warnings on your Bitcast example, are you talking about actual compiler warnings, or are you getting analyzer warnings in the IDE?

About method calls:

I actually wanted to have something like your proposed T Arg<T>() method, so I implemented a T IL.Pop<T>() method at some point, but I quickly realized that it didn't play well with the code that Roslyn generates. It was very unreliable, as I couldn't really depend on the position in the evaluation stack where the call gets inserted, especially when you began to call it inside larger expressions, so it could easily retrieve an unexpected value.

I replaced that with void IL.Pop<T>(out T value) (and some other void methods), which is much more reliable as it is necessarily in a single statement, so I could work with what Roslyn generates. But it requires a local variable or argument to output its value to.

I added a couple things to reduce the verbosity: if your method has no overloads, you don't need to specify any parameter type, and the single method will get picked up. Besides, I don't think the following is that bad:

Ldc_I4(10);
Ldc_I4(2);
Call(new MethodRef(typeof(Math), nameof(Math.Max), typeof(int), typeof(int)));
return Return<int>();

Can you point me to some code where this is a burden? Maybe I could figure something out.

sakno commented

By the way, I'm not getting any warnings on your Bitcast example, are you talking about actual compiler warnings, or are you getting analyzer warnings in the IDE?

It depends on the FxCop rule set. I'm using custom rule set which warns about unused parameters

I suppose I could add additional IL.PushIn<T>(in T value) and IL.PushOut<T>(out T value)

I'm ok with this.

so I implemented a T IL.Pop<T>()

Yes, it's unreliable. Anyway, when you using pure IL code you're on dangerous path by default. IMO, Arg<T>() just can be replaced with literally nothing when translating to IL because this instruction is just a placeholder. Therefore, translation and analysis are very simple. Another drawback of Call method is a loose of compile-time check for the method presence.

I replaced that with void IL.Pop<T>(out T value)

This way produces unnecessary local variable.

Of course, the original proposal can be modified. For instance, Arg placeholder can be marked with EditorBrowsable attribute to avoid highlights from IDE. Or you can name it as ArgUnsafe to highlight unsafe and unreliable nature of this method. Or just place it to another class named as UnsafeIL or even to different inner class or namespace with Pro, Unsafe words in the name.

sakno commented

One more thing: what about to add generic versions of ldobj, stobj, sizeof etc? It will be less verbose than using typeof keyword. Of course, overloaded methods with Type parameter should still exist to cover ref value types.

Oh, ok so that's FxCop. I suppose IL.PushIn and IL.PushOut would help then. I just added an implementation (it's not released yet).

IMO, Arg<T>() just can be replaced with literally nothing when translating to IL because this instruction is just a placeholder.

That's exactly the implementation of IL.Push, and the former implementation of IL.Pop ๐Ÿ˜‰

Another drawback of Call method is a loose of compile-time check for the method presence.

You don't lose it, InlineIL checks for the method presence and fails the build if it's not there.

Of course, the original proposal can be modified. For instance, Arg placeholder can be marked with EditorBrowsable attribute to avoid highlights from IDE. Or you can name it as ArgUnsafe to highlight unsafe and unreliable nature of this method. Or just place it to another class named as UnsafeIL or even to different inner class or namespace with Pro, Unsafe words in the name.

Yes, that would be a very unsafe method indeed. It would only work if you call a method where all the parameters are Arg<T>().

Honestly I wouldn't be comfortable using it, ever. I'm not even comfortable providing something like that in the API. Unless maybe if I add a check for correct usage (which would make sure you use it for all of the parameters of a method call).

One more thing: what about to add generic versions of ldobj, stobj, sizeof etc? It will be less verbose than using typeof keyword.

I could add that. I'm not sure I'll like the syntax though, it may not feel very consistent with the existing methods.

sakno commented

For Arg<T> I can show you actual example. Here is my code. A little bit verbose, isn't? Arg<T> is perfect in this case because arguments are on the evaluation stack, not in parameters or locals. Moreover, I need to reinterpret managed pointer type (which is allowed by ECMA). Here I'm fully understand what is going on and safety is my problem. Additionally, I'm always inspecting generated IL code with ikdasm.

sakno commented

Also, I see obvious usefulness for ldftn, ldvirtftn, ldtoken. With Arg<T> it's possible to reference the method without MethodRef.

Ok, here are some thoughts about your Compare<G> method, and how to make it shorter:

  1. You can replace new TR(typeof(byte)).MakeByRefType() with typeof(byte).MakeByRefType(). You're not using any TypeRef feature here, so you don't need it.

(As an aside: I kind of regret introducing constructors for TypeRef, MethodRef etc, as I don't like having that new keyword everywhere, but now we're stuck with it. I may add more static factories on these classes later on.)

  1. To me it looks like you've unnecessarily written the following part in IL:
            Sizeof(typeof(T));
            Pop(out uint size);
            Push(size);
            Sizeof(typeof(G));
            Ceq();
            Dup();
            Brfalse(methodExit);
            Pop();

This could have been replaced with C# code like:

if (Unsafe.SizeOf<T>() == Unsafe.SizeOf<G>())

The JIT will always optimize a call to Unsafe.SizeOf<T>() to the equivalent sizeof instruction (especially since the method is flagged for aggressive inlining), and it will evaluate the condition as a constant at JIT time and eliminate one of the branches. You can of course embed the SizeOf method in your assembly. Anyway the JIT will produce the same output.

  1. sizeof is basically free - it's always constant, so there's no need to store it in a local variable.

  2. Similarly, with a custom helper method, you could write something like:

return Intrinsics.Compare(
    ref Unsafe.AsInRef<T, byte>(first),
    ref Unsafe.AsInRef<G, byte>(second),
    Unsafe.SizeOf<T>()
);

You'd need to write an equivalent to Unsafe.As which takes an in parameter instead of ref, something like:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ref TTo AsInRef<TFrom, TTo>(in TFrom source)
{
    Ldarg(nameof(source));
    return ref IL.ReturnRef<TTo>();
}

I'd argue this would be more readable than:

Ldarg(nameof(first));
Ldarg(nameof(second));
Sizeof(typeof(T));
Conv_I8();
Push(Intrinsics.Compare(ref ArgRef<byte>(), ref ArgRef<byte>(), Arg<long>())

Also, I see obvious usefulness for ldftn, ldvirtftn, ldtoken. With Arg<T> it's possible to reference the method without MethodRef.

I'm not sure I understand what you mean here.

One more thing: what about to add generic versions of ldobj, stobj, sizeof etc? It will be less verbose than using typeof keyword.

I added that. I actually like this syntax. ๐Ÿ‘

sakno commented

and it will evaluate the condition as a constant at JIT time and eliminate one of the branches.

Generally, it's not always working in this way. Compare cannot be inlined by itself. As a result, sizeof will ask runtime to return the size of the generic argument. Generics are actually passed as hidden arguments as described in .NET ABI. So it's constant value at runtime, but not in assembly code generally. To solve this issue I have to wait for this.

I'm not sure I understand what you mean here.

I meant this:

Ldtoken(Console.WriteLine(Arg<string>(), Arg<object[]>());

But I forgot that many methods have void return type.

sakno commented

I agree that my code can be rewritten with pure C#. However, I saw somewhere in the open issues for CLR that AggressiveInlining may be ignored in some circumstances (even if the size of IL code is less than 36 bytes). This is like inline in C which recommends the compiler to inline the method body at call site but not enforce this. That's why GCC has extension in the form of always_inline attribute.

The issue you linked does not apply to this case: the T and G type parameters of Compare are constrained to struct types, which means there won't be native code sharing: the JIT will generate a different native method for each combination of type arguments. You're in the best possible scenario for the JIT to optimize out all branches except one (at least if you had written switch (Unsafe.SizeOf<T>()), I'm not sure if the JIT will be able to optimize out switch (size) because of the variable).

Even if the Compare method could handle reference types, and therefore the JIT emitted shareable native code, sizeof would still yield a constant, as it returns the pointer size when asked for the size of any reference type, and struct types get their own native code anyway.

Also, I wouldn't worry too much about AggressiveInlining being ignored in this case: they made a ton of optimizations in .NET Core 3.0 by using Unsafe in many hot paths. These methods are very simple, and none of them gets inlining blocked, it would be a major regression if that happened. Of course, if you're still unsure, you can take a look at the native code emitted by the JIT.

sakno commented

Let's recap our discussion:

  • PushIn and PushOut are mostly expected methods
  • Arg<T> is nice to have. Some API design work required.

Yes, PushIn and PushOut are implemented. But I think I'll rename them to PushInRef and PushOutRef as the current names seem confusing.

Overloads like Stobj<T>() instead of Stobj(TypeRef) are also implemented.

As for Arg<T>... I don't think that's a good idea. ๐Ÿ˜‰

sakno commented

You decide, you're an author of this add-in ๐Ÿ˜‰

Interesting fact: I tested Compare using my benchmarks when it's fully re-implemented without InlineIL using SizeOf<T> and can tell that it's a slower than roundtrip with local variable. I think that JIT unable to apply constant propagation in combination with switch table.

Yes, that's interesting... I wrote a quick SharpLab test which seems to show that Roslyn always goes through a local variable when using the switch IL instruction, and maybe the JIT optimizes for that. The native code is always optimized to a single branch in that case.

But that means your rewritten code still should use a local variable for the switch... Could you send me that benchmark please, if you still have it? I'd be interested in playing a bit with it.

sakno commented

Yup, benchmark result for the current implementation: https://sakno.github.io/dotNext/benchmarks.html
Benchmark code: https://github.com/sakno/dotNext/blob/master/src/DotNext.Benchmarks/BitwiseEqualityBenchmark.cs
You can run benchmark project using dotnet run -c Bench. Bench configuration is the same as Release but doesn't publish packages for NuGet.
The main loss is in GuidBitwiseEqualsMethod bench method. The current implementation can save 4-5 nanos (which is x2) in comparison with the implementation using Unsafe.SizeOf<T>() and other helper methods you mentioned above.

Another proof that constant propagation will not lead to inlining of Equals or Compare is EnumEqualsUsingBitwiseComparer bench. In case of constant propagation, the method should be optimized to a few assembly instructions and have the same performance as EqualityComparer.Default but it's not.

Thanks, I'll take a look at this (maybe tomorrow).

Just a quick note about EqualityComparer<T>.Default: don't expect to be able to beat it for things like enums. It's a devirtualized JIT intrinsic. ๐Ÿ˜„

Compare this demo in .NET Core and .NET Framework targets and see the difference.

I took a quick look and I actually got better results with my implementation:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.720 (1909/November2018Update/19H2)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-BSBVUK : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT DEBUG

BuildConfiguration=Bench LaunchCount=1 RunStrategy=Throughput

Method Mean Error StdDev Code Size
GuidEqualsMethod 0.9211 ns 0.0263 ns 0.0220 ns 140 B
GuidBitwiseEqualsAltMethod 1.9558 ns 0.0449 ns 0.0375 ns 83 B
GuidBitwiseEqualsMethod 2.3840 ns 0.0428 ns 0.0400 ns 83 B

Here's my EqualsAlt method:

public static bool EqualsAlt<G>(in T first, in G second)
    where G : struct
{
    if (UnsafeTools.SizeOf<T>() != UnsafeTools.SizeOf<G>())
        return false;

    switch (UnsafeTools.SizeOf<T>())
    {
        case 0:
            return true;

        case 1:
            return UnsafeTools.AsInRef<T, byte>(first) == UnsafeTools.AsInRef<G, byte>(second);

        case 2:
            return UnsafeTools.AsInRef<T, short>(first) == UnsafeTools.AsInRef<G, short>(second);

        case 4:
            return UnsafeTools.AsInRef<T, int>(first) == UnsafeTools.AsInRef<G, int>(second);

        case 8:
            return UnsafeTools.AsInRef<T, long>(first) == UnsafeTools.AsInRef<G, long>(second);

        default:
            return Intrinsics.EqualsAligned(
                ref UnsafeTools.AsInRef<T, byte>(first),
                ref UnsafeTools.AsInRef<G, byte>(second),
                UnsafeTools.SizeOf<T>()
            );
    }
}

Here's UnsafeTools:

internal static class UnsafeTools
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int SizeOf<T>()
    {
        Sizeof(typeof(T));
        return Return<int>();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static ref TTo AsInRef<TFrom, TTo>(in TFrom source)
    {
        Ldarg(nameof(source));
        return ref ReturnRef<TTo>();
    }
}

I've changed the benchmarks so they return the values, and also updated to BenchmarkDotNet 0.12.1 in order to get a better disassembly with [DisassemblyDiagnoser]:

[Benchmark]
public bool GuidEqualsMethod()
    => NonEmptyGuid.Equals(default);

[Benchmark]
public bool GuidBitwiseEqualsMethod()
    => BitwiseComparer<Guid>.Equals<Guid>(NonEmptyGuid, default);

[Benchmark]
public bool GuidBitwiseEqualsAltMethod()
    => BitwiseComparer<Guid>.EqualsAlt<Guid>(NonEmptyGuid, default);

The difference comes from the fact that the JIT is able to emit a tail call to Intrinsics.EqualsAligned in my version:

; DotNext.BitwiseComparer`1[[System.Guid, System.Private.CoreLib]].EqualsAlt[[System.Guid, System.Private.CoreLib]](System.Guid ByRef, System.Guid ByRef)
       mov       r8d,10
       mov       rax,offset DotNext.Runtime.Intrinsics.EqualsAligned(Byte ByRef, Byte ByRef, Int64)
       jmp       rax
       movzx     eax,al
       ret
; Total bytes of code 23

Whereas it emits a normal call in yours:

; DotNext.BitwiseComparer`1[[System.Guid, System.Private.CoreLib]].Equals[[System.Guid, System.Private.CoreLib]](System.Guid ByRef, System.Guid ByRef)
       sub       rsp,28
       mov       r8d,10
       call      DotNext.Runtime.Intrinsics.EqualsAligned(Byte ByRef, Byte ByRef, Int64)
       movzx     eax,al
       add       rsp,28
       ret
; Total bytes of code 23

I'm pretty sure you'd get the same results if you tweaked the IL a bit, but if the more readable version has the same perf, there's no real reason not to take it. ๐Ÿ˜‰

sakno commented

Many thanks for such investigation!! I owe you a beer ๐Ÿบ I didn't use multiple returns due to absence of this optimization. Now I see that it's not a problem because just one branch survive after optimization.

You're welcome! ๐Ÿ˜„ I didn't do much BTW, I just stopped after the first iteration. ๐Ÿ˜‰

Actually, I thought about it a bit more and noticed it was strange that there was no inlining happening. Then, I noticed that your benchmark code actually runs in Debug mode! ๐Ÿ˜ฑ

See, you added WithCustomBuildConfiguration("Bench") to the benchmark configuration, but BenchmarkDotNet builds its own assembly from a generated project, which does not know about the Bench configuration, so you end up with the default Debug mode.

You can add WithArguments(new[] { new MsBuildArgument("/p:Optimize=true") }) to force it to optimize, or specify a condition for the Bench configuration in a Directory.Build.props file.

I'm actually getting worse results with that change, but I didn't look into it further just yet.

sakno commented

Hmm, WithCustomBuildConfiguration("Bench") doing exactly what I expected. Otherwise, I would see warnings from DotNetBenchmark. Optimization flag for Bench build config is described in csproj file.

Disassembly diagnoser is a luxury for me because I'm sitting on Linux and it is not supported for .NET Core, only for Mono.

Yes, I didn't notice it at first because I expected BDN to complain, but I guess it doesn't since you're basically doing a custom build. It writes X64 RyuJIT DEBUG in the output though..

The disassembly diagnoser has been massively improved and is now cross-platform in BDN 0.12.1. ๐Ÿ˜‰

sakno commented

Also I found that switch expression produces worse result rather than switch statement. In IL I found that Roslyn generates redundant temporary variables.

sakno commented

dotnet crashed with SIGABRT Unfortunately disassembler is not working, at least as easy as on Windows ๐Ÿ˜ข

I guess you should report an issue then, as it was supposed to work. Maybe some configuration in your benchmarks is preventing it from working.

sakno commented

Yes, sure. I need to inspect crash dump before this. It takes some time.

Also, you was right about configuration. I fixed everything and found the explanation why optimized build demonstrates worse result. It was because of my unawareness about how to correctly write benchmarks. If benchmark checks something that has return value then benchmark method should have return type as well.

sakno commented

@ltrzesniewski , thanks a lot!