dubiousconst282/DistIL

Invalid optimization for `Enumerable.Range(...).ToArray()`

Opened this issue · 4 comments

First and foremost, thank you for writing this library. This is such a cool project. I have been impressed by the lengths it goes to in optimizing LINQ, and by the way it does so by acting like a proper compiler rather than a simple macro rewriter.

However, a quick evaluation of the generated logic revealed an issue with one of the optimizations:

public int[] ToArray()
{
	return Enumerable.Range(0, Length).ToArray();
}

// Same source code as above
[Optimize]
public int[] ToArrayOpt()
{
	int length = Length;
	if (!(length < 0 || (ulong)length > 2147483647uL))
	{
		List<int> list = new List<int>(length);
		for (int i = 0; i < length; i++)
		{
			list.Add(i);
		}
		return list.ToArray();
	}
	throw new ArgumentOutOfRangeException();
}

This misses that:
a) `RangeIterator.ToArray()` has historically taken advantage of knowing its length when materializing to an array or list;
b) .NET 8 gained the `CollectionsMarshal.SetCount(list, count)` method and uses it heavily, together with `CollectionsMarshal.AsSpan()`, to optimize LINQ internals that deal with `List<T>`;
c) `RangeIterator.Fill(span)`, which is used to initialize the `ToArray` and `ToList` results, has been vectorized.
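
The .NET 8 pattern from point (b) can be sketched roughly as follows (an illustration of the idea, not the actual LINQ internals):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static List<int> RangeToListSketch(int start, int count)
{
    var list = new List<int>();
    // .NET 8: grow the list's size once instead of repeated Add() calls.
    CollectionsMarshal.SetCount(list, count);
    // Write directly into the list's backing array through a span.
    Span<int> span = CollectionsMarshal.AsSpan(list);
    for (int i = 0; i < span.Length; i++)
        span[i] = start + i;
    return list;
}
```

`SetCount` skips the per-`Add` growth checks, and `AsSpan` exposes the backing array, so the fill loop (or a vectorized `Fill`) can write contiguously.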

If you don't have time to address this for now, I may get around to contributing a fix myself after familiarizing myself with the project's internals (once I finish another project). I'm opening this issue nonetheless so it isn't forgotten, or in case you do have time to update the implementation to account for these recent improvements. Thanks again!

Thanks for your interest and kind words :)

I believe the first two issues have been addressed by 76f1345, but I haven't updated the NuGet package since (will do shortly). Lists are covered as well, by accessing the private fields directly (thanks to IgnoreAccessChecksToAttribute). I think that's quite safe to do, but these accesses could be changed to CollectionsMarshal and similar calls with a bit more effort (though some other places depend on this as well).

Here's the output from the latest commit (it's not perfect either but presumably better):

    public int[] RangeToArray(int length)
    {
        if (!(length < 0 || (ulong)length > 2147483647uL))
        {
            int[] array = new int[length];
            int num = 0;
            int num2 = 0;
            while (num < length)
            {
                int num3 = num2 + 1;
                array[num2] = num;
                num++;
                num2 = num3;
            }
            return array;
        }
        throw new ArgumentOutOfRangeException();
    }
    
    public List<int> RangeToList(int length)
    {
        if (!(length < 0 || (ulong)length > 2147483647uL))
        {
            int[] array = new int[length];
            int num = 0;
            int num2 = 0;
            while (num < length)
            {
                int num3 = num2 + 1;
                array[num2] = num;
                num++;
                num2 = num3;
            }
            List<int> list = new List<int>();
            list._items = array;
            list._size = num2;
            return list;
        }
        throw new ArgumentOutOfRangeException();
    }

But considering issue (c), it might be best to disable this transformation entirely for `Range().ToArray()` by adding a check here (it should be quite trivial to do; I'll check before updating the package):

    private static bool IsProfitableToExpand(LinqSourceNode source, LinqSink sink)
    {
        // Unfiltered Count()/Any() is not profitable because we scan the entire source.
        if (sink.SubjectCall is { NumArgs: 1, Method.Name: "Count" or "Any" }) {
            return false;
        }
        // Concretizing enumerator sources may not be profitable because
        // Linq can special-case source types and defer to e.g. Array.Copy().
        // Similarly, expanding an enumerator source to a loop sink is an expansive no-op.
        if (source is EnumeratorSource && source.Drain == sink) {
            return sink is not (ConcretizationSink or LoopSink);
        }
        return true;
    }
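
For illustration, the effect of such a check could be modeled as below. This is a standalone sketch with the node kinds flattened to strings, since the real DistIL node classes (including whatever the `Range()` source node is actually called) live inside the transform:

```csharp
using System;

// Sketch of the proposed heuristic with node kinds flattened to strings.
static bool IsProfitableToExpandSketch(string sourceKind, string sinkKind)
{
    // Proposed addition: leave Range().ToArray()/ToList() to the BCL,
    // since RangeIterator.Fill(span) is vectorized (point c above).
    if (sourceKind == "Range" && sinkKind == "Concretization")
        return false;
    // Existing rule (simplified): concretizing or looping over a bare
    // enumerator source is not profitable.
    if (sourceKind == "Enumerator")
        return sinkKind is not ("Concretization" or "Loop");
    return true;
}
```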

A note on applying this optimization: since the biggest speed-up is relevant to .NET 8 only, I believe it will still be more efficient to keep the range initialization as is, provided the final IL writes `_items` directly as you noted and reimplements `ToArray()` (for both array and list, with the privately accessed field).

And lastly, the vectorized loop shape in the runtime itself is a bit conservative (understandably, since having vectorization in LINQ in any form is already a luxury), so DistIL could do better both for short range initialization, by skipping the iterator allocation, and by producing a loop that computes and stores two (at most 256-bit) vectors per iteration, plus alignment handling for long lengths (due to a penalty on ARM64).
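
For concreteness, the basic vectorized range-fill shape under discussion looks roughly like this (a sketch using `Vector256` directly; it does one store per iteration and skips the alignment handling, i.e. it illustrates the conservative shape, not the improved one):

```csharp
using System;
using System.Runtime.Intrinsics;

static int[] VectorizedRangeSketch(int length)
{
    var array = new int[length];
    var step = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
    int i = 0;
    // Main loop: one 8-lane (256-bit) store per iteration.
    for (; i <= length - Vector256<int>.Count; i += Vector256<int>.Count)
        (Vector256.Create(i) + step).CopyTo(array, i);
    // Scalar tail for the remaining elements.
    for (; i < length; i++)
        array[i] = i;
    return array;
}
```

Note that `Vector256` operations stay functional through a software fallback on hardware without 256-bit SIMD, just without the speed-up.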

All in all, I'm not sure this is worth the effort, given that `Range(...).ToArray()` is quite common but appears mostly in tests. Also, I don't know what DistIL's policy is regarding the (sometimes significant) differences in optimizations between TFMs: does it assume best effort for the latest version of .NET, with previous targets being incompatible or regressed, or can the changes be gated behind TFM checks? (It's kind of a pain to track these, but it's a nice problem to have.)

Right, I totally dismissed the iterator allocations and the virtual dispatch there, so it's now a missed optimization for short ranges...

I think it would be a bit difficult to do vectorization inside the LINQ expansion transform, because the pipeline representation and code generator are built around individual elements, like the IEnumerable interface. There are several problems with the dedicated vectorization pass (it's basically a quick hack), but I think the only blockers for this particular case are the duplicated counters and the loop check not being against the array length. Ideally these things would be handled by other generic passes, but plain pattern matching could probably get there more easily in the short term.

There are currently no assumptions or tuning for specific TFMs, nor any kind of backwards-compat handling, but those would indeed be quite nice to have, at least for the few most recent supported runtimes. I had plans to implement global assumption flags for things like hardware targets (64-bit archs, AVX/NEON, little-endianness, and others), so this would not be beyond the scope.

FWIW though, I think really aggressive optimizations are fine as long as they don't wreck or slow down everything too much, but it's also fine if some things are left unchanged, like in this case - I'm just not sure either whether it's worth the effort, considering it's a relatively niche method and all the other problems at hand :')

I managed to get a basic transform working that removes the duplicated loop counters, and the vectorizer now handles Range() about as expected. Skip/Take (and maybe other stages) could take advantage of this too, but they currently count from the start/end within a branch, and there's potential overflow to worry about, so I didn't bother changing them to use canonical counters.

I also added another heuristic for Range().ToXX() that enables expansion when the range is constant and <= 64. This seems like a reasonable threshold based on your benchmarks, but I haven't really tested it. Having loop unrolling plus constant folding for vectors would probably make it fully optimal (or almost)...

public static int[] RangeConst()
{
    return Enumerable.Range(0, 16).ToArray();
}
public static int[] RangeDyn2(int count)
{
    int j = 0;
    return Enumerable.Range(0, count).Select(x => x * 5 + j++).ToArray();
}

public static int[] RangeConst()
{
    int[] array = new int[16];
    ref int arrayDataReference = ref MemoryMarshal.GetArrayDataReference(array);
    for (int i = 0; i < 9; i += 8)
    {
        Unsafe.WriteUnaligned(
            ref Unsafe.As<int, byte>(ref Unsafe.Add(ref arrayDataReference, i)),
            Vector256.Create(i) + Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7));
    }
    return array;
}
public static int[] RangeDyn2(int count)
{
    if (!(count < 0 || (ulong)count > 2147483647uL))
    {
        int[] array = new int[count];
        ref int arrayDataReference = ref MemoryMarshal.GetArrayDataReference(array);
        int num = count - 7;
        int i;
        for (i = 0; i < num; i += 8)
        {
            Unsafe.WriteUnaligned(
                ref Unsafe.As<int, byte>(ref Unsafe.Add(ref arrayDataReference, i)), 
                (Vector256.Create(i) + Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7)) * Vector256.Create(5) + 
                (Vector256.Create(i) + Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7)));
        }
        while (i < count)
        {
            num = i + 1;
            Unsafe.Add(ref arrayDataReference, i) = i * 5 + i;
            i = num;
        }
        return array;
    }
    throw new ArgumentOutOfRangeException();
}