dubiousconst282/DistIL

Block level SLP auto-vectorization

Closed this issue · 1 comment

The SLP approach seems to allow for more flexibility than the standard inner-loop transform, and although it is more complicated to implement, the infrastructure could be reused for loop vectorization with relatively little effort.

Aliasing seems to be a significant blocker to auto-vec. Because variables may refer to the same memory region at runtime, it's not trivial to determine whether the vectorized code is legal (e.g. in the presence of serial/dependent operations). A simple solution for this is to insert runtime aliasing guards, but that may only be reasonable for loops.
Additionally, it would be nice to have an intrinsic API like OptimizerHints.AsRestrict(T).
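
To make the aliasing problem concrete, here is a minimal C sketch (the names and the 4-wide "vector" modeled with a temp buffer are illustrative, not DistIL code): when the destination overlaps the source, the scalar loop's stores feed later loads, so hoisting all lanes' loads above the stores changes the result. A runtime overlap guard dispatches to the scalar fallback, which is essentially what a cloned-and-guarded loop would do.

```c
#include <string.h>

// Scalar loop: stores may feed later loads when dst and src overlap.
static void scale_scalar(int *dst, const int *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

// "Vectorized" version, modeled with a 4-wide chunk copied up front.
// Only legal when the regions do not overlap within the vector width.
static void scale_vec4(int *dst, const int *src, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        int tmp[4];
        memcpy(tmp, &src[i], sizeof tmp);  // all 4 loads happen before any store
        for (int j = 0; j < 4; j++)
            dst[i + j] = tmp[j] * 2;
    }
    for (; i < n; i++)
        dst[i] = src[i] * 2;
}

// Runtime aliasing guard: fall back to the scalar loop when the regions
// may overlap, otherwise take the vectorized path.
static void scale_guarded(int *dst, const int *src, int n) {
    if (dst + n <= src || src + n <= dst)
        scale_vec4(dst, src, n);
    else
        scale_scalar(dst, src, n);
}
```

With `dst = buf + 1, src = buf`, the scalar and 4-wide versions produce different results, which is exactly why the transform is illegal without the guard (or a `restrict` promise from the user).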

TODOs

  • SLP vectorizer:

    • Initial implementation
    • Recognize store seeds
      • Pointers only. In the future we may want to consider struct field accesses (esp. for equality methods)
    • Recognize reduction seeds (is this even worthwhile outside of loops?)
    • Vector code generation
    • Basic cost estimation
    • Legalization (mem aliasing)
  • Operations

    • Pointer loads
    • Basic binops (arithmetic and bitwise)
      • Handle implicit widening (e.g. conv.u16/store.u16 (binop x, y -> i32))
    • Math.{Min, Max, Abs, Sqrt, Floor, Ceiling}
      • Round, Fmadd (no xplat intrinsics for these)
    • Comparison and selects
    • Conversions
      • Handling widening/narrowing seems to be complicated
  • What else?
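
The narrowing-safety rule for implicit widening can be checked directly: truncation commutes with add/sub/mul and the bitwise ops (they are well-defined modulo 2^16), so `conv.u16 (binop x, y)` can be computed entirely in 16-bit lanes for those ops, but not for e.g. division. A small C sketch (illustrative, not DistIL code):

```c
#include <stdint.h>

// conv.u16(add x, y) can be computed entirely in 16 bits: truncation
// commutes with add/sub/mul and bitwise ops (modular arithmetic).
static uint16_t add_then_narrow(int32_t x, int32_t y) {
    return (uint16_t)(x + y);
}
static uint16_t narrow_then_add(int32_t x, int32_t y) {
    return (uint16_t)((uint16_t)x + (uint16_t)y);
}

// Division is a counterexample: high bits of the inputs affect the
// low bits of the result, so the binop cannot simply be narrowed.
static uint16_t div_then_narrow(int32_t x, int32_t y) {
    return (uint16_t)(x / y);
}
static uint16_t narrow_then_div(int32_t x, int32_t y) {
    return (uint16_t)((uint16_t)x / (uint16_t)y);
}
```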

Extras:

  • Basic if-conversion

    • Code gen support for SelectInst
    • Handle non-diamond graphs through path duplication (for empty blocks only)
  • Basic loop unrolling (single block only)

    • We probably want to tune it better around some carried-dependence heuristic.
  • Consider introducing a GetElementPtr/LEA instruction, because recognizing indexing expressions is tricky. Having this could also help consolidate the load/store instructions for arrays and fields.
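
As a sketch of what such a node buys: a GEP/LEA-style instruction folds `base + index * scale` into a single node, so passes match one instruction instead of a conv/mul/add chain. The `Gep` record below is hypothetical and only illustrates the addressing shape:

```c
#include <stddef.h>

// A GetElementPtr/LEA-style node: one instruction carrying base,
// index, and element scale, instead of a conv/mul/add chain.
typedef struct {
    void     *base;
    ptrdiff_t index;
    size_t    scale;  // element size, e.g. sizeof(float)
} Gep;

// Materializing the address: base + index * scale.
static void *gep_addr(Gep g) {
    return (char *)g.base + g.index * (ptrdiff_t)g.scale;
}
```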

  • Consider first-class support for vector types in the IR. This may not be that valuable beyond enabling basic peephole optimizations.

Research

Sample cases where it would ideally be applied

Typical SLP target:

//Ideally, these pointers would have concrete aggregate types such as Vector4 and Matrix4x4,
//but handling fields should be done separately from the initial implementation.
public static void SLP_MatrixTransform(float* pos, float* mat, int count) {
    //This could be made redundant by making SLP aware of the loop, so it could clone and guard the vectorized version
    pos = OptimizerHints.AsRestrict(pos);
    
    for (int i = 0; i < count; i++) {
        float x = pos[0], y = pos[1], z = pos[2], w = pos[3];
        
        pos[0] = x * mat[0] + y * mat[4] + z * mat[ 8] + w * mat[12];
        pos[1] = x * mat[1] + y * mat[5] + z * mat[ 9] + w * mat[13];
        pos[2] = x * mat[2] + y * mat[6] + z * mat[10] + w * mat[14];
        pos[3] = x * mat[3] + y * mat[7] + z * mat[11] + w * mat[15];
        
        pos += 4;
        mat += 16;
    }
}
//Ideal output
public static void SLP_MatrixTransform(float* pos, float* mat, int count) {
    for (int i = 0; i < count; i++) {
        float x = pos[0], y = pos[1], z = pos[2], w = pos[3];
        
        //It shouldn't really matter whether we use custom operators or static intrinsics (op_Addition vs Vector128.Add),
        //though the former would make decompiled code significantly easier to read.
        //Load()/LoadUnsafe() methods should also have no significant difference at the IL level,
        //but using the most common methods could help minimize generic instantiations in the final module.
        var res =
            Vector128.Create(x) * Vector128.Load(&mat[0]) +
            Vector128.Create(y) * Vector128.Load(&mat[4]) +
            Vector128.Create(z) * Vector128.Load(&mat[8]) +
            Vector128.Create(w) * Vector128.Load(&mat[12]);
        Vector128.Store(res, &pos[0]);
        
        pos += 4;
        mat += 16;
    }
}

A reasonable reduction loop:

//Original
int CountLetters(string text) {
    return text.Count(ch => ch is >= 'A' and <= 'Z');
}

//After LinqExpansion, DevirtLambdas, and Inlining
int CountLetters(string text) {
    char& start = text.GetPinnableReference();
    char& end = start + text.Length;
    int count = 0;

    for (; start < end; start++) {
        char ch = *start;
        if (ch >= 'A' && ch <= 'Z') {
            count++;
        }
    }
    return count;
}

//After SimplifyCFG* (If Conversion)
count += (ch >= 'A' & ch <= 'Z');

//After SimplifyInsts*
count += (uint)(ch - 'A') < 26;
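
The folded form works because a `ch` below 'A' wraps around to a huge unsigned value, so a single unsigned compare covers both bounds. A quick C sketch of the equivalence (illustrative):

```c
#include <stdint.h>

// Two compares, as written after if-conversion.
static int is_upper_branchy(uint16_t ch) {
    return ch >= 'A' && ch <= 'Z';
}

// One compare: if ch < 'A', (uint32_t)(ch - 'A') wraps to a huge
// value, so the single unsigned compare rejects both out-of-range sides.
static int is_upper_folded(uint16_t ch) {
    return (uint32_t)(ch - 'A') < 26u;
}
```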

//After LoopVectorization*
int CountLetters(string text) {
    char& start = text.GetPinnableReference();
    char& end = start + text.Length;
    int count = 0;
    
    //Ideally the condition would be `start <= (end - 16)`, but that's a GC hole for `length < 16`
    for (; end - start >= 16; start += 16) {
        //Problem: implicit binop widening
        //  conv.k (binop x, y)     safe to narrow binop if both x and y are of type k  (it may be possible to propagate this info upwards for some ops)
        //  ult(x - C1) < C2        safe to narrow if x >= 0 && C2 < RangeMax(x) - C1
        var t1 = Vector256.Subtract(Vector256.LoadUnsafe((ushort&)start), Vector256.Create((ushort)'A'));
        var t2 = Vector256.CompareLessThan(t1, Vector256.Create((ushort)26));
        //Problem: widening reductions
        //  v_count += t2       this could overflow, also incorrect for floats
        count += BitOperations.PopCount(t2.ExtractMostSignificantBits());
    }
    for (; start < end; start++) {
        count += (uint)(*start - 'A') < 26;
    }
    return count;
}
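
The overflow concern with `v_count += t2` can be shown with plain scalars: a 16-bit per-lane accumulator wraps once a lane has seen 65536 matches, so the reduction must widen (or periodically flush to a scalar) before that point. Illustrative C sketch, not DistIL code:

```c
#include <stdint.h>

// Per-lane accumulation in the compare element width (v_count += t2):
// the 16-bit lane wraps after 65536 matches.
static uint32_t count_lane16(int n_matches) {
    uint16_t lane = 0;
    for (int i = 0; i < n_matches; i++)
        lane += 1;  // one match per iteration, accumulated in 16 bits
    return lane;
}

// Widened accumulation (what the movemask + popcount form achieves):
// each iteration's contribution lands in a 32-bit scalar immediately.
static uint32_t count_widened(int n_matches) {
    uint32_t count = 0;
    for (int i = 0; i < n_matches; i++)
        count += 1;
    return count;
}
```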

Closing this for now, as I have neither the interest nor the energy to keep working on it.