sampsyo/cs6120

Project 01 Proposal: A holistic review of the number of vector registers and the performance of vector code execution.

SocratesWong opened this issue · 14 comments

What will you do?
This project aims to explore the tradeoff between the number of logical vector registers and the performance of vector-based benchmarks. The project is motivated by PIM-based vector engines, which have different area tradeoffs compared to SIMD vector registers.
How will you do it?
By modifying LLVM's parameters to generate RISC-V code with different numbers of vector registers, this project aims to determine the tradeoffs between otherwise similar architectures with different numbers of vector registers.
How will you empirically measure success?
Success will be measured by reporting the tradeoff between performance and the number of registers available. The eventual goal would be to build an optimizing compiler that determines how many registers are needed, but such an ambitious task may or may not be in the scope of the project.
Team members:
@SocratesWong

Sounds good! Here is a question to answer sooner rather than later, before diving too deeply into the project: where will your benchmarks come from? I am not aware of many benchmarks that use RISC-V vector intrinsics, and auto-vectorization typically does not yield great code. So maybe something with the generic GNU vector intrinsics, if that suffices?

I currently have access to a benchmark suite that was manually vectorized using intrinsics for associative computing. But I might not be able to open-source the benchmark without the author's permission.

OK! But just to check, would "intrinsic for associative computing" work when compiling for the RISC-V vector extensions? Or would you need different annotations for that?

The intrinsics used for the associative computation are the RISC-V ISA vector extensions. Here is an example of the inner loop of matrix multiply:

vmul    ("v1", "v2", "v3");           /* v1 = v2 * v3, element-wise */
vredsum ("v2", "v1");                 /* v2 = sum of the elements of v1 */
vextract(val, "v2", "x0");            /* move the reduced value from v2 into the scalar val */
matrix_out[matrix_len*i + j] += val;  /* accumulate into the output matrix */

Example vvadd declaration:

#define vvadd(out, in0, in1) {               \
    __asm__ __volatile__(                    \
      "vvadd " out ", " in0 ", " in1 "\n\t"  \
    );                                       \
  }

OK, interesting! I guess now I am concerned about a different thing, however: this code explicitly addresses vector registers. So I don't think it's possible for the compiler to allocate different numbers of registers. Is that right? Or am I misunderstanding and v1, v2, etc. are C-level variables rather than assembly-level registers?

The commonly accepted model of ISA abstraction involves architectural and physical registers. In a modern out-of-order processor, the architectural registers have no fixed relationship to the physical registers themselves; they are an abstraction that lets the programmer program more effectively.
With the current implementation, v1 and v2 are mapped to the architectural vector registers v1 and v2, but there is nothing stopping them from being mapped to v11 and v12 (or any other of the general vector registers). If the compiler outputs code with v11 and v12, it should still execute correctly as intended.

Sure, but the compiler doesn't have control over the mapping from architectural to physical registers. So you can't remap those in LLVM without a lot of trouble. I think maybe we need to talk more about this?

If that is the case, then I think we will need to. I was thinking that in inline assembly you are allowed to use either variables or registers and the compiler will take care of it. But I can see the potential issue with it if you can't remap it in LLVM.

No. Clang does not change your inline assembly. (Nor does any other C compiler.) Inline assembly is a way to bypass the compiler.
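
To illustrate the distinction (a sketch using GCC-style extended asm and scalar RISC-V registers, purely for illustration): the compiler only chooses registers for operands you expose through constraints, while register names written literally in the template, as in the vvadd macro above, are emitted verbatim.

#include <stdint.h>

/* Hard-coded registers, like the vvadd macro: t0/t1/t2 appear literally
 * in the template, so the compiler emits them verbatim and cannot pick
 * different registers. */
static inline void add_fixed(void) {
    __asm__ __volatile__("add t2, t0, t1");
}

/* Constraint-based operands: %0/%1/%2 are placeholders that the register
 * allocator fills in, so the compiler does choose the registers. */
static inline uint64_t add_allocated(uint64_t a, uint64_t b) {
    uint64_t r;
    __asm__("add %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
}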

One option might be to use benchmarks that use the GNU vector extensions or target-specific vector intrinsics that do not rely on the programmer to do register allocation by hand. I don't know what the current state of RISC-V vector instruction support in LLVM is, but you can look into it. Or you could switch to using plain old x86 AVX-512, for example.
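
For example, with the GNU vector extensions the vectors are ordinary C values and the compiler, not the programmer, decides which vector registers they occupy. A minimal sketch, which should compile with both GCC and Clang:

#include <stddef.h>

/* A 16-byte vector of four 32-bit ints via the GNU vector_size attribute.
 * Register allocation and lowering are entirely up to the compiler. */
typedef int v4si __attribute__((vector_size(16)));

void vvadd_gnu(v4si *out, const v4si *in0, const v4si *in1, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in0[i] + in1[i];  /* element-wise vector add */
}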

What about having a post-processor and performing dataflow analysis on the output RISC-V assembly?

Could work! Two things to be aware of:

  • It would mean doing everything from scratch. You would not be able to take advantage of existing LLVM stuff—sort of like when you write for Bril.
  • Analyzing and transforming assembly directly is generally quite hard. Keeping offsets and addresses consistent, for example, can be next to impossible when you change the code. So you would definitely need to work on a restricted subset.

So if you do this, I think it's important to start very early and prototype something basic. Be sure you get a clear idea of how hard changing the vector assembly to insert spills will be, for example, before getting too deep into the implementation.
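
As a purely hypothetical starting point, even a crude scanner that reports which vector registers each line of the emitted assembly mentions would quickly tell you how regular the code is before you commit to a full def/use analysis:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical first-pass prototype: report which lines of a RISC-V
 * assembly file mention each architectural vector register v0..v31.
 * Deliberately crude (only the first occurrence per register per line);
 * a real pass would need a proper operand parser. */
int main(int argc, char **argv) {
    FILE *f = argc > 1 ? fopen(argv[1], "r") : stdin;
    if (!f) return 1;
    char line[512];
    int lineno = 0;
    while (fgets(line, sizeof line, f)) {
        lineno++;
        for (int r = 0; r < 32; r++) {
            char reg[8];
            snprintf(reg, sizeof reg, "v%d", r);
            char *p = strstr(line, reg);
            /* reject partial matches such as "v1" inside "v12" */
            if (p && !isdigit((unsigned char)p[strlen(reg)]))
                printf("line %d mentions %s\n", lineno, reg);
        }
    }
    return 0;
}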

I have been looking into intrinsics, and I think a subset of the code can be ported to the RISC-V vector intrinsics. If I can do that, do you think it would make it possible to reuse the existing LLVM infrastructure?
https://github.com/riscv/rvv-intrinsic-doc
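
For example, I think the inner loop above would look roughly like this with those intrinsics (a sketch based on the v1.0-style names in that document; I have not yet checked which intrinsics version the compilers actually support). The vectors are ordinary C variables, so the compiler would be doing the vector register allocation:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the dot product from the matrix-multiply inner loop, assuming
 * the column has already been gathered into a contiguous buffer. */
int32_t dot_i32(const int32_t *row, const int32_t *col, size_t len) {
    int32_t acc = 0;
    for (size_t i = 0; i < len;) {
        size_t vl = __riscv_vsetvl_e32m1(len - i);          /* elements in this strip */
        vint32m1_t a = __riscv_vle32_v_i32m1(row + i, vl);  /* load row slice */
        vint32m1_t b = __riscv_vle32_v_i32m1(col + i, vl);  /* load column slice */
        vint32m1_t prod = __riscv_vmul_vv_i32m1(a, b, vl);  /* like vmul above */
        vint32m1_t zero = __riscv_vmv_s_x_i32m1(0, vl);     /* reduction seed */
        vint32m1_t sum = __riscv_vredsum_vs_i32m1_i32m1(prod, zero, vl); /* like vredsum */
        acc += __riscv_vmv_x_s_i32m1_i32(sum);              /* like vextract */
        i += vl;
    }
    return acc;
}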

Here's the question to answer: when you program in that style, are you using proper C-level variables, rather than specific register indices? In other words, is the compiler responsible for doing register allocation? (Recall that your proposal is to change the compiler's register allocation algorithm, so that had better be involved.)

It is also important, of course, to check whether the compiler you want to modify has support for those intrinsics. Does LLVM support them, for example? If not, which other compiler does?