Gauging Interest in: Parallel Compute Instruction Set, Vector Processing Unit, or GPU-lite interface.
RobDavenport opened this issue · 1 comment
Hey all,
I've been looking at some interesting things about 5th-generation consoles like the N64 and PlayStation. They had special hardware built for parallel data processing, similar to a modern GPU.
Is there any interest in a series of special processing instructions? Currently, the WASM standard and runtimes support 128-bit SIMD, so 4x 32-bit integers or 4x f32 floats at once. Modern CPUs have 256-bit and even 512-bit operations, but that pales in comparison to the N64's vector hardware. We're looking at maybe 256-bit over 4-8 cores (for a modest device), which is still 32 to 64 f32 operations "at once."
According to this source, the N64 had a Vector Processing Unit with 32x 128-bit wide registers. That's 128 f32 operations at once (4x f32 * 32 registers). The PS1 also had a special Geometry Transformation Engine. I can't find exact specs on it, but I assume it's similar to a vertex shader in the modern GPU pipeline, as it worked together with the actual PS1 GPU to draw pixels to the screen.
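For reference, this is roughly what today's 128-bit WASM SIMD already looks like from Rust. A minimal sketch, assuming the module is built for `wasm32` with the `simd128` target feature enabled (e.g. `-C target-feature=+simd128`):

```rust
// Minimal sketch of existing 128-bit WASM SIMD from Rust.
// Assumes a wasm32 target with `-C target-feature=+simd128`.
#[cfg(target_arch = "wasm32")]
use core::arch::wasm32::{f32x4, f32x4_add, f32x4_extract_lane, f32x4_mul, v128};

#[cfg(target_arch = "wasm32")]
fn madd_4_lanes(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    // Pack four f32 lanes into single 128-bit values.
    let a: v128 = f32x4(a[0], a[1], a[2], a[3]);
    let b: v128 = f32x4(b[0], b[1], b[2], b[3]);
    let c: v128 = f32x4(c[0], c[1], c[2], c[3]);

    // One multiply and one add each operate on all four lanes at once.
    let r = f32x4_add(f32x4_mul(a, b), c);

    [
        f32x4_extract_lane::<0>(r),
        f32x4_extract_lane::<1>(r),
        f32x4_extract_lane::<2>(r),
        f32x4_extract_lane::<3>(r),
    ]
}
```

Four lanes per instruction is the current ceiling inside the module itself; everything below is about going wider than that on the host side.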
I think that this can be accomplished a few different ways:
Method (1) Exposing a few raw, simple large vector operations which can be run across multiple threads in the host machine
The lowest level. It would be similar to writing actual manual SIMD code like this. We'd probably need additional parameters for pointers to the memory locations being read and written. Not very easy to use as a developer, and with the added complexity of passing larger values between WASM and the host, this could get really annoying. Plus, all the individual calls (load values, add values, multiply values, and push them in and out of their memory locations) might just kill any real performance benefit. Perhaps some kind of "execution buffer" could be used to queue up many operations and reduce the module<-->host call count.
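To make that concrete, a hypothetical set of raw host imports might look something like this. None of these functions exist today; names and signatures are purely illustrative. The game writes operands into its linear memory and hands the host pointers and a length:

```rust
// Hypothetical Method 1 surface: raw "large vector" operations exposed by the host.
// None of these imports exist; names/signatures are illustrative only.
extern "C" {
    // dst[i] = a[i] * b[i] for `len` f32 values; the host could fan this out
    // across worker threads and/or wider SIMD than the module itself can use.
    fn vec_mul_f32(a_ptr: *const f32, b_ptr: *const f32, dst_ptr: *mut f32, len: u32);
    fn vec_add_f32(a_ptr: *const f32, b_ptr: *const f32, dst_ptr: *mut f32, len: u32);
}

fn scale_positions(positions: &mut [f32], scale: &[f32]) {
    assert_eq!(positions.len(), scale.len());
    unsafe {
        // Every operation is a separate module<->host call, which is exactly
        // the overhead concern described above.
        vec_mul_f32(
            positions.as_ptr(),
            scale.as_ptr(),
            positions.as_mut_ptr(),
            positions.len() as u32,
        );
    }
}
```

An "execution buffer" variant would instead record many of these operations into a command list and submit the whole thing with a single host call.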
Pros:
- Safe and not easy to desync.
- Real optimization work specific to the console itself.
- Everything in WASM.
Cons:
- Very tedious to use as a developer
- Might not even be that beneficial performance-wise, due to the heavy increase in calls between the module and the host.
- More stress on the CPU, albeit lessened by pushing some work off to idle threads.
Method (2) Exposing some kind of "fork-join" API within the console, and being able to define some code to call in parallel over a dataset.
This keeps everything in WASM land, but it would also be super easy to cause a desync (or worse) if it isn't handled correctly. WASM doesn't support read-only memory, so it would be very easy to access memory outside of the expected region. Honestly, I think this is the best choice, but it passes so much responsibility onto the developer that I'm not sure we want to open up this risk of easy desyncing. When done properly, though, I could see this being a really fun and powerful thing to experiment with, and I expect an API similar to CUDA could exist here. But it's easy to imagine how unsafe this could be...
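Here's a rough sketch of what I mean, entirely hypothetical (`parallel_for` and the `kernel` export name are invented): the game exports a kernel function, and a single host import runs it over an index range on the host's thread pool, with the module's linear memory shared by every invocation.

```rust
// Hypothetical Method 2 surface: a fork-join style host import.
// `parallel_for` and the exported `kernel` are invented for illustration.
extern "C" {
    // Calls the module's exported `kernel(start, end)` on host worker threads,
    // splitting 0..len into chunks. Returns once every chunk has finished.
    fn parallel_for(len: u32);
}

static mut INPUT: [f32; 1024] = [0.0; 1024];
static mut OUTPUT: [f32; 1024] = [0.0; 1024];

// The host calls back into this export with a sub-range to process.
// Nothing stops a buggy kernel from writing outside its range or racing
// another chunk -- that's the desync/safety risk described above.
#[no_mangle]
pub extern "C" fn kernel(start: u32, end: u32) {
    for i in start..end {
        unsafe {
            OUTPUT[i as usize] = INPUT[i as usize] * 2.0;
        }
    }
}

fn run() {
    unsafe { parallel_for(1024) };
}
```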
Pros:
- Everything in WASM
- Simpler exposure to multithreading compared to actual threads in C or Rust
Cons:
- Super easy to desync and fail silently
- No way to enforce safety across parallel calls
Method (3) Exposing compute shaders, or a simplified GPU pipeline accessible and configurable from the game itself.
This is kind of the "easy way out." It would require devs to write their own shaders, which could be typical vertex, fragment, or compute shaders. Alternatively, Gamercade could provide a subset of shaders/shaders-as-functions to be called over a dataset. This method is still quite open-ended, as I'm personally not too experienced with modern graphics APIs like wgpu or Vulkan.
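For the "shaders-as-functions" flavor, the game-facing API could stay very small. Something like the following, which is entirely hypothetical (the real shape would depend on how wgpu is wired up on the host side):

```rust
// Hypothetical Method 3 surface: the host owns the GPU pipeline and exposes a
// small set of built-in "shader" dispatches. Names are illustrative only.
extern "C" {
    // Upload a buffer of f32 data (e.g. x, y, z triples) to GPU-visible
    // memory, returning a handle the game can reuse across frames.
    fn gpu_upload_f32(ptr: *const f32, len: u32) -> u32;

    // Run a built-in "transform" pass: out = matrix * in for every vertex in
    // the buffer. The 4x4 matrix is read from the given pointer.
    fn gpu_transform_vertices(buffer_handle: u32, matrix_ptr: *const f32);

    // Copy results back into the module's linear memory.
    fn gpu_read_f32(buffer_handle: u32, dst_ptr: *mut f32, len: u32);
}

fn transform(vertices: &mut [f32], mvp: &[f32; 16]) {
    unsafe {
        let handle = gpu_upload_f32(vertices.as_ptr(), vertices.len() as u32);
        gpu_transform_vertices(handle, mvp.as_ptr());
        gpu_read_f32(handle, vertices.as_mut_ptr(), vertices.len() as u32);
    }
}
```

The "write your own shaders" variant would replace the built-in transform pass with a shader the dev supplies, which is where something like wasm2spirv (mentioned below) could come in.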
Pros:
- Potential exposure to writing shaders and actual GPU code.
- Real performance increases, unlike the others, which heavily abuse the CPU.
- Make additional use of the GPU, which is currently only used for pixel-perfect scaling.
Cons:
- GPUs today are super-duper powerful and this could easily warp the entire project if not done carefully.
- The developers who can make use of this feature may as well write their own 3d game engines 😂
Method (4) Something else...
The three methods listed above aren't completely researched, and I'm sure there are ways to implement some of them that solve the larger issues. For example, there is a wasm2spirv crate which compiles WASM code into shader code, which could greatly benefit method 3. I'm personally a big fan of method 2, but the lack of safety and desync prevention makes me hesitant.
Closing this in favor of upcoming 3d discussion post.