mn416/QPULib

General Discussion


This is an issue for discussing general things.

With respect to command line handling for Rot3DLib:

  • Even the 'lite' version of the code is pretty hefty; it's going to add 6 classes to Rot3DLib (the current 'full' version has 11, with more code in them). However, I'm planning to add a lot of options for the Rot3DLib app, and I believe robust handling of the command line is important. So please bear with me.

  • I do regard the command line handling as a separate project, of which a snapshot is added to the QPULib project. It will be released with an MIT license (I'll change the QPULib license for that).

I hope you can approve of this.

@mn416 Upon checking in, I happened to notice the branch explicit-vpm and of course I had to peek.

Looks like you're enabling VPM and DMA. Now, that is cool.
However, it also looks like you're removing the previous code for direct memory access. Is this correct? If so, please don't do that. It has its uses.

mn416 commented

Hi @wimrijnders,

I hope you can approve of this.

Yes, sounds good.

However, it also looks like you're removing the previous code for direct memory access. Is this correct? If so, please don't do that. It has its uses.

The new version should be completely backwards compatible with the current stable version (when it works, almost there). However, the new DMA and VPM functions can break the dereferencing operator when they are used. I will have to document this, but basically programs that use explicit VPM/DMA should not use the dereferencing operator (we can add a compiler check to make sure this constraint is met).

mn416 commented

By the way, I had to disable the platform detection code because it said my Pi was not a Pi :). No doubt I am on a very old kernel, but I guess there are a lot of Pi's out there on old kernels. I wonder if a better detection method would simply be to check if the VideoCore header files are present?

The new version should be completely backwards compatible with the current stable version

OK Thanks for easing my mind. Good to hear.

I had to disable the platform detection code because it said my Pi was not a Pi

Do you know off the top of your head which Pi you have?
The check is simple:

cat /sys/firmware/devicetree/base/model

Please run this and see if it returns anything. In the meantime, I'll investigate if this is version dependent.

EDIT: Yes, it is version dependent. Only the later versions of Pi support this. 😞 Perhaps it's dependent on the distro version only; in any case, this needs fixing because it should always work, not just on the newfangled Pi's.

The 'correct' way to do it is to determine the hardware revision number ( get_version() in Mailbox) and concoct the string from that. I was trying to avoid that.

Hardware revision numbers are totally unique per pi-version. So, what I can do, is use firmware as above as fallback for the cat[2]. This implies that the script needs to know all the model names of the Pi's anyway[1].


If you don't mind, please give me your distro version as well, to see how old it is.
E.g. this is my Pi 2:

> lsb_release -a
No LSB modules are available.
Distributor ID: Raspbian
Description:    Raspbian GNU/Linux 9.4 (stretch)
Release:        9.4
Codename:       stretch
  • I've updated all my Pi's to the very latest release possible. This is as up to date as it gets.

Can you check if cat /proc/cpuinfo returns something for you? There's a field Revision which is the desired version number. I think this command is a safe option for old versions as well.


[1] And this implies that I can do away with the cat altogether....shame, it was such an elegant approach.
[2] This can't be done via mailbox for obvious reasons. Will look for other methods

I wonder if a better detection method would simply be to check if the VideoCore header files are present?

You mean the bcm stuff? It's a trivial matter to copy these to a non-Pi machine, should you feel so inclined. It's not fool-proof enough.

Also, theoretically you could be running on a BCM platform which is not a Pi......actually that would be no problem at all. Never mind.

mn416 commented

It's a Pi 1, Model B

> cat /sys/firmware/devicetree/base/model
cat: /sys/firmware/devicetree/base/model: No such file or directory
> lsb_release -a
-bash: lsb_release: command not found

-bash: lsb_release: command not found

Ouch. If this doesn't work, your distro is old.

Never mind. See instead if cat /proc/cpuinfo works for you.

The relevant lines for the Pi 2:

Hardware        : BCM2835
Revision        : a01041

You should have something similar.
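For reference, the /proc/cpuinfo fallback discussed above could look roughly like the sketch below. This is not the detectPlatform code from the PR; the function names (findCpuinfoField, parseRevision, isNewStyleRevision, boardType) are made up for illustration. The bit layout used by the last two helpers (bit 23 marks a new-style code, bits 4-11 hold the board type) is an assumption taken from Raspberry Pi's published revision-code scheme, not from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>

// Returns the value of the first "name : value" line in cpuinfo-style text,
// or std::nullopt when the field is absent.
std::optional<std::string> findCpuinfoField(std::istream& in, const std::string& name) {
    assert(!name.empty());
    std::string line;
    while (std::getline(in, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        // Strip the padding between the key and the colon.
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);
        if (key != name) continue;
        const auto start = line.find_first_not_of(" \t", colon + 1);
        return start == std::string::npos ? std::string() : line.substr(start);
    }
    return std::nullopt;
}

// Parses the hexadecimal Revision value, e.g. "a01041" -> 0xa01041.
std::optional<std::uint32_t> parseRevision(const std::string& text) {
    if (text.empty()) return std::nullopt;
    try {
        std::size_t used = 0;
        const unsigned long value = std::stoul(text, &used, 16);
        if (used != text.size()) return std::nullopt;
        return static_cast<std::uint32_t>(value);
    } catch (const std::exception&) {
        return std::nullopt;  // not a hex number, or out of range
    }
}

// ASSUMPTION: bit 23 set means a new-style revision code.
bool isNewStyleRevision(std::uint32_t revision) { return ((revision >> 23) & 1u) != 0; }

// ASSUMPTION: bits 4-11 of a new-style code hold the board type.
std::uint32_t boardType(std::uint32_t revision) { return (revision >> 4) & 0xffu; }
```

On a real Pi the stream would be a std::ifstream opened on /proc/cpuinfo. A missing Revision field then simply yields std::nullopt, which the caller can treat as "not a Pi" or use to try another detection method.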

@mn416 Have fix ready for detectPlatform. Please confirm that cat /proc/cpuinfo works for you.

Edit: Never mind, I created the PR. Please check if both platform detection scripts work for you now.

mn416 commented

Hi @wimrijnders,

The new explicit VPM/DMA is finally in development. I need to update the README and add a more interesting example, but these are unlikely to conflict with anything you are doing.

Yay! Was looking a bit forward to it. Will see if I can test it.

  • Any chance of a Rot3D kernel that uses it? Perhaps there's one already?
  • Would you appreciate a review? I consider it good practice, always.

Thanks for pushing through. I understand you don't have much time for pet-projects, your effort is appreciated. This also enables me to push further PR's

Aside, I'm trying hard to not buy a Pi Zero right now. You should imagine the pull I feel every time I pass the electronics outlet. I sort of want to get the collection complete: got 1 2 3, zero should be on the list.

Never imagined computers could be collectibles, like Pokemon cards!

Skipping the Compute Modules though, even if it's cutesy small. You need an expensive peripheral setup to program it at all. Perhaps one day (you only need to get the dev-kit once I would think).

This looks relevant and interesting to me : Introduction to compute shaders. I'll take the time soon to read it in detail.

Still don't understand the term 'shader' though. I see no difference with 'kernel' in this project.

@mn416 Any chance of setting up a Rot3D kernel with the new DMA stuff?

I ask for two reasons:

  • to add it to Rot3DLib for future profiling of all kernels
  • to add it to the documentation as the next step in evolution of the kernel (perhaps it's a major improvement on the previous, who knows?)

Perhaps this is something I could attempt myself?

@mn416 New DMA example: working in QPU as well as emulation mode. Great work!

Now, is it possible to overlap DMA read/writes while computing internally? My intuition says 'of course', but it is not apparent from the example.

I would appreciate if you also make an example to (explicitly) show how to overlap DMA with computation. Just a thought.

EDIT: Also appreciate the commenting in the example. It is just verbose enough to let the code make sense.

EDIT2: Now, is it also possible to overlap DMA with direct memory access? Or would you regard this as an 'exercise for the reader', i.e. me?

mn416 commented

Hi @wimrijnders,

Now, is it possible to overlap DMA read/writes while computing internally?

Yep, that is indeed possible.

is it also possible to overlap DMA with direct memory access?

By "direct memory access", I assume you mean the dereferencing operators. Unfortunately not. The compiler would need to be cleverer for that.

I would appreciate if you also make an example to (explicitly) show how to overlap DMA with computation. Just a thought.

Yes, I'm hoping to do this. I'd like a compelling example of these new features. Maybe matrix multiplication, or a perceptron / multi-level perceptron. Still deciding...

I think Rot3D is not a good example of the new features, because it's basically memory bound (not a good compute to memory ratio). That's why we don't see a great speedup over the ARM in this example.

One of the main features still missing from the library is the ability to use sub-word operations. For example, you can split a 32-bit word into 4 bytes and do 64-way vector operations -- that certainly improves the compute to memory ratio. Low-precision arithmetic is becoming very popular in neural nets.
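To make the 64-way idea concrete, here is a host-side sketch of byte-wise addition on packed 32-bit words: 16 words times 4 bytes gives 64 lanes. It only illustrates the arithmetic with a plain bit trick; it does not use the VideoCore's own packing hardware, and addBytes / addBytes64 are hypothetical names, not QPULib functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;                 // words per vector
using Vec16 = std::array<std::uint32_t, kLanes>;   // 16 words = 64 packed bytes

// Adds the four bytes packed in each word independently (each byte wraps
// modulo 256). The low 7 bits of every byte are added normally; the top bit
// is patched in with XOR so no carry leaks into the neighbouring byte.
constexpr std::uint32_t addBytes(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu);
    return low7 ^ ((a ^ b) & 0x80808080u);
}

// 0xff + 0x01 wraps to 0x00 without touching the byte above it.
static_assert(addBytes(0x01ff7f10u, 0x01010101u) == 0x02008011u, "byte lanes must not interact");

// 64-way add: apply the packed add to every word of the vector.
Vec16 addBytes64(const Vec16& a, const Vec16& b) {
    Vec16 out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) out[lane] = addBytes(a[lane], b[lane]);
    return out;
}
```

The memory traffic is the same as for a 16-way 32-bit add, but four times as many values are processed per load, which is where the better compute to memory ratio comes from.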

I'd like a compelling example of these new features

I'm considering implementing a Discrete Fourier transform (not FFT) for VideoCore. It's actually a reason why I came here in the first place. Plenty of calculation required there, perhaps it would be a better example.

I understand your concerns about Rot3D, you mention it in the README examples. OTOH, if you manage to speed that up something awesome, it would be a very compelling example indeed.

By "direct memory access", I assume you mean the dereferencing operators.

I just mean the 'old' style of loading data from/to main memory. I see no technical reason why this couldn't be combined with DMA. But you're suggesting that the Lib-code can't handle this right now, correct?

For example, you can split a 32-bit word into 4 bytes and do 64-way vector operations

In the reference doc eg. page 57 onwards, I read that you can also use 8 and 16 bit elements, that would certainly allow you to pack more data into the calculation at the price of precision. Not sure how this lowered precision would be useful though, I'm too much of an exact thinker to appreciate it.

Another example I would like to see running is a Mandelbrot set calculation. That should be really effective on the QPU's, since basically you're looping over a limited set of values. My understanding of the DSL is just too limited to be able to implement this (and DFT), hoping to get up to par soon.

@mn416 Trying to chart the memories available to a QPU and further. This is what I got till now:

[Diagram: memory_qpu]

Does this look OK to you? Any obvious things I've got wrong?

Questions on this:

  • Do you know how deep the 'receive FIFO' is? Is it also 16-vector wide?
  • The reference document mentions registers s, t, r and b. Can you tell me where to put them in the diagram? I have no idea.

Thanks.

mn416 commented

Another example I would like to see running is a Mandelbrot set calculation

Excellent idea. This would be a nice example and should be reasonably straightforward. Each pixel is computed independently, so I imagine it will look a bit like the GCD example. You shouldn't need DMA for this example, the store function should do.
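As a starting point for such an example, a plain scalar version of the per-pixel computation might look like the sketch below. It is ordinary host C++, not QPULib DSL code; the function names and the fixed view window of [-2, 1] x [-1, 1] are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of iterations of z = z*z + c (starting from z = 0) performed
// before |z| exceeds 2, capped at maxIter.
int mandelbrotIterations(float cRe, float cIm, int maxIter) {
    assert(maxIter >= 0);
    float re = 0.0f;
    float im = 0.0f;
    int i = 0;
    while (i < maxIter && re * re + im * im <= 4.0f) {
        const float nextRe = re * re - im * im + cRe;
        im = 2.0f * re * im + cIm;
        re = nextRe;
        ++i;
    }
    return i;
}

// Every pixel depends only on its own coordinate, so the two loops below
// could be split over QPUs and vector lanes in any order.
std::vector<int> mandelbrotImage(int width, int height, int maxIter) {
    assert(width > 1 && height > 1);
    std::vector<int> out(static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float cRe = -2.0f + 3.0f * x / (width - 1);   // maps x to [-2, 1]
            const float cIm = -1.0f + 2.0f * y / (height - 1);  // maps y to [-1, 1]
            out[static_cast<std::size_t>(y) * width + x] = mandelbrotIterations(cRe, cIm, maxIter);
        }
    }
    return out;
}
```

The data-dependent loop exit is the part that needs care in a 16-lane version, since lanes that have already escaped must stop updating while the others continue.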

mn416 commented

The diagram seems ok, but is it not better to leave this level of detail to the manual? My understanding is that only 4KB of the VPM is available to the QPUs for general use. Also the regfiles are 32x16x4=2KB in size. Almost the size of the VPM, but crucially the VPM is shared so data can be loaded once (expensive) and used many times (cheap). Not sure about s,t,r,b.

From experimentation, I believe the receive FIFO is 4 elements deep, but I may be missing a setting that makes it 8 elements deep.

Thanks for answering. There are details which I'm still struggling to understand. I hope you can enlighten me.

The diagram seems ok, but is it not better to leave this level of detail to the manual?

I would love to agree with you, but I found that the overview diagram is incomplete. There are elements not drawn in there which are mentioned in the text. Notably:

  • There are 2 TMU's per slice, not one.
  • FIFO's between TMU and QPU
  • TMU's have a Level 1 cache
  • the srtb registers, which must be somewhere

My diagram is an attempt to get them in view, for my better understanding. In addition, I want to know the actual sizes of the memory elements, mostly not mentioned in the document.

I don't want to draw the whole thing, just the memory parts that are relevant to a QPU.


My understanding is that only 4KB of the VPM is available to the QPUs for general use

Yes, page 53:

...it is practical for some of the memory to be reserved for general-purpose processing whilst 3D is operating so long as at least 8Kbytes is left for 3D use.

So 12KB - 8KB (reserved) = 4KB.

From reading, this is because capacity is reserved for automatic execution of various shader types. I would like to point out that it's possible to disable the special shaders (fragment, vertex, coordinate) and run user programs only, see page 89. My hope (note emphasis) is that disabling the special shaders will free more capacity in the VPM for general use.


Also the regfiles are 32x16x4=2KB in size.

For the 32:

Page 17:

The register space associated with each of the A and B regfiles can address 64 locations, 32 of these are backed by the physical registers while the other 32 are used to access register-space I/O.

Now, the mapped registers are shown in table 14 on page 37. If you look at any specific register definition, eg TLB_STENCIL_SETUP on page 49, you see that these registers are 32-bits.

It follows in my thinking that the general-purpose registers are also 32-bits, otherwise the mapping is wonky. While it may be possible, I really cannot imagine a memory scheme where half of the addresses are for 64-byte registers and the other half are for 4-byte registers. Perhaps I'm missing something here? If you have further documentation for this, please share.

(Even then, the 64B is wrong. It should be 32x4 = 128B)


From experimentation, I believe the receive FIFO is 4 elements deep, but I may be missing a setting that makes it 8 elements deep.

There is; it depends on whether threads are enabled or not. This is the reason that I asked #41.

page 40:

multi-threaded shaders must be careful to use only 1/2 the FIFO depth before reading back.

So, if you can guarantee that the kernel running is not multi-threaded, you can use all 8 elements of the FIFO.

As you can see, I have a lot to learn about QPU's. I really hope you don't mind if I discuss the hardware stuff with you.

Addendum:

From experimentation...

This bothers me; it should have been exactly specified in the reference doc's. It's not the only thing that is vague.


Page 39, "QPU Interface", says that there are 8 slots in the receive FIFO for color data.

Color data is then defined as 32-bits (RGBA8888), meaning that a FIFO would only be half a 16-vector big. This can't be right.

The logical assumption to make is that a slot contains a 16-vector of color data. But I'm struggling to find proof of this in the document. I keep on re-reading this part, I find the language confusing and ambiguous.


On regfile elements, "Thread Control" p.20:

When the QPU is executing a second hardware thread, the upper and lower 16 locations of each physical regfile are swapped by inverting address bit 4. This splits each regfile to provide 16 vectors of local thread storage.

Two things about this:

  • Multi-threading has effect on the regfile size as well. If multi (actually bi) threading can truly be disabled, this doubles the regfile size. I don't see a reason to use multi-threaded kernels, do you?
  • Assuming that 'vector' means '16 32-bit values' like elsewhere in the document, then indeed the memory elements are 16x4 bytes wide. I'm confused, how does this match with the 32-bit memory-mapped registers?
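The address swap described in the quote from p.20 can be written down in a couple of lines. This is only a model of that one sentence (invert address bit 4 for the second thread); the name physicalRegIndex is hypothetical.

```cpp
#include <cassert>

constexpr unsigned kRegfileEntries = 32;  // physical locations per regfile

// Maps the register address used by a program to the physical location.
// For the second hardware thread, address bit 4 is inverted, which swaps
// the lower 16 and upper 16 locations.
inline unsigned physicalRegIndex(unsigned address, bool secondThread) {
    assert(address < kRegfileEntries);
    return secondThread ? (address ^ 0x10u) : address;
}
```

If both threads restrict themselves to addresses 0-15, the first lands in physical locations 0-15 and the second in 16-31, which is the 16 vectors of thread-local storage the document mentions. A single-threaded program can use all 32 addresses.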

On VPM size, page 53:

From the QPU perspective the window into the locally allocated portion of the VPM is a 2D array of 32-bit words, 16 words wide with a maximum height of 64 words.

So yes, 4KB if the window can't be changed. I get the impression that the 12KB can't ever be accessed fully; it's something that I'll just have to accept.
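The sizes quoted so far can be checked against each other with a few constants. The numbers are the ones from this thread and the quoted pages; the row-major addressing in vpmWordIndex is an assumption for illustration, as are all the names.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWordBytes = 4;                 // 32-bit words
constexpr std::size_t kVectorWords = 16;              // one 16-vector
constexpr std::size_t kVpmTotalBytes = 12 * 1024;     // whole VPM
constexpr std::size_t kVpmReservedBytes = 8 * 1024;   // kept for 3D use (p.53)
constexpr std::size_t kVpmWindowRows = 64;            // maximum window height

// 16 words wide x 64 rows x 4 bytes = 4096 bytes.
constexpr std::size_t kVpmWindowBytes = kVectorWords * kVpmWindowRows * kWordBytes;
static_assert(kVpmWindowBytes == kVpmTotalBytes - kVpmReservedBytes,
              "window size matches 12KB - 8KB");

// mn416's regfile figure: 32 locations x 16 lanes x 4 bytes = 2048 bytes.
constexpr std::size_t kRegfileBytes = 32 * kVectorWords * kWordBytes;
static_assert(kRegfileBytes == 2048, "32x16x4 = 2KB");

// ASSUMPTION: row-major word offset inside the 16 x 64 window.
inline std::size_t vpmWordIndex(std::size_t row, std::size_t column) {
    assert(row < kVpmWindowRows && column < kVectorWords);
    return row * kVectorWords + column;
}
```

So the 64-row window and the "12KB minus 8KB" reading give the same 4KB, which suggests they describe the same limit.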

What about registers r0-r5, can they be regarded as 16-vectors as well? The docs state that r0-r3 deal with 32-bit values only (p.18).

This is my current hypothesis on how the QPU works as a 16-way SIMD device, perhaps you can confirm:

There are 16 distinct states within the QPU, one for each value of a 16-vector processed.
Internally, a state deals with only 32-bit values, but there are 16 states.

A given register value can thus be viewed as a stack of 32-bit values, 16 deep,
where each value is processed independently.

Does this make sense? Hoping for corrections or confirmation.
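Written as code, the hypothesis above amounts to the following model: a register is an array of sixteen 32-bit values, and an instruction applies the same scalar operation to each element independently. This is a sketch of the mental model only, not of how the hardware schedules the lanes; Vec16, mapLanes, addVec and laneIndex are made-up names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;
using Vec16 = std::array<std::int32_t, kLanes>;  // one register: 16 x 32-bit

// Applies a scalar 32-bit operation to every lane; no lane sees another
// lane's value.
template <typename Op>
Vec16 mapLanes(const Vec16& a, const Vec16& b, Op op) {
    Vec16 out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) out[lane] = op(a[lane], b[lane]);
    return out;
}

// One "instruction": a 16-way add.
Vec16 addVec(const Vec16& a, const Vec16& b) {
    return mapLanes(a, b, [](std::int32_t x, std::int32_t y) { return x + y; });
}

// A vector holding the lane numbers 0..15.
Vec16 laneIndex() {
    Vec16 out{};
    for (std::size_t lane = 0; lane < kLanes; ++lane) out[lane] = static_cast<std::int32_t>(lane);
    return out;
}
```

Under this model a statement about "32-bit values" and a statement about "16-vectors" are both true: each lane is 32 bits wide, and there are 16 of them per register.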

Another example I would like to see running is a Mandelbrot set calculation

Excellent idea.

I'm glad you agree. I'm itching to make this, or at least give it a start.

@mn416 I've been thinking about a good showcase for small-value integers, something you mentioned previously you want to implement.

I think something with cellular automata would be suitable. These usually deal with small values only. It would be nice, however, to be able to show every step while running. I've just spent some time in the garden ruminating about how to do this.

Something like your HeatMap example, but as a cellular automaton.

I realize that this is long-term thinking

mn416 commented

I think something with cellular automata would be suitable.

Excellent idea. If we pick Game of Life then we just need 1 bit per cell and can probably just do bit-wise operations on 32-bit values to implement the state transition function, i.e. treat a 16-word vector as a 512-bit vector. This also sounds like a good example to demonstrate the new DMA features.
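A host-side sketch of that idea, assuming a board that is exactly one 32-bit word wide with dead cells outside the border: the eight neighbour bitmaps are summed with a bit-sliced counter, so all 32 cells of a row are updated with a handful of AND/XOR operations. The function name lifeStep and the board layout are assumptions for illustration; extending it to a 512-bit row would also need the shifts to carry bits across word boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes one Game of Life generation. Each element of rows holds 32 cells,
// one bit per cell.
std::vector<std::uint32_t> lifeStep(const std::vector<std::uint32_t>& rows) {
    std::vector<std::uint32_t> next(rows.size());
    for (std::size_t r = 0; r < rows.size(); ++r) {
        const std::uint32_t above = r > 0 ? rows[r - 1] : 0u;
        const std::uint32_t cur = rows[r];
        const std::uint32_t below = r + 1 < rows.size() ? rows[r + 1] : 0u;
        // The eight neighbours of every cell, aligned onto the cell's own bit.
        const std::uint32_t neighbours[8] = {
            above << 1, above, above >> 1,
            cur << 1,          cur >> 1,
            below << 1, below, below >> 1};
        // Bit-sliced 3-bit counter: (s2 s1 s0) is the neighbour count of each
        // cell modulo 8. A count of 8 wraps to 0, which is also a dead outcome.
        std::uint32_t s0 = 0, s1 = 0, s2 = 0;
        for (const std::uint32_t n : neighbours) {
            const std::uint32_t carry0 = s0 & n;
            s0 ^= n;
            const std::uint32_t carry1 = s1 & carry0;
            s1 ^= carry0;
            s2 ^= carry1;
        }
        // Alive next step if count == 3, or count == 2 and currently alive.
        next[r] = s1 & ~s2 & (s0 | cur);
    }
    return next;
}
```

A quick check is the blinker: three cells in a horizontal line flip to a vertical line and back on the next step.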

Yes. But Game of Life is so boring.....

Ooh, is 1 bit also possible? I thought 8-bit was the minimum. I suppose the 1-bit handling can be implemented within the kernel.

And then just for kicks make a giant game of life board!

I'm currently thinking over two things:

  • a very simple graphics viewer, to display QPU output in a nice view. Would you be able to live with something based on Qt? Free for non-commercial projects.
  • kernel scheduling. The VideoCore FFT application already has the basics for this (been reading #9). The basics are not very hard, a simple start can be made. Although I can predict already that a robust implementation is going to be much more extensive.

Please note, 'just thinking' and a bit of research. Not going to attempt these any time soon.

@mn416 Ping. #66 is the last big thing I'll do before going on vacation (leaving on the 26th).

Any chance of reviewing the pending PR's before that? I'd like to have some closure before leaving for vacation.

Wrt graphic viewer (this is all still speculative):

I was considering Qt because it's C++ based. However, Qt Creator is a 5.7GB download[1], which I consider fatal overhead for just compiling a 'simple' graphics viewer.

Instead, I've been looking at what a Pi can offer out of the box wrt GUI programming. A good candidate appears to be python with the Tkinter library.

So I imagine a python graphic front-end which can interface with a c++ back-end, and which can display the result - think of Mandelbrot. The nice thing about that is that it works on a Pi without any further installation required.

A python <-> c++ binding is doable, I've done it before. You wouldn't happen to have python experience, would you?

(Note that this is still all vaporware - just thinking out loud)


[1] I know this because I just upgraded to the latest version of Qt Creator

multi-threaded shaders must be careful to use only 1/2 the FIFO depth before reading back.

So, if you can guarantee that the kernel running is not multi-threaded, you can use all 8 elements of the FIFO.

No, you can't. The FIFO is actually 8-deep, but it is used for both request and receiving, so you can stack up to 8/2=4 requests to the FIFO even if a kernel is single-threaded.

Thanks @Terminus-IMRC for answering.

However: The documentation makes a clear distinction between request and receive FIFO.

VideoCore reference documentation, page 39:

Each TMU has associated with it a ‘request’ (TFREQ) and ‘receive’ (TFRCV) FIFO per QPU

Although I must say that elsewhere the text is open to interpretation. Also, it wouldn't be the first time I detected inconsistencies in the document.

In this case, I would say that experience trumps whatever is written in the documentation. So I'll seriously keep your comment in mind.


EDIT: It doesn't make sense that a FIFO could be bidirectional, by definition.

Also, assuming it's a single FIFO, the input/output length depends on how you use it. E.g. you might not read anything into the QPU and output 8 result vectors.

EDIT2: Removed brainfart in previous EDIT.

Never mind. I see your point. I think you're talking about data only.

  • FIFO TFREQ stores requests from the QPU to the TMU for data.
  • FIFO TFRCV supplies the requested data from the TMU to the QPU.

You're talking about TFRCV only. That makes more sense.
I still don't see how a FIFO can work in two directions at the same time, though.

mn416 commented

Any chance of reviewing the pending PR's before that?

I'd like to get #66 merged soon, yes. Just need to sort out #52 first, which could be done simply by conditionally including/excluding RegisterMap.cpp as part of the library. For now, the default could be to exclude it, but we can flip that in future once the benefits outweigh the costs.

@mn416 I've changed the makefile for #52 so that it skips the bcm-headers by default. Please try that first before excluding RegisterMap.

Sorry, bad wording by me... It seems that there are 2 FIFOs and 8 entries on a TMU, and 4 are used for request FIFO and the other 4 are used for receive FIFO.
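In practice this means a kernel should never have more than 4 requests outstanding per TMU before it starts receiving. The toy model below shows the resulting request/receive pattern: fill the queue, then alternate one receive with one new request. It is a software illustration under the 4-entry assumption from this thread, not a model of the hardware; TmuModel and pipelinedSum are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy TMU: tracks outstanding requests and refuses more than kDepth of them.
class TmuModel {
public:
    static constexpr std::size_t kDepth = 4;  // ASSUMPTION: usable depth per the thread

    explicit TmuModel(const std::vector<float>& memory) : memory_(memory) {}

    bool canRequest() const { return inFlight_.size() < kDepth; }

    // Queues a read of memory[index]; the data arrives in request order.
    void request(std::size_t index) {
        assert(canRequest());
        assert(index < memory_.size());
        inFlight_.push_back(index);
    }

    // Returns the data of the oldest outstanding request.
    float receive() {
        assert(!inFlight_.empty());
        const float value = memory_[inFlight_.front()];
        inFlight_.pop_front();
        return value;
    }

private:
    const std::vector<float>& memory_;
    std::deque<std::size_t> inFlight_;
};

// Sums an array while keeping the request queue as full as allowed.
float pipelinedSum(const std::vector<float>& data) {
    TmuModel tmu(data);
    std::size_t next = 0;
    // Prologue: issue requests until the queue is full or the data runs out.
    while (next < data.size() && tmu.canRequest()) tmu.request(next++);
    float sum = 0.0f;
    for (std::size_t done = 0; done < data.size(); ++done) {
        sum += tmu.receive();                        // frees one slot
        if (next < data.size()) tmu.request(next++); // refill it right away
    }
    return sum;
}
```

The assertion in request() is the interesting part: it fires as soon as a loop tries to queue a fifth request without receiving first.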

Been examining the emulator code to understand how memory reads and writes work.

@mn416 @Terminus-IMRC is the following correct? Very much helicopter view.

EDIT: Following is the case for gather/receive calls. Direct reads also go through the VPM

Read data

  • uses TMU
  • Load request FIFO using mapped registers TMU[01]_[STRB] (read as a regex)
  • When request done, load data from response FIFO into QPU via acc r4

Write data:

  • Uses VPM
  • A Store request to VPM is prepared
    • This waits for any pending DMA store to complete,
    • Then starts its own DMA store

@mn416 I might have an actually useful application for QPULib: Goertzel transform. This is an alternative calculation for FFT.

I've been thinking about it. This transform can be parallelized something awesome, much better than FFT. Will get back on this after my vacation.

mn416 commented

I might have an actually useful application for QPULib: Goertzel transform. This is an alternative calculation for FFT.

I don't know much about that domain, but more QPULib applications/examples will definitely make me happy :)

Hi there, currently on my way to France for vacation.

I believe that, because the Goertzel transform parallelizes well, it should be possible to obtain the full effect of 12x16 SIMD concurrency. This will be a killer application for QPU IMHO.

Also to note, it can be made compatible with Fourier. I'm truly excited about this. But it will have to wait till I get back from camping :-), August 12.

Please note that the Goertzel transform is actually a Goertzel filter to search for a specific frequency in a signal. When it comes to QPU-FFT, consider a look at http://www.aholme.co.uk/GPU_FFT/Main.htm

@mn416 Hereby checking in, showing a sign of life..
I've been back a week or two now from vacation, but got stuck in my day job first. Hoping to return to QPULib soon, looking forward to it.

mn416 commented

Hi @wimrijnders,

Glad to hear it. Unfortunately, I've also been too busy recently to make any further progress on the development branch.

@mn416 Yeah, that makes two of us. That's OK, the project won't run away any time soon and I'm still interested in progressing the state of the art. We'll get back here eventually. Good luck with whatever you're doing!

Is there a way to estimate the performance of a compiled kernel? By calling:

    // Encode target instrs into array of 32-bit ints
    Seq<uint32_t> code;
    encode(&targetCode, &code);

in emulation mode also, I can at least see how many target instructions there are, but I'm unsure how this correlates to code execution time.

mn416 commented

Hi @robiwano,

Not at present. It should be straightforward to extend the emulator to count the number of instructions executed. Of course, this will not account for the memory access cost.

Matt

I have now a working complex MAC function (complex values are interleaved floats re/im):

void gpu_cmac(Int n, Ptr<Float> m, Ptr<Float> a, Ptr<Float> b, Ptr<Float> acc)
{
    Int inc          = numQPUs() << 4;
    Float rm         = *m;
    Float rm_inv     = Float(1.0f) - rm;
    Ptr<Float> p_a   = a + index() + (me() << 4);
    Ptr<Float> p_b   = b + index() + (me() << 4);
    Ptr<Float> p_acc = acc + index() + (me() << 4);
    gather(p_a); gather(p_b); gather(p_acc);
    Float ra, rb, racc;
    For(Int i = 0, i < n, i = i + inc)
        gather(p_a + inc); gather(p_b + inc); gather(p_acc + inc);
        receive(ra); receive(rb); receive(racc);
        Float re_1   = ra * rb;
        Float im_1   = rotate(ra, 15) * rb;
        Float im_2   = rotate(ra, 1) * rb;
        re_1         = re_1 - rotate(re_1, 15);
        im_1         = im_2 + rotate(im_1, 1);
        Float result = im_1 * rm + re_1 * rm_inv;
        store(result + racc, p_acc);
        p_a   = p_a + inc;
        p_b   = p_b + inc;
        p_acc = p_acc + inc;
    End
    receive(ra); receive(rb); receive(racc);
}

plus I added the spin-to-completion functionality of GPU_FFT to avoid the mailbox overhead, and with it it is a lot faster than the reference code. However I would like to be able to process say 4 batches of 512 complex MACs accumulating to a single 512 complex accumulator, and I have no idea how to express that with QPULib :)

Is it possible to have Ptr<Ptr<Float>> as parameter to a kernel? :)
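For checking GPU results against the host, a scalar reference for the batched case could look like the sketch below. It uses the conventional definition of a complex multiply-accumulate on interleaved re/im floats and is not a line-by-line transcription of the kernel above; cmac and cmacBatches are hypothetical names, and it does not answer how to express the batching in QPULib.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// acc += a * b for complex values stored as interleaved floats (re, im, re, im, ...).
void cmac(const std::vector<float>& a, const std::vector<float>& b, std::vector<float>& acc) {
    assert(a.size() == b.size() && a.size() == acc.size() && a.size() % 2 == 0);
    for (std::size_t i = 0; i < a.size(); i += 2) {
        const float aRe = a[i], aIm = a[i + 1];
        const float bRe = b[i], bIm = b[i + 1];
        acc[i]     += aRe * bRe - aIm * bIm;  // real part
        acc[i + 1] += aRe * bIm + aIm * bRe;  // imaginary part
    }
}

// Accumulates several batches of equal length into one accumulator,
// e.g. 4 batches of 512 complex values into a single 512-entry accumulator.
void cmacBatches(const std::vector<std::vector<float>>& as,
                 const std::vector<std::vector<float>>& bs,
                 std::vector<float>& acc) {
    assert(as.size() == bs.size());
    for (std::size_t k = 0; k < as.size(); ++k) cmac(as[k], bs[k], acc);
}
```

On the host the batch dimension is just an outer loop over pointers; the open question in this comment is how to pass that set of pointers to a kernel.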

@mn416 I might have an actually useful application for QPULib: Goertzel transform. This is an alternative calculation for FFT.

I've been thinking about it. This transform can be parallelized something awesome, much better than FFT. Will get back on this after my vacation.

It's really not an alternative, as it only computes single bins from the DFT, albeit efficiently. Nonetheless, it is a very relevant and useful algorithm.
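To illustrate the "single bin" point, here is a textbook Goertzel recurrence in scalar C++. It returns the squared magnitude of one DFT bin and nothing else. The function name goertzelPower is an assumption for illustration, and this is host code, not a QPU kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squared magnitude |X[k]|^2 of DFT bin k of the real signal x,
// computed with the Goertzel second-order recurrence.
double goertzelPower(const std::vector<double>& x, std::size_t k) {
    assert(!x.empty());
    const double pi = std::acos(-1.0);
    const double omega = 2.0 * pi * static_cast<double>(k) / static_cast<double>(x.size());
    const double coeff = 2.0 * std::cos(omega);
    double s1 = 0.0;  // state one sample back
    double s2 = 0.0;  // state two samples back
    for (const double sample : x) {
        const double s0 = sample + coeff * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}
```

Each bin has its own coefficient and its own two state variables and shares only the input samples, so different bins can be evaluated independently of each other. A full spectrum still costs one such pass per bin, which is why it only pays off for a limited number of frequencies.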

Hi there, great some discussion here.

It's really not an alternative, as it only computes single bins from the DFT, albeit efficiently. Nonetheless, it is a very relevant and useful algorithm.

I can answer this in several layers, but I will stick to this one: I did not state that Goertzel should replace the DFT; I stated that Goertzel can be parallelized much better than it. I hope you see the nuance.

I realize fully that Goertzel would not be a direct replacement for FFT, but when you're dealing with a limited number of frequencies it's a better alternative.

This will of course not stop me from wanting to do my utmost to get Goertzel in competitive shape. I'm actually looking to make some form of progressive benchmarking for both, in the spirit of how the docs are set up in this project. Also, a great finger exercise for getting to grips with GPU programming....and indeed useful for my work, where we use Goertzel massively.


EDIT: OK scrolling back I can definitely see how I might have implied it. I don't remember my line of thinking then any more, right now my above comment holds.

Another issue I'd like input on, I plan to use both GPU_FFT and QPULib in a project for a RPi Zero. But I see potential collision problems, mainly due to mailbox, so I'd like to extract the handling of the mailbox into a separate repository, which I can then use from both GPU_FFT and QPULib.