[Raspberry Pi] DMA Optimization Thread
Currently, each DMA frame looks like this:
1. Wait for the clock gate (copy 4 bytes to the PWM FIFO).
2. Copy 8 bytes from the current circular buffer frame to the GPSET and GPCLR registers.
3. Copy 8 zeros to the current circular buffer frame to reset it.
In actuality, step 3 consists of copying 8 precomputed bytes from a PWM queue to the current frame so that PWM is more efficient.
Each DMA control block adds 32 bytes of overhead, so the current implementation uses 36+40+40 = 116 bytes/frame of bus bandwidth.
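For reference, a minimal sketch of the control block layout those 32-byte figures come from (field names follow the BCM2835 peripherals datasheet [1]; this is illustrative, not our actual header):

```c
#include <stdint.h>

/* One BCM2835 DMA control block: 8 words = the 32 bytes of per-step overhead
 * counted above. It must be 32-byte aligned for the DMA engine to load it. */
struct dma_cb {
    uint32_t ti;          /* transfer information: DREQ pacing, widths, 2D mode, ... */
    uint32_t src;         /* source bus address */
    uint32_t dst;         /* destination bus address */
    uint32_t len;         /* transfer length in bytes (X/Y lengths in 2D mode) */
    uint32_t stride;      /* source/destination stride, only used in 2D mode */
    uint32_t next;        /* bus address of the next control block in the chain */
    uint32_t reserved[2];
} __attribute__((aligned(32)));

/* Per frame: 3 control blocks (3 * 32 bytes) plus 4 + 8 + 8 bytes of payload
 * = 116 bytes of bus traffic, matching the figure above. */
```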
There are a few ways to optimize this.
1: Try to optimize control block 3
By only doing step 3 every, say, 2 frames, and instead making it move 16 bytes, we save an average of 16 bytes of overhead each frame.
(Optionally, we can also halve the PWM buffer period, and use the stride feature to copy each PWM frame into two GPIO frames when performing step 3. The only advantage this gives is decreased memory footprint and perhaps slightly better cache performance.)
This decreased data usage should lead to more consistent timing in the long run by decreasing bus contention. But since each frame no longer uses the same amount of data (36+40 = 76 bytes on even frames and 36+40+48 = 124 bytes on odd frames), it may increase local timing variations.
Another option is to put the buffer-resetting routine into its own separate DMA channel and run it side-by-side at a lower AXI priority. The downside here is that DMA channels are a precious resource, and the more we use in userland, the more likely we are to interfere with other applications.
The final option is to replace the buffer resetting with CPU-based code. This only needs to transfer 250000*8 = 2 MB of data per second, which is totally doable on the CPU. The downside is that this takes away processor time from the motion planner, though we currently have a good amount of CPU to spare. The upside is that it is simpler, and will likely have next to zero interference with the DMA operation.
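A rough sketch of that CPU-side reset, with hypothetical names (`frames`, `NUM_FRAMES`, `dma_current_frame()`); in practice the current frame index would be recovered from the DMA channel's CONBLK_AD register:

```c
#include <stdint.h>

#define NUM_FRAMES 8192u   /* hypothetical circular buffer length */

/* Each frame holds one GPSET0 word and one GPCLR0 word - the 8 bytes that the
 * zeroing control block currently resets. */
extern volatile uint32_t frames[NUM_FRAMES][2];

/* Hypothetical helper: index of the frame the DMA engine is currently
 * outputting, e.g. derived from the channel's CONBLK_AD register. */
extern unsigned dma_current_frame(void);

/* Called periodically from the CPU (e.g. alongside the motion planner): zero
 * every frame the DMA has already consumed so it is clean by the time the
 * producer wraps back around to it. At 250 kHz this is ~2 MB/s of writes. */
void reset_consumed_frames(void)
{
    static unsigned tail = 0;
    unsigned head = dma_current_frame();
    while (tail != head) {
        frames[tail][0] = 0;
        frames[tail][1] = 0;
        tail = (tail + 1) % NUM_FRAMES;
    }
}
```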
2: Use the DMA STRIDE feature to combine steps 1 and 2.
Note that they would have to be reordered to account for the DREQ signal:
- Copy 16 bytes from the current circular buffer frame to the GPSET and GPCLR registers (16 bytes because the copy has to be contiguous in memory), and use the STRIDE feature to then copy 16 bytes into the memory around (and including) the PWM data register.
- Copy 8 zeros to the current circular buffer frame to reset it.
In actuality, the PCM peripheral would have to be used for data pacing instead of the PWM peripheral, due to its distance in address space from the GPIO registers (an offset of 0x3000 vs 0xc000, and STRIDE can only bridge gaps of up to 0x8000).
Previous memory bandwidth: 36+40+40 = 116 bytes/frame. New memory bandwidth: 64+40 = 104 bytes/frame.
This is pretty dirty, and it also complicates DMA syncing, as previously the STRIDE register was used to store the current index in the clock-gating control block. It also only works well when using pins < 32, because otherwise you need to copy 5 words in part 1 due to the one-word gap between the GPSET[2] and GPCLR[2] registers.
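For concreteness, a hedged sketch of what the combined control block could look like, reusing the `dma_cb` layout sketched earlier. The TXFR_LEN/STRIDE field split (XLENGTH low, YLENGTH high; S_STRIDE low, D_STRIDE high) is taken from the datasheet, and `PCM_ROW_BUS` is a placeholder for whichever 16-byte window around the PCM FIFO turns out to be safe to clobber:

```c
#include <stdint.h>

/* Transfer-information bits, per the BCM2835 peripherals datasheet. */
#define DMA_TI_TDMODE    (1u << 1)            /* enable 2D (strided) mode */
#define DMA_TI_DEST_INC  (1u << 4)
#define DMA_TI_DEST_DREQ (1u << 6)            /* gate writes on the DREQ selected below */
#define DMA_TI_SRC_INC   (1u << 8)
#define DMA_TI_PERMAP(n) ((uint32_t)(n) << 16)

#define PERMAP_PCM_TX 2                        /* DREQ line for PCM TX */

#define GPIO_BUS    0x7E200000u
#define GPSET0_BUS  (GPIO_BUS + 0x1C)          /* start of the 16-byte GPSET0..GPCLR0 window */
#define PCM_BUS     0x7E203000u
/* Hypothetical: a 16-byte window around (and including) the PCM FIFO whose
 * other words are harmless to overwrite - this is the "dirty" part. */
#define PCM_ROW_BUS (PCM_BUS + 0x04)

/* Fill the single control block that replaces steps 1 and 2: row 0 writes
 * 16 bytes of frame data onto GPSET0..GPCLR0, then the destination stride
 * jumps to the PCM window and row 1 writes the pacing data. */
void fill_combined_cb(struct dma_cb *cb, uint32_t frame_bus_addr)
{
    int16_t d_stride = (int16_t)(PCM_ROW_BUS - (GPSET0_BUS + 16));
    cb->ti     = DMA_TI_TDMODE | DMA_TI_SRC_INC | DMA_TI_DEST_INC
               | DMA_TI_DEST_DREQ | DMA_TI_PERMAP(PERMAP_PCM_TX);
    cb->src    = frame_bus_addr;               /* two contiguous 16-byte rows per frame */
    cb->dst    = GPSET0_BUS;
    cb->len    = (2u << 16) | 16u;             /* YLENGTH = 2 rows, XLENGTH = 16 bytes */
    cb->stride = ((uint32_t)(uint16_t)d_stride << 16) | 0u;  /* D_STRIDE (hi) | S_STRIDE (lo) */
    cb->next   = 0;                            /* chained to the reset CB when the chain is built */
}
```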
3: Use separate DMA channels for the pacing and for the GPIO updating
If one can use a single DMA control block to continually pump data into the PWM queue (paced), and then another DMA control block on a different channel to continually copy the head of the queue into the GPIOs (unpaced), we can mostly get rid of the overhead of the control blocks. By using the STRIDE feature, we can repeat any operation over contiguous memory 2^15 times with one CB, optionally advancing the source/dest addresses arbitrarily.
Sadly, the PWM FIFO head cannot be read arbitrarily. A possible solution is to find irrelevant registers near the PWM FIFO register and use those to store our GPIO data. The relevant bit of address space is (each 1 word in length): CONTROL, STATUS, DMAC, [unused - mapped?], CH1 RANGE, CH1 DATA, FIFO IN, [unused - mapped?], CH2 RANGE, CH2 DATA.
We are not using the 2 CH2 registers, and CH1 DATA is supposedly not updated in FIFO mode, but may still remain readable/writeable. Thus one possibility is to push frame data into CH2 RANGE and CH2 DATA, synchronized with the write to FIFO IN. This can be done as a 4-word copy that uses the STRIDE feature to repeat 2^15 times before loading the next control block.
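A sketch of that feeder control block, assuming the destination is the contiguous 16-byte window starting at FIFO IN (FIFO IN, the unused word, CH2 RANGE, CH2 DATA), so a -16 destination stride rewinds the window after every frame while the source walks forward through the circular buffer (again reusing the `dma_cb` layout above; `FRAMES_PER_CB` is a hypothetical constant):

```c
#include <stdint.h>

#define DMA_TI_TDMODE    (1u << 1)
#define DMA_TI_DEST_INC  (1u << 4)
#define DMA_TI_DEST_DREQ (1u << 6)             /* paced by the PWM DREQ */
#define DMA_TI_SRC_INC   (1u << 8)
#define DMA_TI_PERMAP(n) ((uint32_t)(n) << 16)

#define PERMAP_PWM   5                          /* PWM DREQ line */
#define PWM_BUS      0x7E20C000u
#define PWM_FIFO_BUS (PWM_BUS + 0x18)           /* FIFO IN, [unused], CH2 RANGE, CH2 DATA */

#define FRAMES_PER_CB 8192u                     /* hypothetical; bounded by the YLENGTH field */

/* One control block that keeps feeding the PWM register window: each DREQ-paced
 * row copies one 16-byte frame (pace word + GPIO data) from the circular buffer
 * onto FIFO IN..CH2 DATA, then the -16 destination stride rewinds the window
 * while the source keeps walking forward through the buffer. */
void fill_feeder_cb(struct dma_cb *cb, uint32_t frames_bus_addr)
{
    cb->ti     = DMA_TI_TDMODE | DMA_TI_SRC_INC | DMA_TI_DEST_INC
               | DMA_TI_DEST_DREQ | DMA_TI_PERMAP(PERMAP_PWM);
    cb->src    = frames_bus_addr;               /* FRAMES_PER_CB contiguous 16-byte frames */
    cb->dst    = PWM_FIFO_BUS;
    cb->len    = (FRAMES_PER_CB << 16) | 16u;   /* YLENGTH rows of XLENGTH = 16 bytes */
    cb->stride = 0xFFF0u << 16;                 /* D_STRIDE = -16 rewinds the window; S_STRIDE = 0 */
    cb->next   = 0;                             /* next CB covering the following chunk of frames */
}
```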
Next, we have 1 DMA channel continually copying data from CH2 RANGE into GPCLR0, and another DMA channel continually copying data from CH2 DATA into GPSET0. These can use the WAITS feature in DMA to delay for up to 32 clock cycles between each word, so they won't waste too much bus traffic. It's also peripheral <-> peripheral rather than peripheral <-> RAM, so I think it's routed onto a separate bus.
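For illustration, one way a drain channel could be set up (same `dma_cb` sketch; bit positions per the datasheet, and untested):

```c
#include <stdint.h>

#define DMA_TI_WAITS(n) ((uint32_t)(n) << 21)   /* dummy cycles added after each read/write */

#define GPIO_BUS     0x7E200000u
#define GPSET0_BUS   (GPIO_BUS + 0x1C)
#define GPCLR0_BUS   (GPIO_BUS + 0x28)
#define PWM_BUS      0x7E20C000u
#define PWM_RNG2_BUS (PWM_BUS + 0x20)           /* CH2 RANGE: holds the GPCLR0 data */
#define PWM_DAT2_BUS (PWM_BUS + 0x24)           /* CH2 DATA:  holds the GPSET0 data */

/* One unpaced drain channel: with SRC_INC and DEST_INC left clear, a single
 * control block re-reads the same PWM register and re-writes the same GPIO
 * register once per word of `len`, throttled by WAITS. Pointing `next` back
 * at the block itself keeps it running indefinitely. */
void fill_drain_cb(struct dma_cb *cb, uint32_t cb_bus_addr,
                   uint32_t src_bus, uint32_t dst_bus)
{
    cb->ti     = DMA_TI_WAITS(31);       /* roughly the "up to 32 cycles" delay above */
    cb->src    = src_bus;                /* e.g. PWM_RNG2_BUS */
    cb->dst    = dst_bus;                /* e.g. GPCLR0_BUS */
    cb->len    = (1u << 20) * 4u;        /* ~1M single-word copies before the CB reloads itself */
    cb->stride = 0;
    cb->next   = cb_bus_addr;            /* loop on itself forever */
}
```

One instance would run with PWM_RNG2_BUS -> GPCLR0_BUS and a second with PWM_DAT2_BUS -> GPSET0_BUS, ideally at a higher AXI priority than the feeder channel.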
If we were able to find unused registers near the PWM peripheral that are read as [GPCLR0 DATA], [0], [irrelevant], [GPSET0 DATA], then we could do the last part with just 1 DMA channel. Reading FIFO IN will always return the "bus default return value, pwm0", according to the processor documentation (pg 146)[1]. Thus, it should be possible to set pwm0 to [0], and then just write the desired [GPCLR0], [garbage], [garbage], [GPSET0] data to [CH1 DATA], [FIFO IN], [unused], [CH2 RANGE] continually, and copy this continually to [GPCLR0], [GPCLR1], [unused], [GPSET0] with a separate DMA channel. Each channel then requires only 1 control block to be loaded every 2^15 frames, so bandwidth is 16 bytes/frame plus a minimum of 16 bytes/frame of overhead in the second DMA channel that continually copies the data (plus resetting the buffer, which now has to be done separately). Note also that every write now fits perfectly into a 128-bit AXI "burst" operation, meaning that it can be pushed onto the bus all at once (?).
This method does have some drawbacks. First of all, we lose a bit of synchronization. It is possible for 1 DMA channel to copy data to the FIFO and for the other DMA channel to not pick it up in time. In that case, we get an undetectable missed step. However, lots of time-sensitive communication uses DMA, so these should be avoidable - especially by using a higher AXI priority for the FIFO -> GPIO operation than for the buffer -> FIFO operation. The other drawback is that the FIFO -> GPIO copy doesn't happen in sync with the buffer -> FIFO copy. If doing 1 us intervals for buffer -> FIFO and 0.5 us intervals for FIFO -> GPIO, then our maximum timing error is [0, 0.5 us), or +/- 0.25 us. If we were to use 0.25 us intervals for FIFO -> GPIO, then that's just a +/- 0.125 us spread - but then we have a total of 80 MB/s of bus traffic.
[1] Processor Documentation: http://www.raspberrypi.org/wp-content/uploads/2012/02/BCM2835-ARM-Peripherals.pdf
A recent project for the RPi B+ features VGA output on the GPIOs running 1080p@60fps: http://raspi.tv/2014/vga-for-pi-debuts-at-camjam-alongside-hdmipi-production-model-no-1
This is WAY more bandwidth and precision than we're achieving with DMA. Reportedly, Gert is using DPI. He says his VGA adapter only works on the B+ because of certain pin routings, but if that's just hsync/vsync lacking a route, we might still be able to make use of the other data.
Unfortunately, I can't find any documentation anywhere about using DPI on the Raspberry Pi, so that's pretty much out of the question for now. His software is only distributed as a binary blob (https://github.com/fenlogic/vga666)
For some immediate benefits, it may be useful to test DMA priorities > 7 (15 is the max AXI bus priority; when I wrote the code I mistakenly thought 7 was the max) and to enable 128-bit AXI bursts. It may also be worth adding DMA_WAIT_RESP to the DMA transfer information flags. Those changes alone might make 1 MHz throughput achievable.
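A sketch of what that experiment could look like (bit positions taken from the peripherals datasheet [1]; `dma_chan_cs` is a hypothetical pointer to the channel's CS register, and the `dma_cb` layout is the one sketched earlier):

```c
#include <stdint.h>

#define DMA_CS_PRIORITY(n)       ((uint32_t)(n) << 16)   /* AXI priority, 0..15 */
#define DMA_CS_PANIC_PRIORITY(n) ((uint32_t)(n) << 20)   /* priority used when the bus panics */
#define DMA_TI_WAIT_RESP         (1u << 3)               /* wait for the AXI write response */
#define DMA_TI_DEST_WIDTH_128    (1u << 5)               /* 128-bit destination writes */
#define DMA_TI_SRC_WIDTH_128     (1u << 9)               /* 128-bit source reads */
#define DMA_TI_BURST_LENGTH(n)   ((uint32_t)(n) << 12)   /* words per AXI burst */

/* Bump a channel to the maximum AXI priority and request wide bursts and
 * write-response waiting on its data-moving control block. */
void bump_dma_priority(volatile uint32_t *dma_chan_cs, struct dma_cb *cb)
{
    *dma_chan_cs |= DMA_CS_PRIORITY(15) | DMA_CS_PANIC_PRIORITY(15);
    cb->ti |= DMA_TI_WAIT_RESP
            | DMA_TI_SRC_WIDTH_128 | DMA_TI_DEST_WIDTH_128
            | DMA_TI_BURST_LENGTH(4);    /* e.g. 4-word bursts; tune experimentally */
}
```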
It may be possible to set pins with a single write to the GPLEV0 register instead of one write to each of GPCLR0 and GPSET0 when in 32-pin mode. This reduces each transfer down to 4 bytes. It does complicate the PWM/buffer reset code though - leaving it as-is would cause a pin to be immediately cleared on the frame after it is set.
The first time I tried this in 64-pin mode, it caused crashes. I believe that is because I was writing to system pins. And I believe all system pins are > 31.