Combine DMA loads
long-long-float opened this issue · 6 comments
When we compile following OpenCL code which calls vload16
three times with vc4c --asm -O3 -o dma_loads.asm dma_loads.cl
, VC4C outputs the following assembly(dma_loads.txt
). This contains three DMA loads, but these can be combined into one DMA load.
__kernel void dma_loads(int width, int height, __global uchar *in, __global uchar *out)
{
for (int y = 1; y < height - 1; y++) {
size_t idx = y * width;
uchar16 up = vload16(idx - width, in);
uchar16 center = vload16(idx, in);
uchar16 down = vload16(idx + width, in);
uchar16 r = (
up / (uchar16)(3) +
center / (uchar16)(3) +
down / (uchar16)(3));
vstore16(r, idx, out);
}
}
I want to implement the combiner and think the method.
At each block in CFG and LLVM IR
; Function Attrs: convergent nounwind
define spir_kernel void @dma_loads(i32 %width, i32 %height, i8 addrspace(1)* %in, i8 addrspace(1)* %out) local_unnamed_addr #0 !kernel_arg_addr_space !3 !kernel_arg_access_qual !4 !kernel_arg_type !5 !kernel_arg_base_type !5 !kernel_arg_type_qual !6 !kernel_arg_name !7 {
%sub = add nsw i32 %height, -1
%cmp23 = icmp sgt i32 %height, 2
br i1 %cmp23, label %.lr.ph.preheader, label %._crit_edge
.lr.ph.preheader: ; preds = %0
br label %.lr.ph
._crit_edge: ; preds = %.lr.ph, %0
ret void
.lr.ph: ; preds = %.lr.ph.preheader, %.lr.ph
%y.024 = phi i32 [ %inc, %.lr.ph ], [ 1, %.lr.ph.preheader ]
%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%add = add i32 %mul, %width
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
%div = udiv <16 x i8> %call, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
%div4 = udiv <16 x i8> %call2, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
%add5 = add nuw <16 x i8> %div4, %div
%div6 = udiv <16 x i8> %call3, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
%add7 = add <16 x i8> %add5, %div6
tail call spir_func void @_Z8vstore16Dv16_hjPU3AS1h(<16 x i8> %add7, i32 %mul, i8 addrspace(1)* %out) #2
%inc = add nuw nsw i32 %y.024, 1
%cmp = icmp slt i32 %inc, %sub
br i1 %cmp, label %.lr.ph, label %._crit_edge
}
- Collect
vload16
(actually_Z7vload16jPU3AS1Kh
). - Collect DMA load addresses from 1st argument of
vload16
. - Check whether load addresses are regular intervals.
- If true, combine theses loads.
I think the checking regular intervals is challenging. The symbolic execution can be used.
Example
Collect vload16
(and address variables)
%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width
%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
Addresses
%mul - %width
%mul
%mul + %width
These are regular intervals (%width
), then these are combined (I should create new function dma_load
and vpm_load
).
dma_load(i32 %x.093, i8 addrspace(1)* %in, 3 /*= rows*/, 16/*= columns*/)
%call = vpm_load
%call2 = vpm_load
%call3 = vpm_load
There is already some related code there. This was added some while ago to do a similar job, but I am not sure whether it is still applied. Anyway, that might be a good point to start.
@doe300 I have a question. Is there a way to find the instruction corresponded the local (for example, I want to get the instruction %sub1 = sub i32 %mul, %width
from the value i32 %sub1
). Or should I create this method?
%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width
%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
In general, you can query Local#getUsers(LocalUse::Type::WRITER)
to get all writers.
If there is just one writer, Local#getSingleWriter()
will do the trick. Also if you have a Value
instead of the local, you can call Value#getSingleWriter()
which does the same, but checks whether the value is a local. Of course the result needs to be checked for nullptr
in both cases!
The general memory access (before we know whether the memory area is lowered to a register, the VPM or accessed via TMU or DMA) is represented as MemoryInstruction
.
After the lowering, there are no specific instruction types for the various lowered types (e.g. register, VPM), instead the MemoryInstruction
is directly composed to the (hardware) instructions executed to do the memory accesses.
So if you want to insert a VPM access, have a look at the VPM header:
insertReadDMA
,insertWriteDMA
for "direct" DMA access (QPU <-> RAM), abstracting away the VPMVPM::insertReadVPM
,VPM::insertWriteVPM
for VPM access (QPU <-> VPM), e.g. also for caching/exchanging data between QPUsVPM::insertReadRAM
,VPM::insertWriteRAM
for DMA only access (VPM <-> RAM), e.g. to read/write back cached data
The VPM object required can be retrieved via the Method::vpm
member.
Does this information suffice or do you need a special instruction type to represent VPM accesses (e.g. for further processing)?
I understand, thanks.