doe300/VC4C

Combine DMA loads

long-long-float opened this issue · 6 comments

When we compile following OpenCL code which calls vload16 three times with vc4c --asm -O3 -o dma_loads.asm dma_loads.cl, VC4C outputs the following assembly(dma_loads.txt). This contains three DMA loads, but these can be combined into one DMA load.

__kernel void dma_loads(int width, int height, __global uchar *in, __global uchar *out)
{
    for (int y = 1; y < height - 1; y++) {
        size_t idx = y * width;
        uchar16 up   = vload16(idx - width, in);
        uchar16 center = vload16(idx, in);
        uchar16 down = vload16(idx + width, in);

        uchar16 r = (
            up                                               / (uchar16)(3) +
            center                                           / (uchar16)(3) +
            down                                             / (uchar16)(3));

        vstore16(r, idx, out);
    }
}

dma_loads.txt

I want to implement the combiner and think the method.

At each block in CFG and LLVM IR

; Function Attrs: convergent nounwind
define spir_kernel void @dma_loads(i32 %width, i32 %height, i8 addrspace(1)* %in, i8 addrspace(1)* %out) local_unnamed_addr #0 !kernel_arg_addr_space !3 !kernel_arg_access_qual !4 !kernel_arg_type !5 !kernel_arg_base_type !5 !kernel_arg_type_qual !6 !kernel_arg_name !7 {
  %sub = add nsw i32 %height, -1
  %cmp23 = icmp sgt i32 %height, 2
  br i1 %cmp23, label %.lr.ph.preheader, label %._crit_edge

.lr.ph.preheader:                                 ; preds = %0
  br label %.lr.ph

._crit_edge:                                      ; preds = %.lr.ph, %0
  ret void

.lr.ph:                                           ; preds = %.lr.ph.preheader, %.lr.ph
  %y.024 = phi i32 [ %inc, %.lr.ph ], [ 1, %.lr.ph.preheader ]
  %mul = mul nsw i32 %y.024, %width
  %sub1 = sub i32 %mul, %width
  %call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
  %call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
  %add = add i32 %mul, %width
  %call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2
  %div = udiv <16 x i8> %call, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
  %div4 = udiv <16 x i8> %call2, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
  %add5 = add nuw <16 x i8> %div4, %div
  %div6 = udiv <16 x i8> %call3, <i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3, i8 3>
  %add7 = add <16 x i8> %add5, %div6
  tail call spir_func void @_Z8vstore16Dv16_hjPU3AS1h(<16 x i8> %add7, i32 %mul, i8 addrspace(1)* %out) #2
  %inc = add nuw nsw i32 %y.024, 1
  %cmp = icmp slt i32 %inc, %sub
  br i1 %cmp, label %.lr.ph, label %._crit_edge
}
  1. Collect vload16(actually _Z7vload16jPU3AS1Kh).
  2. Collect DMA load addresses from 1st argument of vload16.
  3. Check whether load addresses are regular intervals.
  4. If true, combine theses loads.

I think the checking regular intervals is challenging. The symbolic execution can be used.

Example

Collect vload16 (and address variables)

%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width

%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2

Addresses

  1. %mul - %width
  2. %mul
  3. %mul + %width

These are regular intervals (%width), then these are combined (I should create new function dma_load and vpm_load).

dma_load(i32 %x.093, i8 addrspace(1)* %in, 3 /*= rows*/, 16/*= columns*/)
%call = vpm_load
%call2 = vpm_load
%call3 = vpm_load

There is already some related code there. This was added some while ago to do a similar job, but I am not sure whether it is still applied. Anyway, that might be a good point to start.

@doe300 I have a question. Is there a way to find the instruction corresponded the local (for example, I want to get the instruction %sub1 = sub i32 %mul, %width from the value i32 %sub1). Or should I create this method?

%mul = mul nsw i32 %y.024, %width
%sub1 = sub i32 %mul, %width
%add = add i32 %mul, %width

%call = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %sub1, i8 addrspace(1)* %in) #2
%call2 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %mul, i8 addrspace(1)* %in) #2
%call3 = tail call spir_func <16 x i8> @_Z7vload16jPU3AS1Kh(i32 %add, i8 addrspace(1)* %in) #2

In general, you can query Local#getUsers(LocalUse::Type::WRITER) to get all writers.

If there is just one writer, Local#getSingleWriter() will do the trick. Also if you have a Value instead of the local, you can call Value#getSingleWriter() which does the same, but checks whether the value is a local. Of course the result needs to be checked for nullptr in both cases!

@doe300 I want to insert the instruction (extends IntermediateInstruction) which do VPM load here, but I cannot find it.
Is there such the instruction, or should I create the instruction?

The general memory access (before we know whether the memory area is lowered to a register, the VPM or accessed via TMU or DMA) is represented as MemoryInstruction.
After the lowering, there are no specific instruction types for the various lowered types (e.g. register, VPM), instead the MemoryInstruction is directly composed to the (hardware) instructions executed to do the memory accesses.
So if you want to insert a VPM access, have a look at the VPM header:

  • insertReadDMA, insertWriteDMA for "direct" DMA access (QPU <-> RAM), abstracting away the VPM
  • VPM::insertReadVPM, VPM::insertWriteVPM for VPM access (QPU <-> VPM), e.g. also for caching/exchanging data between QPUs
  • VPM::insertReadRAM, VPM::insertWriteRAM for DMA only access (VPM <-> RAM), e.g. to read/write back cached data

The VPM object required can be retrieved via the Method::vpm member.

Does this information suffice or do you need a special instruction type to represent VPM accesses (e.g. for further processing)?

I understand, thanks.