flame/blis

Memory location in the prefetch instructions

site-g opened this issue · 5 comments

I am a little confused about how the memory locations in the prefetch instructions in the micro kernels are chosen. For example, in the following float32 kernel, the offsets of the memory addresses are set to (m - 1) * sizeof(float) or (n - 1) * sizeof(float).

prefetch(0, mem(r12, 15*4)) // prefetch c + 0*rs_c
prefetch(0, mem(r12, rdi, 1, 15*4)) // prefetch c + 1*rs_c
prefetch(0, mem(r12, rdi, 2, 15*4)) // prefetch c + 2*rs_c
prefetch(0, mem(rdx, 15*4)) // prefetch c + 3*rs_c
prefetch(0, mem(rdx, rdi, 1,15*4)) // prefetch c + 4*rs_c
prefetch(0, mem(rdx, rdi, 2,15*4)) // prefetch c + 5*rs_c
jmp(.SPOSTPFETCH) // jump to end of prefetching c
label(.SCOLPFETCH) // column-stored prefetching c
mov(var(cs_c), rsi) // load cs_c to rsi (temporarily)
lea(mem(, rsi, 4), rsi) // cs_c *= sizeof(float)
lea(mem(rsi, rsi, 2), rcx) // rcx = 3*cs_c;
prefetch(0, mem(r12, 5*4)) // prefetch c + 0*cs_c
prefetch(0, mem(r12, rsi, 1, 5*4)) // prefetch c + 1*cs_c
prefetch(0, mem(r12, rsi, 2, 5*4)) // prefetch c + 2*cs_c
prefetch(0, mem(r12, rcx, 1, 5*4)) // prefetch c + 3*cs_c
prefetch(0, mem(r12, rsi, 4, 5*4)) // prefetch c + 4*cs_c
lea(mem(r12, rsi, 4), rdx) // rdx = c + 4*cs_c;
prefetch(0, mem(rdx, rsi, 1, 5*4)) // prefetch c + 5*cs_c
prefetch(0, mem(rdx, rsi, 2, 5*4)) // prefetch c + 6*cs_c
prefetch(0, mem(rdx, rcx, 1, 5*4)) // prefetch c + 7*cs_c
prefetch(0, mem(rdx, rsi, 4, 5*4)) // prefetch c + 8*cs_c
lea(mem(r12, rsi, 8), rdx) // rdx = c + 8*cs_c;
prefetch(0, mem(rdx, rsi, 1, 5*4)) // prefetch c + 9*cs_c
prefetch(0, mem(rdx, rsi, 2, 5*4)) // prefetch c + 10*cs_c
prefetch(0, mem(rdx, rcx, 1, 5*4)) // prefetch c + 11*cs_c
prefetch(0, mem(rdx, rsi, 4, 5*4)) // prefetch c + 12*cs_c
lea(mem(r12, rcx, 4), rdx) // rdx = c + 12*cs_c;
prefetch(0, mem(rdx, rsi, 1, 5*4)) // prefetch c + 13*cs_c
prefetch(0, mem(rdx, rsi, 2, 5*4)) // prefetch c + 14*cs_c
prefetch(0, mem(rdx, rcx, 1, 5*4)) // prefetch c + 15*cs_c
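
(Reading the offsets off the code above: the row-stored branch prefetches 6 rows of C at a 15*4-byte offset, and since each row holds 16 floats, 15*4 = (16 - 1)*sizeof(float) points into the row's last element; the column-stored branch prefetches 16 columns at a 5*4-byte offset, and with 6 floats per column, 5*4 = (6 - 1)*sizeof(float) points into the column's last element.)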

In the float32 kernel for edge cases, on the other hand, the offsets are set to (m/2 - 1) * sizeof(double) or (n/2 - 1) * sizeof(double).

prefetch(0, mem(rcx, 7*8)) // prefetch c + 0*rs_c
prefetch(0, mem(rcx, rdi, 1, 7*8)) // prefetch c + 1*rs_c
prefetch(0, mem(rcx, rdi, 2, 7*8)) // prefetch c + 2*rs_c
prefetch(0, mem(rdx, 7*8)) // prefetch c + 3*rs_c
prefetch(0, mem(rdx, rdi, 1, 7*8)) // prefetch c + 4*rs_c
prefetch(0, mem(rdx, rdi, 2, 7*8)) // prefetch c + 5*rs_c
jmp(.SPOSTPFETCH) // jump to end of prefetching c
label(.SCOLPFETCH) // column-stored prefetching c
mov(var(cs_c), rsi) // load cs_c to rsi (temporarily)
lea(mem(, rsi, 4), rsi) // cs_c *= sizeof(float)
lea(mem(rsi, rsi, 2), rbp) // rbp = 3*cs_c;
prefetch(0, mem(rcx, 5*8)) // prefetch c + 0*cs_c
prefetch(0, mem(rcx, rsi, 1, 5*8)) // prefetch c + 1*cs_c
prefetch(0, mem(rcx, rsi, 2, 5*8)) // prefetch c + 2*cs_c
prefetch(0, mem(rcx, rbp, 1, 5*8)) // prefetch c + 3*cs_c
prefetch(0, mem(rcx, rsi, 4, 5*8)) // prefetch c + 4*cs_c
lea(mem(rcx, rsi, 4), rdx) // rdx = c + 4*cs_c;
prefetch(0, mem(rdx, rsi, 1, 5*8)) // prefetch c + 5*cs_c
prefetch(0, mem(rdx, rsi, 2, 5*8)) // prefetch c + 6*cs_c
prefetch(0, mem(rdx, rbp, 1, 5*8)) // prefetch c + 7*cs_c
prefetch(0, mem(rdx, rsi, 4, 5*8)) // prefetch c + 8*cs_c
lea(mem(rcx, rsi, 8), rdx) // rdx = c + 8*cs_c;
prefetch(0, mem(rdx, rsi, 1, 5*8)) // prefetch c + 9*cs_c
prefetch(0, mem(rdx, rsi, 2, 5*8)) // prefetch c + 10*cs_c
prefetch(0, mem(rdx, rbp, 1, 5*8)) // prefetch c + 11*cs_c
prefetch(0, mem(rdx, rsi, 4, 5*8)) // prefetch c + 12*cs_c
lea(mem(rcx, rbp, 4), rdx) // rdx = c + 12*cs_c;
prefetch(0, mem(rdx, rsi, 1, 5*8)) // prefetch c + 13*cs_c
prefetch(0, mem(rdx, rsi, 2, 5*8)) // prefetch c + 14*cs_c
prefetch(0, mem(rdx, rbp, 1, 5*8)) // prefetch c + 15*cs_c
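
(Here the row-stored offset is 7*8 = 56 bytes which, assuming the same 16-float rows, points into element 14 rather than into the last element at 15*4 = 60 bytes, and the column-stored offset is 5*8 = 40 bytes; the significance of this difference comes up below.)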

I have two questions:

  1. Which one of the above code snippets is correct, or are both OK?
  2. Why are the offsets set to (m - 1) elements rather than 0?

Both prefetch strategies accomplish the same thing. Consider the ideal case where the data pointer is originally aligned to a cache line boundary (typically 64 bytes). Then you can prefetch any address in that 64-byte region and that cache line will be loaded. Then you increment by 64 bytes, prefetch again, etc., and all is good. However, if the pointer is NOT aligned, then the first 64-byte region actually spans two cache lines. You now have the choice to prefetch the first or the second cache line: prefetching at offset 0 always gets the first one, and prefetching any address within the last element (f32 or f64) always gets the second one (note that both strategies above accomplish this). Here's an example:

| 64 bytes | 64 bytes | 64 bytes |
|CL1|    CL2   |    CL3   | CL4  |

If we assume that three 64-byte regions require three prefetches, then prefetching at offset 0 accomplishes:

| 64 bytes | 64 bytes | 64 bytes |
|XXX|    XXX   |    XXX   | ---  |   XXX = prefetched, --- = not prefetched

Instead, prefetching at any address within the last element accomplishes:

| 64 bytes | 64 bytes | 64 bytes |
|---|    XXX   |    XXX   | XXX  |   XXX = prefetched, --- = not prefetched

So far there is no difference in the average case. However, particularly when considering the C microtile, a later iteration of the microkernel will access the region just beyond what was prefetched (and loaded/stored) here. The last cache line accessed "spills over" into the next 64-byte region, which is precisely the data that will NOT be prefetched in that later microkernel iteration under approach #2 (prefetching within the last element). So, the prefetching in the later iteration looks like this:

| 64 bytes | 64 bytes | 64 bytes |
|YYY|    XXX   |    XXX   | XXX  |   XXX = prefetched, YYY = not prefetched here, but probably already warm in cache from a previous load.

If instead we prefetch at offset 0 then there is no benefit from previous loads/stores.
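
To make the spill-over effect concrete, here is a small standalone C sketch (not BLIS code; the NR = 16 block width, the 64-byte line size, the line_of helper, and the example addresses are all assumptions for illustration) that reports which cache line a prefetch at offset 0 touches versus one at offset (NR - 1)*sizeof(float), for an aligned and a misaligned row of C:

#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE 64
#define NR 16   /* assumed register block width (floats per row of the C microtile) */

static uintptr_t line_of(uintptr_t addr) { return addr / CACHE_LINE; }

static void show(uintptr_t c)
{
    uintptr_t last_off = (NR - 1) * sizeof(float);   /* 15*4 = 60 bytes */
    printf("c %% 64 = %2zu: row spans lines %zu..%zu; "
           "prefetch(c+0) hits line %zu, prefetch(c+%zu) hits line %zu\n",
           (size_t)(c % CACHE_LINE),
           (size_t)line_of(c), (size_t)line_of(c + NR * sizeof(float) - 1),
           (size_t)line_of(c), (size_t)last_off, (size_t)line_of(c + last_off));
}

int main(void)
{
    show(0x1000);        /* 64-byte aligned: the row is one cache line, both offsets hit it */
    show(0x1000 + 16);   /* misaligned: the row spans two lines; offset 0 hits the first
                            (likely already warm from an earlier iteration's spill-over),
                            offset 15*4 hits the second, which nothing has touched yet */
    return 0;
}

In the misaligned case the two strategies fetch different lines of the same row, and the line skipped by the last-element strategy is exactly the one that earlier loads/stores have already pulled in.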

Actually, the second one should perhaps use 15*4 as well, in case the pointer is only 4-byte aligned.
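
(Concretely: cache lines sit at 64-byte-aligned addresses, so if c happens to fall at an address ≡ 4 (mod 64), the line boundary is at c + 60; a prefetch at c + 7*8 = c + 56 then misses the line holding the last element, while c + 15*4 = c + 60 hits it. With an 8-byte-aligned c, offsets 56 and 60 always land on the same line, so the distinction only matters for 4-byte alignment.)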

I see. Prefetching the last element will benefit the next iteration. This design is ingenious.

Then I think the addresses in the prefetch instructions for the next A micropanel may not be correct.


prefetch(0, mem(rdx, r9, 1, 5*8))
prefetch(0, mem(rdx, r9, 2, 5*8))

Here rdx = a + ps_a4 is the address of A's next micropanel, and r9 = cs_a. For packed A, cs_a is 24 bytes, so the prefetches are not a problem, as they overlap with each other. However, if my understanding is correct, for small matrices the micropanels of A will not be packed, and then cs_a > 64 is possible. The following situation can arise:

| 64 bytes | 64 bytes | 64 bytes |
|---|    XXX   |    ---   |    XXX   |    ---   |    XXX = prefetched, --- = not prefetched
|        cs_a        |        cs_a        |

What we want is the first 6*4 bytes of each column of A, which will never be prefetched when cs_a > 64.

Is my understanding correct?

In the case that cs_a is too large, you would always need two prefetches to make sure you get all of the next data. I didn't write this particular code, but I guess the design decision was that picking an offset of 5*8 gives optimal performance if the data is tightly packed and at least gets ~1/2 of the data prefetched in the average, large-stride case. There is a limit on how many L1 prefetches can be in flight at the same time, so doing two prefetches per row is probably too many.
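
As a rough illustration of that trade-off, here is a small standalone C sketch (again not BLIS code; the base address a_next, the NCOL column count, the MR = 6 panel height, the PF_OFFSET = 5*8 offset, and the cs_a values are all made-up assumptions) that marks the cache lines touched by a prefetch at column_start + 5*8 for each column of the next A micropanel and counts how many lines holding the wanted leading 6*4 bytes go unprefetched:

#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE 64
#define MR   6                 /* assumed rows of the A micropanel */
#define NCOL 8                 /* columns simulated */
#define PF_OFFSET (5 * 8)      /* byte offset used by the kernel's next-A prefetch */

static uintptr_t line_of(uintptr_t addr) { return addr / CACHE_LINE; }

static void simulate(uintptr_t a_next, size_t cs_a)
{
    /* Mark every cache line touched by the per-column prefetches. */
    unsigned char hit[256] = {0};
    uintptr_t base_line = line_of(a_next);
    for (int j = 0; j < NCOL; j++)
        hit[line_of(a_next + (uintptr_t)j * cs_a + PF_OFFSET) - base_line] = 1;

    /* Check whether the leading MR*4 bytes of each column are covered. */
    int missed = 0;
    for (int j = 0; j < NCOL; j++) {
        uintptr_t col = a_next + (uintptr_t)j * cs_a;
        for (uintptr_t l = line_of(col); l <= line_of(col + MR * sizeof(float) - 1); l++)
            if (!hit[l - base_line]) missed++;
    }
    printf("cs_a = %3zu bytes: %d wanted cache line(s) not prefetched\n", cs_a, missed);
}

int main(void)
{
    uintptr_t a_next = 0x2000;   /* hypothetical, 64-byte-aligned next micropanel */
    simulate(a_next, 24);        /* packed: cs_a = MR*sizeof(float); prefetches overlap */
    simulate(a_next, 200);       /* unpacked, cs_a > 64: leading bytes can be missed */
    return 0;
}

With the packed 24-byte stride every wanted line ends up covered because neighbouring prefetches overlap, while with the made-up 200-byte stride half of the wanted cache lines go unprefetched in this particular example, roughly in line with the ~1/2 estimate above.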

I see. So this needs to be judged by profiling. Thank you!