Non-transpose implementation?
Closed this issue · 3 comments
I'm afraid my C++ isn't quite up to par, so I haven't been able to get this working correctly. What would need to change in order to compute A * B instead of A^T * B? I'm afraid Apple's code has outsmarted me. 😕
Thanks in advance for the help.
Hi Collin. I'm sorry for not seeing your question sooner. My email is a little backed up at the moment.
It's tricky to switch from A^T * B to A * B. I think Apple used A^T * B for their performance test because it's a fast memory layout for measuring gflops. The values from A and the values from B that get multiplied and accumulated together are both laid out sequentially in memory. Maximum cache hits. Etc.
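To make the memory layout point concrete, here is a minimal CPU sketch (not the Metal kernel from this repo). It assumes both matrices are stored row-major and that A is stored as innerCount x rowCountC, so that A^T * B has rowCountC rows. The function name and parameters are made up for illustration. For each step of the shared dimension, it reads one contiguous row of A and one contiguous row of B:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference routine: C = A^T * B, everything row-major.
// A is innerCount x rowCountC, B is innerCount x columnCountC,
// C is rowCountC x columnCountC.
std::vector<float> multiplyTransposedA(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       int innerCount,
                                       int rowCountC,
                                       int columnCountC)
{
    assert(a.size() == static_cast<std::size_t>(innerCount) * rowCountC);
    assert(b.size() == static_cast<std::size_t>(innerCount) * columnCountC);

    std::vector<float> c(static_cast<std::size_t>(rowCountC) * columnCountC, 0.0f);
    for (int k = 0; k < innerCount; ++k) {
        // Row k of A and row k of B are each contiguous in memory.
        const float* rowA = a.data() + static_cast<std::size_t>(k) * rowCountC;
        const float* rowB = b.data() + static_cast<std::size_t>(k) * columnCountC;
        // Accumulate the outer product of the two rows into C.
        for (int i = 0; i < rowCountC; ++i) {
            for (int j = 0; j < columnCountC; ++j) {
                c[static_cast<std::size_t>(i) * columnCountC + j] += rowA[i] * rowB[j];
            }
        }
    }
    return c;
}
```

With A not transposed, the values needed from A at a given step would sit a full row apart in memory instead of side by side.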
The way the kernel works is that it chooses 8 rows of A and 8 columns of B and uses that to create an 8x8 sector of C. With A transposed, the problem translates to 8 columns of A^T and 8 columns of B to accumulate into an 8x8 sector of C. The kernel traverses 8 sequential column positions in each input matrix and then advances to the next row. strideA and strideB are used to advance the matrix pointers to the next row of 8 values.
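Here is a rough scalar sketch of that traversal in plain C++. It is a simplification under my own assumptions, not the kernel's actual code: the real kernel accumulates 4x4 outer products, while this version does the whole 8x8 sector in one loop nest. The function name is hypothetical; inputA, inputB, strideA and strideB mirror the names used above.

```cpp
#include <array>

// Hypothetical sketch: accumulate one 8x8 sector of C = A^T * B.
// inputA points at 8 consecutive values in the first row of A^T's storage,
// inputB points at 8 consecutive values in the first row of B.
// strideA and strideB are the row lengths (in elements) of each matrix.
std::array<float, 64> accumulateSector(const float* inputA,
                                       const float* inputB,
                                       int rowCount,
                                       int strideA,
                                       int strideB)
{
    std::array<float, 64> sector{};  // 8x8, row-major, zero-initialized
    for (int k = 0; k < rowCount; ++k) {
        // Outer product of 8 values from A with 8 values from B.
        for (int i = 0; i < 8; ++i) {
            for (int j = 0; j < 8; ++j) {
                sector[i * 8 + j] += inputA[i] * inputB[j];
            }
        }
        // Both pointers move "down" one row in lockstep.
        inputA += strideA;
        inputB += strideB;
    }
    return sector;
}
```

The important part is the last two lines of the loop: both inputs share the same number of rows, so both pointers advance the same way.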
If A is no longer transposed, the thing that changes is that we're feeding 8 columns of matrix A into the kernel when the matrix multiplication wants 8 rows from A. I think it could be possible for this approach in the kernel to still work somehow.
The outer products that are accumulated from accumulateOuterProduct() into s00, s01, s10, s11 would probably have to be transposed, but the bigger issue is that inputA and inputB would no longer share the same number of rows. Instead, inputA would have to increment 8 elements for each loop instead of incrementing 8 rows. (The next sector of A would be "to the right" instead of "down".)
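As a sketch of what that pointer movement could look like, here is a CPU version for one 8x8 sector of C = A * B. This is only an illustration under assumptions I'm making up: both matrices are row-major, the shared dimension is a multiple of 8, each loop iteration consumes an 8x8 tile of A and 8 rows of B, and the function name and parameters are hypothetical. It is untested against the Metal kernel.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical sketch: one 8x8 sector of C = A * B, A and B row-major.
// strideA and strideB are row lengths in elements; innerCount is the
// shared dimension (columns of A, rows of B).
std::array<float, 64> sectorNonTransposed(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          int strideA,
                                          int strideB,
                                          int innerCount,
                                          int rowOrigin,
                                          int columnOrigin)
{
    assert(innerCount % 8 == 0);  // assumption made to keep the sketch short

    std::array<float, 64> sector{};
    const float* inputA = a.data() + rowOrigin * strideA;  // 8 rows of A
    const float* inputB = b.data() + columnOrigin;         // 8 columns of B

    for (int k = 0; k < innerCount; k += 8) {
        for (int i = 0; i < 8; ++i) {
            for (int j = 0; j < 8; ++j) {
                for (int kk = 0; kk < 8; ++kk) {
                    // A is read down a column of the tile (strided),
                    // B is read down its rows as before.
                    sector[i * 8 + j] += inputA[i * strideA + kk] * inputB[kk * strideB + j];
                }
            }
        }
        inputA += 8;            // next tile of A is "to the right"
        inputB += 8 * strideB;  // next tile of B is still "down"
    }
    return sector;
}
```

Note how the two pointers no longer advance in lockstep, and how the reads from A inside the tile are strided rather than sequential.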
Even with that worked out, there's a chance that the accumulated-transposed 4x4 outer products would have to be placed into the output matrix in a different order.
Sorry for not having a better answer. I hope that helps.
@collinhundley, I'm hoping you checked out WWDC this year. Tons of stuff is showing up in macOS 10.13, including Metal Performance Shaders functionality that makes the compute kernel in this repo obsolete.
Hey @otto-schnurr, sorry I never responded to your last message but I really appreciate your help.
Very excited for MPS on High Sierra. It definitely makes life a lot easier!