NervanaSystems/maxas

Question about broadcast of shared memory in SGEMM wiki

Closed this issue · 3 comments

Hi:
I read the SGEMM document through and found that the following statement differs from Nvidia's documentation.

In the wiki: https://github.com/NervanaSystems/maxas/wiki/SGEMM
It says:
How do you load from shared using quad vectors without bank conflicts? Well, according to the documentation, so long as all the accesses are within 32 words (128 bytes), we're fine.

From the CUDA C Programming Guide v7.5, section G.5.3 (Shared Memory):
A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).

The difference is "32 words" versus "the same 32-bit word". Which is right?

Nvidia's documentation is more correct. Your accesses need not be within a
contiguous 128 bytes. That's just how I happen to be using them in this
case.

Another interesting fact on 128 bit vector loads is that they require at a
minimum 2 transactions to process. So you only need to be concerned about
bank conflicts within the lower or upper 16 threads of a warp.
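A rough way to see why conflicts only matter within each half of the warp: model each 128-bit load as at least two transactions, with conflicts resolved per 16-thread half. This is a toy Python sketch under that assumption; the `transactions` function and the 32-words-per-transaction limit are my simplifications, not documented hardware behavior:

```python
# Toy model (a simplification, not an exact hardware description) of how
# LDS.128 is serviced: conflicts are resolved separately for the lower and
# upper 16 threads, and each transaction delivers at most 32 unique 32-bit
# words (128 bytes).

def transactions(addrs):
    """Minimum shared-memory transactions for a warp of 128-bit loads,
    given each thread's byte address."""
    total = 0
    for half in (addrs[:16], addrs[16:]):
        # unique 4-byte words requested by this half-warp
        words = {a // 4 + i for a in half for i in range(4)}
        total += -(-len(words) // 32)   # ceil(len(words) / 32)
    return max(total, 2)  # a 128-bit load always takes at least 2

# All 32 threads read the same 16 bytes (full broadcast): still 2 transactions.
print(transactions([0] * 32))                            # 2
# Quad loads spread over 128 bytes within each half-warp: also 2.
print(transactions([(t % 8) * 16 for t in range(32)]))   # 2
```

Under this model, any pattern whose lower and upper 16 threads each stay within 128 unique bytes hits the two-transaction floor, which is the best an LDS.128 can do.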


Thanks for sharing your discovery, Scott.
I have another question.
In the 128x128 blocking case, when reading A from shared memory before computing, tids 0, 2, 4, 6, 8, 10, 12, and 14 access the same 128-bit shared memory position of A (LDS.128 is used).
Since this spans more than 32 bits, will these LDS.128 instructions have shared memory bank conflicts?

Nvidia's doc says that threads accessing shared memory within the same 32-bit word have no bank conflict.
But in your code, several threads access the same 128-bit word.
My question is: will a bank conflict happen when accessing the same memory with the LDS.128 instruction?

--Xiuxia

Vector loads are an exception to that rule. There's no additional cost for
64-bit access, but 128-bit access is implemented internally as a pair of
64-bit reads and requires 2 transactions at a minimum. So that behaves just
like a bank conflict. But it's still worth using, as batching the reads is
more efficient for the hardware.

Also, if you don't broadcast the vector loads you can get additional bank
conflicts. Let's say your read address was this (with a 128-bit vector
load):

tid << 4

That's going to load 512 bytes and will require 4 transactions. Shared
memory can only deliver 128 unique bytes at a time. But it can deliver
more per thread if you broadcast.
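To make that arithmetic concrete, here is a small Python sketch counting the unique bytes the `tid << 4` pattern touches versus a broadcast variant (the broadcast pattern `(tid >> 2) << 4` is an illustrative choice of mine, not taken from the kernel):

```python
# Non-broadcast: every thread reads a distinct 16-byte quad at tid << 4.
unique = {t * 16 + b for t in range(32) for b in range(16)}
print(len(unique))             # 512 unique bytes requested by the warp
# Shared memory delivers at most 128 unique bytes per transaction:
print(-(-len(unique) // 128))  # 4 transactions minimum

# Broadcast: groups of 4 threads share one quad, e.g. (tid >> 2) << 4.
shared = {(t >> 2) * 16 + b for t in range(32) for b in range(16)}
print(len(shared))             # 128 unique bytes, one transaction's worth
```

Note that even the fully broadcast pattern still pays the two-transaction minimum of a 128-bit load; broadcasting just avoids the extra transactions beyond that floor.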

I'd play with some simple CUDA code and look at the "nvprof -m all" stats.
That will show you the shared memory transaction counts.
