Liu-xiandong/How_to_optimize_in_GPU

severe performance degradation

XG-zheng opened this issue · 4 comments

Hi! I am new to CUDA. I tried to reproduce gemm_V2 based on your Zhihu article, but I found that the performance of the code (sgemm_v2.cu) degrades badly after a slight modification.
I just modified the following code.

    // load A from shared memory to register
    #pragma unroll
    for (int thread_y = 0; thread_y < THREAD_SIZE_Y; thread_y += 4) {
        FETCH_FLOAT4(frag_a[0][thread_y]) = FETCH_FLOAT4(As[0][0][THREAD_SIZE_Y * ty + thread_y]);
    }
    // load B from shared memory to register
    #pragma unroll
    for (int thread_x = 0; thread_x < THREAD_SIZE_X; thread_x += 4) {
        FETCH_FLOAT4(frag_b[0][thread_x]) = FETCH_FLOAT4(Bs[0][0][THREAD_SIZE_X * tx + thread_x]);
    }

After the modification:

    // load A from shared memory to register
    #pragma unroll
    for (int thread_y = THREAD_SIZE_Y * ty; thread_y < THREAD_SIZE_Y * (ty + 1); thread_y += 4) {
        FETCH_FLOAT4(frag_a[0][thread_y - THREAD_SIZE_Y * ty]) = FETCH_FLOAT4(As[0][0][thread_y]);
    }
    // load B from shared memory to register
    #pragma unroll
    for (int thread_x = THREAD_SIZE_X * tx; thread_x < THREAD_SIZE_X * (tx + 1); thread_x += 4) {
        FETCH_FLOAT4(frag_b[0][thread_x - THREAD_SIZE_X * tx]) = FETCH_FLOAT4(Bs[0][0][thread_x]);
    }

I tested the 2048x2048x2048 case on a V100. The runtimes before and after the modification are 1.7 ms and 7.3 ms respectively.
Before: [profiler screenshot]
After: [profiler screenshot]

Hello, there are three reasons why this change degrades performance.

  1. It adds some multiply-add operations inside the for loop, and these run relatively slowly on the GPU;
  2. The for loop can no longer be unrolled on the GPU, which creates dependencies between instructions and further reduces performance;
  3. The shared-memory access can only begin after the multiply-add finishes, i.e. the shared-memory load now depends on that preceding arithmetic. This lengthens the shared-memory access time, and when memory access is slower the computation is harder to keep fed, so overall performance drops.

Overall, reasons 2 and 3 should dominate.
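
As a minimal sketch of reason 2, compare what the compiler sees in the two versions (code repeated from the issue above; THREAD_SIZE_Y is a compile-time constant, while ty is only known at run time):

    // Original form: the bounds 0 and THREAD_SIZE_Y are compile-time constants,
    // so the trip count is fixed and the compiler can fully unroll the loop,
    // emitting the FETCH_FLOAT4 loads back to back with no loop overhead.
    #pragma unroll
    for (int thread_y = 0; thread_y < THREAD_SIZE_Y; thread_y += 4) {
        FETCH_FLOAT4(frag_a[0][thread_y]) = FETCH_FLOAT4(As[0][0][THREAD_SIZE_Y * ty + thread_y]);
    }

    // Modified form: both bounds depend on ty, a run-time value. The trip
    // count is still THREAD_SIZE_Y / 4, but the compiler does not reliably
    // prove that from these bounds, so the loop stays a real loop: each
    // iteration pays index arithmetic and a branch, and the shared-memory
    // address depends on that arithmetic.
    #pragma unroll
    for (int thread_y = THREAD_SIZE_Y * ty; thread_y < THREAD_SIZE_Y * (ty + 1); thread_y += 4) {
        FETCH_FLOAT4(frag_a[0][thread_y - THREAD_SIZE_Y * ty]) = FETCH_FLOAT4(As[0][0][thread_y]);
    }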

Because of work I am slow to reply on GitHub; thanks for your question!

Hello, thank you very much for the detailed answer! But I still have a few doubts I'd like to ask about.
1. The change mentioned in the issue only touches lines 112-120 of sgemm_v2.cu, i.e. the first "cold" load of the data; the data loading inside the do{}while loop is unchanged. (And the trip count of this loop is actually only 2.)
2. I tried removing the loop-unrolling directive from this part of the original code, and performance did not drop noticeably.
3. Profiling with Nsight Compute does confirm that the achieved bandwidth drops badly after the change.
So I still have not figured out the exact cause of the performance drop.

Hello,

  1. The extra multiply-add instructions are the ones your change adds inside the for loop, but this should not be the main factor, so you can ignore that point.
  2. In the original implementation the trip count of the for loop is fixed, so the compiler can see it and unroll it; that is why removing #pragma unroll still unrolls effectively. After your change, however, the bounds involve an unknown quantity such as tx, and the compiler cannot unroll the loop effectively.
  3. The severe drop in achieved bandwidth after the change matches my expectation.

Try comparing the SASS code of the two versions; the difference should be obvious. Some extra FFMA instructions should appear; my version compiles to 512 FFMA instructions.
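
For completeness, one way to make that comparison (assuming the executable is named sgemm_v2; cuobjdump ships with the CUDA toolkit) is `cuobjdump --dump-sass sgemm_v2 | grep -c FFMA` on each build. And a minimal sketch of a fix, equivalent to the original code: hoist the ty-dependent offset out of the loop bounds so the trip count is a compile-time constant again (a_base is a name made up for this sketch):

    // Hoist the loop-invariant, ty-dependent offset into a local; the loop
    // bounds are compile-time constants again, so the compiler can fully unroll.
    const int a_base = THREAD_SIZE_Y * ty;
    #pragma unroll
    for (int thread_y = 0; thread_y < THREAD_SIZE_Y; thread_y += 4) {
        FETCH_FLOAT4(frag_a[0][thread_y]) = FETCH_FLOAT4(As[0][0][a_base + thread_y]);
    }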