
NYU HPC 23 Homework 2 submission


Report for Homework 2

Juexiao Zhang

Experiment machine: AMD Ryzen 7 5700G with Radeon Graphics, which has 8 CPU cores and a cache size of 512 KB. System: Ubuntu 22 running under the Windows Subsystem for Linux (WSL).

Question 1


Please see the comments in the corresponding code files.

Question 2


The timings for the different settings can be found in the log directory mmult-logs, named as follows:

  • O3Bx.log: timings for different matrix sizes under optimization flag -O3, with block size x
  • O3ref.log: timings of running function MMult0 (the original version) under -O3 as a reference.
  • O3omp1.log: timings of the OpenMP version with -O3 and OMP_NUM_THREADS=1

Among the different block sizes, I found 32 to be the best.

Also note that MMult0 under -O3 already performs rather well, because the compiler does a great deal of optimization on its own. The speedup from blocking is more obvious under -O0:

MMult0 under -O0:

Dimension       Time    Gflop/s       GB/s        Error
        32   3.386950   0.590510   0.608963 0.000000e+00
        64   3.431868   0.582819   0.601032 0.000000e+00
        96   3.393147   0.589798   0.608230 0.000000e+00
       128   3.402752   0.587960   0.606334 0.000000e+00
       160   3.415354   0.587652   0.606016 0.000000e+00
       192   3.419784   0.587792   0.606160 0.000000e+00

Block MMult1 under -O0:

Dimension       Time    Gflop/s       GB/s        Error
        32   1.407440   1.421039   1.465447 0.000000e+00
        64   1.418946   1.409609   1.453659 0.000000e+00
        96   1.422450   1.406920   1.450886 0.000000e+00
       128   1.451882   1.377993   1.421055 0.000000e+00
       160   1.427472   1.406010   1.449948 0.000000e+00
       192   1.437757   1.398094   1.441785 0.000000e+00
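
To make the comparison concrete, here is a minimal sketch of the blocking idea. The loop order, names, and the BLOCK_SIZE constant are illustrative, not copied from MMult1; it assumes column-major storage and n divisible by BLOCK_SIZE:

    // Blocked C += A*B for column-major n-by-n matrices: each
    // BLOCK_SIZE^3 sub-problem reuses tiles of A, B, and C while
    // they are resident in cache.
    constexpr long BLOCK_SIZE = 32;  // the best block size in my experiments

    void mmult_blocked(long n, const double* a, const double* b, double* c) {
      for (long j = 0; j < n; j += BLOCK_SIZE)
        for (long p = 0; p < n; p += BLOCK_SIZE)
          for (long i = 0; i < n; i += BLOCK_SIZE)
            // multiply one BLOCK_SIZE x BLOCK_SIZE tile
            for (long jj = j; jj < j + BLOCK_SIZE; jj++)
              for (long pp = p; pp < p + BLOCK_SIZE; pp++)
                for (long ii = i; ii < i + BLOCK_SIZE; ii++)
                  c[ii + jj * n] += a[ii + pp * n] * b[pp + jj * n];
    }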

Question 3


The AVX versions of sin4_intrin() and sin_vec() are implemented in the source file.
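
A minimal sketch of the AVX approach, assuming the degree-11 Taylor expansion of sin(x) evaluated with Horner's rule on 4 doubles at a time and 32-byte-aligned pointers (the function name and details here are illustrative; the actual sin4_intrin() may differ):

    #include <immintrin.h>

    // sin(x) ~ x (1 - x^2/3! + x^4/5! - x^6/7! + x^8/9! - x^10/11!),
    // evaluated bottom-up via Horner's rule; compile with -mavx.
    void sin4_avx_sketch(double* sinx, const double* x) {
      __m256d x1 = _mm256_load_pd(x);
      __m256d x2 = _mm256_mul_pd(x1, x1);            // x^2
      __m256d s  = _mm256_set1_pd(-1.0 / 39916800);  // -1/11!
      s = _mm256_add_pd(_mm256_mul_pd(s, x2), _mm256_set1_pd( 1.0 / 362880)); //  1/9!
      s = _mm256_add_pd(_mm256_mul_pd(s, x2), _mm256_set1_pd(-1.0 / 5040));   // -1/7!
      s = _mm256_add_pd(_mm256_mul_pd(s, x2), _mm256_set1_pd( 1.0 / 120));    //  1/5!
      s = _mm256_add_pd(_mm256_mul_pd(s, x2), _mm256_set1_pd(-1.0 / 6));      // -1/3!
      s = _mm256_add_pd(_mm256_mul_pd(s, x2), _mm256_set1_pd( 1.0));
      _mm256_store_pd(sinx, _mm256_mul_pd(s, x1));   // multiply by x
    }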

Question 4


(a)

Experiment logs are in the directory pipeline-logs:

  • allfuncs.log shows the timings and flop rates of the different functions discussed in the book. Unrolling by 4 is faster than unrolling by 2 (a sketch of the pattern follows this list).
  • unroll2sizes.log and unroll4sizes.log show the timings of unrolling by 2 and by 4, respectively, for different vector lengths. The time increases suddenly at length 4194304: a double-precision vector of that length occupies 4194304 × 8 B = 32 MB, so the vectors can no longer fit in the cache.
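
A minimal sketch of the unroll-by-4 pattern (names are mine, not from the book's code): four independent accumulators break the single serial dependency chain, so additions from different iterations can overlap in the pipeline.

    // Unroll-by-4 reduction: s0..s3 form independent dependency chains.
    double sum_unroll4(const double* a, long n) {
      double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      long i;
      for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];      // each chain only waits on its own previous add
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
      }
      for (; i < n; i++) s0 += a[i];  // remainder loop
      return (s0 + s1) + (s2 + s3);
    }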

(b)

  • compute.cpp

Running the command suggested in the comments gives the following outputs for compute.cpp under different optimization flags. Multiply-add:

                   -O3        -O2        -O1        -O0
time (s)      0.878982   0.878103   2.862484   3.334673
cycles/eval   2.900860   2.897913   9.446394  11.004615
Gflop/s       2.275184   2.277497   0.698679   0.599748

division:

                   -O3        -O2        -O1        -O0
time (s)      2.959592   2.975849   4.509251   4.930273
cycles/eval   9.766868   9.820526  14.880718  16.270086
Gflop/s       0.675754   0.672061   0.443527   0.405652

sqrt:

                   -O3        -O2        -O1        -O0
time (s)      4.499548   4.513189   6.493499   6.168601
cycles/eval  14.848636  14.893717  21.428766  20.356540
Gflop/s       0.444485   0.443140   0.307997   0.324220

sin:

                   -O3        -O2        -O1        -O0
time (s)      7.132642   7.154899   8.166835   8.259182
cycles/eval  23.537874  23.611318  26.950735  27.255557
Gflop/s       0.280399   0.279527   0.244891   0.242152
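
The pattern behind these numbers is a serial dependency chain per operation: each evaluation must wait for the previous one, so cycles/eval reflects the latency of the operation rather than its throughput. A minimal sketch with illustrative names (not the exact compute.cpp code):

    #include <cmath>

    // Each kernel repeats one operation on a carried value x.
    double repeat_fma(long n, double x, double a, double b) {
      for (long i = 0; i < n; i++) x = a * x + b;           // ~3 cycles/eval at -O3
      return x;
    }
    double repeat_div(long n, double x, double a) {
      for (long i = 0; i < n; i++) x = a / x;               // ~10 cycles/eval
      return x;
    }
    double repeat_sqrt(long n, double x) {
      for (long i = 0; i < n; i++) x = std::sqrt(x + 1.0);  // ~15 cycles/eval
      return x;
    }
    double repeat_sin(long n, double x) {
      for (long i = 0; i < n; i++) x = std::sin(x) + 1.0;   // ~24 cycles/eval
      return x;
    }
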
  • compute-vec.cpp

    When using optimization flag -O3 with implicit vectorization, the output is:

     compute-vec.cpp:16:21: optimized: loop vectorized using 32 byte vectors
     compute-vec.cpp:16:21: optimized: loop versioned for vectorization because of possible aliasing
     compute-vec.cpp:52:21: optimized: loop vectorized using 16 byte vectors
     compute-vec.cpp:46:5: optimized: basic block part vectorized using 32 byte vectors
     time = 0.879205
     flop-rate = 9.098600 Gflop/s
    
     time = 0.878442
     flop-rate = 9.106789 Gflop/s
    
     time = 0.881654
     flop-rate = 9.073573 Gflop/s
    

    Note that the computation is 4 times that of compute.cpp, since the vector length is 4 and the loop runs the same number of iterations; the time is nearly unchanged, so auto-vectorization gives roughly a 4x speedup over compute.cpp. Experiments show that the pragmas unroll and GCC ivdep do not change the performance, which suggests the compiler already does the work suggested by the pragmas when compiling with the vectorization flags enabled.
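
    For reference, the pragma placement looks roughly like this (the loop body is a stand-in for the one in compute-vec.cpp, not the actual code):

        // Under -O3 these made no measurable difference, since the
        // loop is already vectorized and unrolled by the compiler.
        void scale(long n, double* x, double a) {
        #pragma GCC ivdep     // promise: no loop-carried aliasing
        #pragma GCC unroll 4  // GCC's spelling of the unroll pragma
          for (long i = 0; i < n; i++) x[i] *= a;
        }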

    Using OpenMP (explicit vectorization) with optimization flag -O3, the output is:

    time = 0.880678
    flop-rate = 9.083412 Gflop/s
    
    time = 0.881578
    flop-rate = 9.074323 Gflop/s
    
    time = 0.879792
    flop-rate = 9.092741 Gflop/s
    

    Implicit and explicit vectorization give roughly the same performance, and this observation is consistent across the different optimization flags. This makes sense, since both perform the same vectorization and obtain the same level of parallelism (a sketch of the two variants follows below).

    When using flags -O0 and -O1 there is a clear speed difference (fn0 > fn1 > fn2), while the three are basically the same under -O2 and -O3. This shows that the vector-intrinsics versions gain the most from compiler optimization.
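
    A minimal sketch of the two approaches on the same loop (function names are illustrative, not from compute-vec.cpp):

        // Implicit: the plain loop, auto-vectorized by -O3.
        void axpy_implicit(long n, double a, const double* x, double* y) {
          for (long i = 0; i < n; i++) y[i] += a * x[i];
        }

        // Explicit: the same loop with an OpenMP SIMD request
        // (compile with -fopenmp or -fopenmp-simd).
        void axpy_explicit(long n, double a, const double* x, double* y) {
        #pragma omp simd
          for (long i = 0; i < n; i++) y[i] += a * x[i];
        }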

  • compute-vec-pipe.cpp

    Implicit vectorization: flop rate (Gflop/s) for all three functions with different M:

    M         1          2          4          8         16         32
    fn0   9.058724  18.194633  36.360510  56.787868  57.599702  31.552505
    fn1   9.091181  18.163392  36.279317  55.057205  32.607620  32.051048
    fn2   9.104850  18.153487  36.271088  55.028363  32.438479  30.974262

    The results show that M = 8 gives roughly the peak performance on this 8-core machine (fn0 stays high at M = 16, but fn1 and fn2 drop off beyond M = 8).

    OpenMP: flop rate (Gflop/s) for all three functions with different M:

    M         1          2          4          8         16         32
    fn0   9.093459  18.220608  36.264081  32.138932  57.980313  32.242794
    fn1   9.112367  18.167002  35.990228  35.780188  31.934670  34.335481
    fn2   9.107717  18.215529  36.141719  35.716261  31.918655  34.058868

    OpenMP gives slightly different results, especially at M = 8; this may be because the implicit vectorization is more compatible with the specific hardware, while OpenMP incurs some extra cost for being more general. A minimal sketch of the pipelining pattern behind these measurements follows.
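
    The sketch below shows the idea (the template parameter M and the names are illustrative): M independent multiply-add chains keep the FMA unit busy and hide its latency, until M grows past what the registers and issue width can sustain.

        // Chain j depends only on its own history, so up to M fused
        // multiply-adds can be in flight at once.
        template <int M>
        void fma_chains(long repeat, double* x, double a, double b) {
          for (long r = 0; r < repeat; r++)
            for (int j = 0; j < M; j++)
              x[j] = a * x[j] + b;
        }
        // usage: fma_chains<8>(n_repeats, x, a, b);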