Having trouble replicating results from README on 96-core CPU
geerlingguy opened this issue · 14 comments
I have just re-created the test bench scenario using a 96-core Ampere Altra Dev Workstation with 96 GB of RAM, running Ubuntu 20.04 server aarch64, with the following kernel:
root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# uname -a
Linux ampere 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:13:58 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
I have now replicated this setup with two clean installs (even going so far as removing all my NVMe drives, reformatting them, and re-installing Ubuntu 20.04 aarch64 twice for a completely fresh system).
Each time, I get around 980-1,000 Gflops following the explicit instructions in this repo (see also geerlingguy/sbc-reviews#19).
My most recent run today, on a new fresh install:
root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun -np 96 --allow-run-as-root --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 105000
NB : 256
PMAP : Row-major process mapping
P : 8
Q : 12
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 105000 256 8 12 774.54 9.9642e+02
HPL_pdgesv() start time Sun Aug 6 22:00:59 2023
HPL_pdgesv() end time Sun Aug 6 22:13:54 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.00850780e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
And the contents of the HPL.dat file:
root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
105000 Ns
1 # of NBs
256 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
8 Ps
12 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
According to the README, I should be getting over 1.2 Tflops using this same configuration.
Can you help me figure out what might be different between my test workstation setup and the one used to generate these results?
Hi Jeff,
I have a question about the HPL.dat settings: you set the Ns parameter to 105000, but you also mentioned that the system has 96 GB of RAM, so I am a bit confused. I found this website, which takes your system info and generates HPL.dat parameters you can copy and paste into HPL.dat:
https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
According to your HPL.dat configuration, the actual system memory should be at least 100-110 GB, or even more...
I am not sure about that, can you give me some advice?
Thanks
BR
ii-BOY
@ii-BOY - It's slightly more complex than that: Ns is not 1:1 correlated with memory size, and finding the right parameters to make HPL use as much RAM as you have (but not too much) is mostly a matter of trial and error.
As this project's README states, with Ns at 105000, the RAM usage is around 91 GB, which is about ideal for a 96 GB RAM system, assuming it's only running the benchmark.
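For reference, here's a minimal back-of-the-envelope sizing sketch (plain Python, not part of HPL; the 90% fraction is just an assumption to tune) showing how Ns relates to RAM and why 105000 lands roughly where it does:

```python
import math

ram_bytes = 96e9   # total system RAM (96 GB)
fraction = 0.90    # share of RAM to give the matrix; tune by trial and error
nb = 256           # block size (NB) from HPL.dat

# The N x N matrix of doubles needs N*N*8 bytes; solve for N and round
# down to a multiple of NB so the blocking divides evenly.
n = int(math.sqrt(fraction * ram_bytes / 8) // nb) * nb
print(n)                           # 103680 with these assumptions
print(f"{n**2 * 8 / 1e9:.0f} GB")  # ~86 GB for the matrix itself
# Ns=105000 from this repo works out to ~88 GB for the matrix alone,
# in line with the ~91 GB total usage mentioned in the README.

# The process grid also has to match the MPI rank count (mpirun -np 96):
p, q = 8, 12
assert p * q == 96
```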
Seeing that one of the Ampere devs who is benchmarking the same system (and whose numbers are used in the README) has gotten different results, we compared everything about our systems and determined the only real difference is the memory.
I am currently running Transcend ECC RAM, and he is running Samsung ECC RAM of the same spec. You wouldn't think different vendors' RAM would cause a 20% performance difference (they are both similar down to the CL22 CAS latency...), but stranger things have happened.
So I've ordered six sticks of 16 GB Samsung M393A2K40DB3-CWE DDR4-3200 ECC RAM, and they should come in a day or two... then I'll re-run my tests and see if they're any faster with Samsung RAM.
As a point of reference, I even tried forcing 3200 (instead of 'Auto') for the memory speed in the BIOS and got the same result (+/- 1%). Here are the current memory speed results from tinymembench:
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 9424.0 MB/s
C copy backwards (32 byte blocks) : 9387.8 MB/s
C copy backwards (64 byte blocks) : 9390.8 MB/s
C copy : 9366.1 MB/s
C copy prefetched (32 bytes step) : 9984.4 MB/s
C copy prefetched (64 bytes step) : 9984.1 MB/s
C 2-pass copy : 6391.4 MB/s
C 2-pass copy prefetched (32 bytes step) : 7237.8 MB/s
C 2-pass copy prefetched (64 bytes step) : 7489.6 MB/s
C fill : 43884.4 MB/s
C fill (shuffle within 16 byte blocks) : 43885.4 MB/s
C fill (shuffle within 32 byte blocks) : 43884.2 MB/s
C fill (shuffle within 64 byte blocks) : 43877.5 MB/s
NEON 64x2 COPY : 9961.9 MB/s
NEON 64x2x4 COPY : 10091.6 MB/s
NEON 64x1x4_x2 COPY : 8171.5 MB/s
NEON 64x2 COPY prefetch x2 : 11822.9 MB/s
NEON 64x2x4 COPY prefetch x1 : 12123.8 MB/s
NEON 64x2 COPY prefetch x1 : 11836.5 MB/s
NEON 64x2x4 COPY prefetch x1 : 12122.3 MB/s
---
standard memcpy : 9894.0 MB/s
standard memset : 44745.2 MB/s
---
NEON LDP/STP copy : 9958.0 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 11415.6 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 11420.5 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 11475.2 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 11452.9 MB/s
NEON LD1/ST1 copy : 10094.8 MB/s
NEON STP fill : 44744.7 MB/s
NEON STNP fill : 44745.2 MB/s
ARM LDP/STP copy : 10136.4 MB/s
ARM STP fill : 44731.7 MB/s
ARM STNP fill : 44730.0 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 2.3 ns / 2.9 ns
524288 : 3.2 ns / 3.9 ns
1048576 : 3.6 ns / 4.2 ns
2097152 : 22.9 ns / 33.0 ns
4194304 : 32.6 ns / 40.9 ns
8388608 : 38.1 ns / 43.5 ns
16777216 : 43.2 ns / 48.6 ns
33554432 : 86.2 ns / 112.2 ns
67108864 : 109.3 ns / 135.2 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 1.9 ns / 2.3 ns
524288 : 2.2 ns / 2.5 ns
1048576 : 2.6 ns / 2.8 ns
2097152 : 21.6 ns / 31.6 ns
4194304 : 31.1 ns / 39.4 ns
8388608 : 35.8 ns / 41.7 ns
16777216 : 38.5 ns / 43.0 ns
33554432 : 79.9 ns / 104.9 ns
67108864 : 101.1 ns / 125.4 ns
Run with:
git clone https://github.com/rojaster/tinymembench.git && cd tinymembench && make
./tinymembench
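If you want to compare two tinymembench runs side by side (like the tables in the next comment), a quick throwaway parser along these lines does the job; this is a hypothetical helper, not something shipped with tinymembench:

```python
import re
import sys

# Pull "  C copy  :  9366.1 MB/s" style lines out of two tinymembench logs
# and print the per-test percentage difference.
# Usage: python3 compare_tmb.py run_a.txt run_b.txt
LINE = re.compile(r"^\s*(.+?)\s*:\s*([\d.]+)\s*MB/s")

def parse(path):
    results = {}
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if m:
                results[m.group(1)] = float(m.group(2))
    return results

a, b = parse(sys.argv[1]), parse(sys.argv[2])
for test in a:
    if test in b:
        delta = (b[test] - a[test]) / a[test] * 100
        print(f"{test:45s} {a[test]:10.1f} {b[test]:10.1f} {delta:+6.1f}%")
```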
@geerlingguy I ran tinymembench on my machine and here are the comparative results.
I am attaching the memory bandwidth test results in this comment and the latency results in the next comment.
Test | JG run (MB/s) | RB run (MB/s) |
---|---|---|
C copy backwards | 9424 | 12800 |
C copy backwards (32 byte blocks) | 9387.8 | 12822.2 |
C copy backwards (64 byte blocks) | 9390.8 | 12831.5 |
C copy | 9366.1 | 12852.6 |
C copy prefetched (32 bytes step) | 9984.4 | 13667.5 |
C copy prefetched (64 bytes step) | 9984.1 | 13659.3 |
C 2-pass copy | 6391.4 | 8234.1 |
C 2-pass copy prefetched (32 bytes step) | 7237.8 | 10070.3 |
C 2-pass copy prefetched (64 bytes step) | 7489.6 | 10563.4 |
NEON 64x2 COPY | 9961.9 | 13638.6 |
NEON 64x2x4 COPY | 10091.6 | 13725.5 |
NEON 64x1x4_x2 COPY | 8171.5 | 10066.8 |
NEON 64x2 COPY prefetch x2 | 11822.9 | 15860.6 |
NEON 64x2x4 COPY prefetch x1 | 12123.8 | 16100.7 |
NEON 64x2x4 COPY prefetch x1 | 12122.3 | 16105 |
NEON 64x2 COPY prefetch x1 | 11836.5 | 15872.7 |
standard memcpy | 9894 | 13527.2 |
NEON LDP/STP copy | 9958 | 13628.1 |
NEON LDP/STP copy pldl2strm (32 bytes step) | 11415.6 | 15147.7 |
NEON LDP/STP copy pldl2strm (64 bytes step) | 11420.5 | 15257.2 |
NEON LDP/STP copy pldl1keep (32 bytes step) | 11475.2 | 15448.9 |
NEON LDP/STP copy pldl1keep (64 bytes step) | 11452.9 | 15423.7 |
NEON LD1/ST1 copy | 10094.8 | 13753.5 |
ARM LDP/STP copy | 10136.4 | 13765.1 |
C fill | 43884.4 | 43888.9 |
C fill (shuffle within 16 byte blocks) | 43885.4 | 43891.8 |
C fill (shuffle within 32 byte blocks) | 43884.2 | 43888.6 |
C fill (shuffle within 64 byte blocks) | 43877.5 | 43875.3 |
standard memset | 44745.2 | 44758 |
NEON STP fill | 44744.7 | 44755.1 |
NEON STNP fill | 44745.2 | 44749.3 |
ARM STP fill | 44731.7 | 44723.3 |
ARM STNP fill | 44730 | 44705.9 |
Except for the last 9 tests, my machine seems to be outperforming yours by about 20%. Also attaching a graphical representation of the same.
Note: The last 9 tests have not been mapped in the graph since they are within acceptable range of each other.
@geerlingguy The next part of the test was the memory latency test. The results are as below
Run 1 with MADV_NOHUGEPAGE
block size (bytes) | JG single read (ns) | JG dual read (ns) | RB single read (ns) | RB dual read (ns) |
---|---|---|---|---|
1024 | 0 | 0 | 0 | 0 |
2048 | 0 | 0 | 0 | 0 |
4096 | 0 | 0 | 0 | 0 |
8192 | 0 | 0 | 0 | 0 |
16384 | 0 | 0 | 0 | 0 |
32768 | 0 | 0 | 0 | 0 |
65536 | 0 | 0 | 0 | 0 |
131072 | 1.3 | 1.8 | 1.3 | 1.8 |
262144 | 2.3 | 2.9 | 2.4 | 3 |
524288 | 3.2 | 3.9 | 3.4 | 3.9 |
1048576 | 3.6 | 4.2 | 4.1 | 4.5 |
2097152 | 22.9 | 33 | 17.8 | 24.8 |
4194304 | 32.6 | 40.9 | 25 | 30.5 |
8388608 | 38.1 | 43.5 | 30.1 | 35 |
16777216 | 43.2 | 48.6 | 37 | 45.7 |
33554432 | 86.2 | 112.2 | 71.2 | 93.7 |
67108864 | 109.3 | 135.2 | 91.4 | 112.1 |
Run 2 with MADV_HUGEPAGE
block size (bytes) | JG single read (ns) | JG dual read (ns) | RB single read (ns) | RB dual read (ns) |
---|---|---|---|---|
1024 | 0 | 0 | 0 | 0 |
2048 | 0 | 0 | 0 | 0 |
4096 | 0 | 0 | 0 | 0 |
8192 | 0 | 0 | 0 | 0 |
16384 | 0 | 0 | 0 | 0 |
32768 | 0 | 0 | 0 | 0 |
65536 | 0 | 0 | 0 | 0 |
131072 | 1.3 | 1.8 | 1.3 | 1.8 |
262144 | 1.9 | 2.3 | 1.9 | 2.4 |
524288 | 2.2 | 2.5 | 2.3 | 2.5 |
1048576 | 2.6 | 2.8 | 2.6 | 2.8 |
2097152 | 21.6 | 31.6 | 16.2 | 23.1 |
4194304 | 31.1 | 39.4 | 23.3 | 28.7 |
8388608 | 35.8 | 41.7 | 26.7 | 30.4 |
16777216 | 38.5 | 43 | 28.3 | 31.5 |
33554432 | 79.9 | 104.9 | 64.9 | 85.9 |
67108864 | 101.1 | 125.4 | 83.7 | 102.9 |
I mapped one of the runs into a graph as seen below
As seen with other latency benchmarks, we're on par when comparing L1 and L2 cache. The differences start popping up as we move from L2 cache to system memory. Once again my results are ~20% better (lower latency).
Wow, what a difference the memory seems to make!
I got 2 of the 6 new RAM sticks just now. Running HPL with N=50000, I see:
- Old Transcend RAM (2x16 GB): 279.22 Gflops
- New Samsung RAM (2x16 GB): 369.05 Gflops
Encouraging early result! The rest of the RAM is coming Monday...
And here are the new tinymembench results (NOTE: this is just for the 2x16 GB sticks; performance will differ once all the memory channels are filled...):
Click to expand tinymembench results
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 10261.8 MB/s
C copy backwards (32 byte blocks) : 10233.9 MB/s
C copy backwards (64 byte blocks) : 10238.1 MB/s
C copy : 10277.0 MB/s
C copy prefetched (32 bytes step) : 10403.7 MB/s
C copy prefetched (64 bytes step) : 10407.1 MB/s
C 2-pass copy : 7065.6 MB/s
C 2-pass copy prefetched (32 bytes step) : 8825.9 MB/s
C 2-pass copy prefetched (64 bytes step) : 9179.0 MB/s
C fill : 42770.6 MB/s (1.1%)
C fill (shuffle within 16 byte blocks) : 42675.3 MB/s
C fill (shuffle within 32 byte blocks) : 42755.8 MB/s (0.2%)
C fill (shuffle within 64 byte blocks) : 42587.5 MB/s
NEON 64x2 COPY : 10633.4 MB/s
NEON 64x2x4 COPY : 10679.9 MB/s
NEON 64x1x4_x2 COPY : 6380.2 MB/s (0.1%)
NEON 64x2 COPY prefetch x2 : 12576.1 MB/s
NEON 64x2x4 COPY prefetch x1 : 12767.1 MB/s
NEON 64x2 COPY prefetch x1 : 12462.2 MB/s
NEON 64x2x4 COPY prefetch x1 : 12763.3 MB/s
---
standard memcpy : 10582.3 MB/s
standard memset : 42988.5 MB/s (1.3%)
---
NEON LDP/STP copy : 10645.9 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 11909.5 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 11902.6 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 11816.3 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 11818.2 MB/s
NEON LD1/ST1 copy : 10690.8 MB/s
NEON STP fill : 43059.6 MB/s (1.2%)
NEON STNP fill : 43150.2 MB/s (0.3%)
ARM LDP/STP copy : 10711.8 MB/s
ARM STP fill : 43011.2 MB/s (1.1%)
ARM STNP fill : 43117.3 MB/s (0.2%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 2.4 ns / 2.9 ns
524288 : 3.4 ns / 3.9 ns
1048576 : 7.7 ns / 11.3 ns
2097152 : 20.5 ns / 29.5 ns
4194304 : 28.9 ns / 36.7 ns
8388608 : 35.7 ns / 41.7 ns
16777216 : 45.2 ns / 55.4 ns
33554432 : 74.5 ns / 95.5 ns
67108864 : 89.0 ns / 107.1 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 1.9 ns / 2.3 ns
524288 : 2.3 ns / 2.5 ns
1048576 : 2.6 ns / 2.8 ns
2097152 : 19.1 ns / 27.8 ns
4194304 : 27.6 ns / 35.0 ns
8388608 : 31.4 ns / 37.3 ns
16777216 : 33.6 ns / 38.6 ns
33554432 : 67.7 ns / 87.8 ns
67108864 : 80.6 ns / 97.5 ns
memcpy goes from 9894.0 to 10582.3 MB/s, a ~7% difference (again, with 2 sticks vs 6), while HPL goes from 279 to 369 Gflops, about a 32% improvement! Latency is vastly improved over the Transcend RAM as well.
Can't wait for the other sticks to arrive. I will finally pass the 'teraflop on a CPU' barrier :)
I have a Twitter (X?) thread going on about the memory differences. Going to also try to see if I can look up timing data in Linux via decode-dimms (CPU-Z under Windows on Arm isn't showing timing data).
Hmm...
$ sudo apt install -y i2c-tools
$ sudo modprobe eeprom
$ decode-dimms
# decode-dimms version 4.3
Memory Serial Presence Detect Decoder
By Philip Edelbrock, Christian Zuckschwerdt, Burkart Lingner,
Jean Delvare, Trent Piepho and others
Number of SDRAM DIMMs detected and decoded: 0
Hi Jeff,
I am not sure about the meaning of Run 1 and Run 2. I tried running tinymembench myself and saw [MADV_NOHUGEPAGE] for the first pass and then [MADV_HUGEPAGE] for the second, so did you run tinymembench twice with the same configuration, or did you run it once and get two results (HUGEPAGE and NOHUGEPAGE)?
Thanks
BR
ii-BOY
@ii-BOY Hi, this test was run just once.
Internally, tinymembench runs twice: once with THP disabled (MADV_NOHUGEPAGE) and once with THP enabled (MADV_HUGEPAGE).
You can find more information here : https://man7.org/linux/man-pages/man2/madvise.2.html
Thanks.
Note: Thanks for pointing out the omission in the descriptions for the Run 1 and Run 2 tables that were posted. I've edited them to reflect MADV_NOHUGEPAGE and MADV_HUGEPAGE respectively.
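For anyone curious what those two passes mean in practice, here's a minimal sketch (plain Python, assuming Linux and Python 3.8+ for mmap.madvise) of the kind of hints involved; it isn't tinymembench's actual code, just the same madvise flags applied to a test buffer:

```python
import mmap

SIZE = 64 * 1024 * 1024  # 64 MiB buffer, the largest block size in the tables above

buf = mmap.mmap(-1, SIZE)  # anonymous mapping we can hand hints to

# Ask the kernel NOT to back this region with transparent huge pages;
# this corresponds to the [MADV_NOHUGEPAGE] table.
buf.madvise(mmap.MADV_NOHUGEPAGE)

# The opposite hint corresponds to the [MADV_HUGEPAGE] table. With 2 MiB
# huge pages there are far fewer TLB entries to miss, which is why the
# large-block latencies are lower in the second table.
buf.madvise(mmap.MADV_HUGEPAGE)

buf.close()
```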
tinymembench run with all six sticks (96 GB total) of Samsung RAM:
Click to view tinymembench results
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 11416.8 MB/s
C copy backwards (32 byte blocks) : 11374.5 MB/s
C copy backwards (64 byte blocks) : 11380.7 MB/s
C copy : 11486.6 MB/s
C copy prefetched (32 bytes step) : 12074.4 MB/s
C copy prefetched (64 bytes step) : 12072.5 MB/s
C 2-pass copy : 7456.1 MB/s
C 2-pass copy prefetched (32 bytes step) : 8489.5 MB/s
C 2-pass copy prefetched (64 bytes step) : 8901.7 MB/s
C fill : 43888.0 MB/s
C fill (shuffle within 16 byte blocks) : 43888.0 MB/s
C fill (shuffle within 32 byte blocks) : 43888.3 MB/s
C fill (shuffle within 64 byte blocks) : 43882.9 MB/s
NEON 64x2 COPY : 12176.6 MB/s
NEON 64x2x4 COPY : 12229.0 MB/s
NEON 64x1x4_x2 COPY : 10022.1 MB/s
NEON 64x2 COPY prefetch x2 : 13542.4 MB/s
NEON 64x2x4 COPY prefetch x1 : 13902.0 MB/s
NEON 64x2 COPY prefetch x1 : 13579.6 MB/s
NEON 64x2x4 COPY prefetch x1 : 13903.2 MB/s
---
standard memcpy : 12107.0 MB/s
standard memset : 44746.4 MB/s
---
NEON LDP/STP copy : 12186.3 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 13778.2 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 13785.9 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 13847.4 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 13825.8 MB/s
NEON LD1/ST1 copy : 12242.3 MB/s
NEON STP fill : 44745.9 MB/s
NEON STNP fill : 44747.5 MB/s
ARM LDP/STP copy : 12298.1 MB/s
ARM STP fill : 44730.0 MB/s
ARM STNP fill : 44730.8 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 2.4 ns / 2.9 ns
524288 : 3.4 ns / 3.9 ns
1048576 : 4.1 ns / 4.4 ns
2097152 : 23.2 ns / 33.2 ns
4194304 : 32.7 ns / 41.1 ns
8388608 : 39.7 ns / 46.2 ns
16777216 : 47.7 ns / 51.0 ns
33554432 : 81.6 ns / 103.5 ns
67108864 : 102.1 ns / 122.2 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 1.9 ns / 2.3 ns
524288 : 2.3 ns / 2.5 ns
1048576 : 2.6 ns / 2.8 ns
2097152 : 21.6 ns / 31.6 ns
4194304 : 31.4 ns / 39.4 ns
8388608 : 36.2 ns / 41.7 ns
16777216 : 38.5 ns / 43.0 ns
33554432 : 74.8 ns / 95.7 ns
67108864 : 93.6 ns / 112.0 ns
New result:
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 105000
NB : 256
PMAP : Row-major process mapping
P : 8
Q : 12
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 105000 256 8 12 649.46 1.1883e+03
HPL_pdgesv() start time Mon Sep 11 20:21:22 2023
HPL_pdgesv() end time Mon Sep 11 20:32:11 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.00850780e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
1188.3 Gflops at 296W = 4.01 Gflops/W
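For anyone double-checking that arithmetic, here's a small sketch (plain Python; the values are copied from the run above, and the 296 W figure is the measured wall power) using the flop count HPL itself reports against:

```python
n = 105_000      # problem size (Ns)
time_s = 649.46  # wall time reported by HPL above
power_w = 296    # measured wall power during the run

# Flop count for LU factorization + solve, as HPL reports it: 2/3*N^3 + 3/2*N^2
flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
gflops = flops / time_s / 1e9
print(f"{gflops:.1f} Gflops")              # ~1188.3, matching the report above
print(f"{gflops / power_w:.2f} Gflops/W")  # ~4.01
print(f"{n**2 * 8 / 1e9:.1f} GB")          # ~88 GB for the N x N matrix of doubles
```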
It seems like my Samsung RAM still performs just a bit under whatever RAM @rbapat-ampere is using in his system, which would explain the delta!
I think this issue can be closed, as we've found the culprit.