Having trouble replicating results from README on 96-core CPU
geerlingguy opened this issue · 14 comments
I have just re-created the test bench scenario using a 96-core Ampere Altra Dev Workstation with 96 GB of RAM, running Ubuntu 20.04 server aarch64, with the following kernel:
root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# uname -a
Linux ampere 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:13:58 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
I have now replicated this setup with two clean installs (even going so far as removing all my NVMe drives, reformatting them, and re-installing Ubuntu 20.04 aarch64 twice for a completely fresh system).
Each time, I get around 980-1,000 Gflops following the explicit instructions in this repo (see also geerlingguy/sbc-reviews#19).
My most recent run today, on a new fresh install:
root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun -np 96 --allow-run-as-root --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 105000
NB : 256
PMAP : Row-major process mapping
P : 8
Q : 12
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 105000 256 8 12 774.54 9.9642e+02
HPL_pdgesv() start time Sun Aug 6 22:00:59 2023
HPL_pdgesv() end time Sun Aug 6 22:13:54 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.00850780e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
And the contents of the HPL.dat file:
root@ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
105000 Ns
1 # of NBs
256 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
8 Ps
12 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
According to the README, I should be getting over 1.2 Tflops using this same configuration.
Can you help me figure out what might be different between my test workstation setup and the one used to generate these results?
Hi Jeff,
I have a question about the HPL.dat settings: you set the Ns parameter to 105000, but you also mentioned that the system has 96 GB of RAM, so I am a bit confused. I found this website, which takes your system info and generates HPL.dat parameters you can copy and paste into HPL.dat:
https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
According to your HPL.dat configuration, the actual system memory should be at least 100-110 GB, or even more...
I am not sure about that, can you give me some advice?
Thanks
BR
ii-BOY
@ii-BOY - It's slightly more complex than that: Ns is not 1:1 correlated with memory size, and finding the right parameters to make HPL use as much RAM as you have (but not too much) is mostly a matter of trial and error.
As this project's README states, with Ns at 105000, the RAM usage is around 91 GB, which is about ideal for a 96 GB RAM system, assuming it's only running the benchmark.
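For reference, here's a minimal back-of-the-envelope sizing sketch (plain Python, not part of HPL; the 90% fraction is just an assumption to tune) showing how Ns relates to RAM and why 105000 lands roughly where it does:

```python
import math

ram_bytes = 96e9   # total system RAM (96 GB)
fraction = 0.90    # share of RAM to give the matrix; tune by trial and error
nb = 256           # block size (NB) from HPL.dat

# The N x N matrix of doubles needs N*N*8 bytes; solve for N and round
# down to a multiple of NB so the blocking divides evenly.
n = int(math.sqrt(fraction * ram_bytes / 8) // nb) * nb
print(n)                           # 103680 with these assumptions
print(f"{n**2 * 8 / 1e9:.0f} GB")  # ~86 GB for the matrix itself
# Ns=105000 from this repo works out to ~88 GB for the matrix alone,
# in line with the ~91 GB total usage mentioned in the README.

# The process grid also has to match the MPI rank count (mpirun -np 96):
p, q = 8, 12
assert p * q == 96
```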
Seeing that one of the Ampere devs who is benchmarking the same system (and whose numbers are used in the README) has gotten different results, we compared everything about our systems and determined the only real difference is the memory.
I am currently running Transcend ECC RAM, and he is running Samsung ECC RAM of the same spec. You wouldn't think different vendors' RAM would cause a 20% performance difference (they are both similar down to the CL22 CAS latency...), but stranger things have happened.
So I've ordered six sticks of 16 GB Samsung M393A2K40DB3-CWE DDR4-3200 ECC RAM, and they should come in a day or two... then I'll re-run my tests and see if they're any faster with Samsung RAM.
As a point of reference, I even tried forcing 3200 (instead of 'Auto') for the memory speed in the BIOS and got the same result (+/- 1%). Here are the current memory speed results from tinymembench:
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 9424.0 MB/s
C copy backwards (32 byte blocks) : 9387.8 MB/s
C copy backwards (64 byte blocks) : 9390.8 MB/s
C copy : 9366.1 MB/s
C copy prefetched (32 bytes step) : 9984.4 MB/s
C copy prefetched (64 bytes step) : 9984.1 MB/s
C 2-pass copy : 6391.4 MB/s
C 2-pass copy prefetched (32 bytes step) : 7237.8 MB/s
C 2-pass copy prefetched (64 bytes step) : 7489.6 MB/s
C fill : 43884.4 MB/s
C fill (shuffle within 16 byte blocks) : 43885.4 MB/s
C fill (shuffle within 32 byte blocks) : 43884.2 MB/s
C fill (shuffle within 64 byte blocks) : 43877.5 MB/s
NEON 64x2 COPY : 9961.9 MB/s
NEON 64x2x4 COPY : 10091.6 MB/s
NEON 64x1x4_x2 COPY : 8171.5 MB/s
NEON 64x2 COPY prefetch x2 : 11822.9 MB/s
NEON 64x2x4 COPY prefetch x1 : 12123.8 MB/s
NEON 64x2 COPY prefetch x1 : 11836.5 MB/s
NEON 64x2x4 COPY prefetch x1 : 12122.3 MB/s
---
standard memcpy : 9894.0 MB/s
standard memset : 44745.2 MB/s
---
NEON LDP/STP copy : 9958.0 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 11415.6 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 11420.5 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 11475.2 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 11452.9 MB/s
NEON LD1/ST1 copy : 10094.8 MB/s
NEON STP fill : 44744.7 MB/s
NEON STNP fill : 44745.2 MB/s
ARM LDP/STP copy : 10136.4 MB/s
ARM STP fill : 44731.7 MB/s
ARM STNP fill : 44730.0 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 2.3 ns / 2.9 ns
524288 : 3.2 ns / 3.9 ns
1048576 : 3.6 ns / 4.2 ns
2097152 : 22.9 ns / 33.0 ns
4194304 : 32.6 ns / 40.9 ns
8388608 : 38.1 ns / 43.5 ns
16777216 : 43.2 ns / 48.6 ns
33554432 : 86.2 ns / 112.2 ns
67108864 : 109.3 ns / 135.2 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 1.9 ns / 2.3 ns
524288 : 2.2 ns / 2.5 ns
1048576 : 2.6 ns / 2.8 ns
2097152 : 21.6 ns / 31.6 ns
4194304 : 31.1 ns / 39.4 ns
8388608 : 35.8 ns / 41.7 ns
16777216 : 38.5 ns / 43.0 ns
33554432 : 79.9 ns / 104.9 ns
67108864 : 101.1 ns / 125.4 ns
Run with:
git clone https://github.com/rojaster/tinymembench.git && cd tinymembench && make
./tinymembench
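If you want to compare two tinymembench runs side by side (like the tables in the next comment), a quick throwaway parser along these lines does the job; this is a hypothetical helper, not something shipped with tinymembench:

```python
import re
import sys

# Pull "  C copy  :  9366.1 MB/s" style lines out of two tinymembench logs
# and print the per-test percentage difference.
# Usage: python3 compare_tmb.py run_a.txt run_b.txt
LINE = re.compile(r"^\s*(.+?)\s*:\s*([\d.]+)\s*MB/s")

def parse(path):
    results = {}
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if m:
                results[m.group(1)] = float(m.group(2))
    return results

a, b = parse(sys.argv[1]), parse(sys.argv[2])
for test in a:
    if test in b:
        delta = (b[test] - a[test]) / a[test] * 100
        print(f"{test:45s} {a[test]:10.1f} {b[test]:10.1f} {delta:+6.1f}%")
```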
@geerlingguy I ran tinymembench on my machine and here are the comparative results.
I am attaching the memory bandwidth test results in this comment and the latency results in the next comment.
Test | JG run (MB/s) | RB run (MB/s) |
---|---|---|
C copy backwards | 9424 | 12800 |
C copy backwards (32 byte blocks) | 9387.8 | 12822.2 |
C copy backwards (64 byte blocks) | 9390.8 | 12831.5 |
C copy | 9366.1 | 12852.6 |
C copy prefetched (32 bytes step) | 9984.4 | 13667.5 |
C copy prefetched (64 bytes step) | 9984.1 | 13659.3 |
C 2-pass copy | 6391.4 | 8234.1 |
C 2-pass copy prefetched (32 bytes step) | 7237.8 | 10070.3 |
C 2-pass copy prefetched (64 bytes step) | 7489.6 | 10563.4 |
NEON 64x2 COPY | 9961.9 | 13638.6 |
NEON 64x2x4 COPY | 10091.6 | 13725.5 |
NEON 64x1x4_x2 COPY | 8171.5 | 10066.8 |
NEON 64x2 COPY prefetch x2 | 11822.9 | 15860.6 |
NEON 64x2x4 COPY prefetch x1 | 12123.8 | 16100.7 |
NEON 64x2x4 COPY prefetch x1 | 12122.3 | 16105 |
NEON 64x2 COPY prefetch x1 | 11836.5 | 15872.7 |
standard memcpy | 9894 | 13527.2 |
NEON LDP/STP copy | 9958 | 13628.1 |
NEON LDP/STP copy pldl2strm (32 bytes step) | 11415.6 | 15147.7 |
NEON LDP/STP copy pldl2strm (64 bytes step) | 11420.5 | 15257.2 |
NEON LDP/STP copy pldl1keep (32 bytes step) | 11475.2 | 15448.9 |
NEON LDP/STP copy pldl1keep (64 bytes step) | 11452.9 | 15423.7 |
NEON LD1/ST1 copy | 10094.8 | 13753.5 |
ARM LDP/STP copy | 10136.4 | 13765.1 |
C fill | 43884.4 | 43888.9 |
C fill (shuffle within 16 byte blocks) | 43885.4 | 43891.8 |
C fill (shuffle within 32 byte blocks) | 43884.2 | 43888.6 |
C fill (shuffle within 64 byte blocks) | 43877.5 | 43875.3 |
standard memset | 44745.2 | 44758 |
NEON STP fill | 44744.7 | 44755.1 |
NEON STNP fill | 44745.2 | 44749.3 |
ARM STP fill | 44731.7 | 44723.3 |
ARM STNP fill | 44730 | 44705.9 |
Except for the last 9 tests, my machine seems to be outperforming yours by about 20%. Also attaching a graphical representation of the same.
Note: The last 9 tests have not been mapped in the graph since they are within acceptable range of each other.
@geerlingguy The next part of the test was the memory latency test. The results are as below
Run 1 with MADV_NOHUGEPAGE
block size (bytes) | JG single read (ns) | JG dual read (ns) | RB single read (ns) | RB dual read (ns) |
---|---|---|---|---|
1024 | 0 | 0 | 0 | 0 |
2048 | 0 | 0 | 0 | 0 |
4096 | 0 | 0 | 0 | 0 |
8192 | 0 | 0 | 0 | 0 |
16384 | 0 | 0 | 0 | 0 |
32768 | 0 | 0 | 0 | 0 |
65536 | 0 | 0 | 0 | 0 |
131072 | 1.3 | 1.8 | 1.3 | 1.8 |
262144 | 2.3 | 2.9 | 2.4 | 3 |
524288 | 3.2 | 3.9 | 3.4 | 3.9 |
1048576 | 3.6 | 4.2 | 4.1 | 4.5 |
2097152 | 22.9 | 33 | 17.8 | 24.8 |
4194304 | 32.6 | 40.9 | 25 | 30.5 |
8388608 | 38.1 | 43.5 | 30.1 | 35 |
16777216 | 43.2 | 48.6 | 37 | 45.7 |
33554432 | 86.2 | 112.2 | 71.2 | 93.7 |
67108864 | 109.3 | 135.2 | 91.4 | 112.1 |
Run 2 with MADV_HUGEPAGE
block size (bytes) | JG single read (ns) | JG dual read (ns) | RB single read (ns) | RB dual read (ns) |
---|---|---|---|---|
1024 | 0 | 0 | 0 | 0 |
2048 | 0 | 0 | 0 | 0 |
4096 | 0 | 0 | 0 | 0 |
8192 | 0 | 0 | 0 | 0 |
16384 | 0 | 0 | 0 | 0 |
32768 | 0 | 0 | 0 | 0 |
65536 | 0 | 0 | 0 | 0 |
131072 | 1.3 | 1.8 | 1.3 | 1.8 |
262144 | 1.9 | 2.3 | 1.9 | 2.4 |
524288 | 2.2 | 2.5 | 2.3 | 2.5 |
1048576 | 2.6 | 2.8 | 2.6 | 2.8 |
2097152 | 21.6 | 31.6 | 16.2 | 23.1 |
4194304 | 31.1 | 39.4 | 23.3 | 28.7 |
8388608 | 35.8 | 41.7 | 26.7 | 30.4 |
16777216 | 38.5 | 43 | 28.3 | 31.5 |
33554432 | 79.9 | 104.9 | 64.9 | 85.9 |
67108864 | 101.1 | 125.4 | 83.7 | 102.9 |
I mapped one of the runs into a graph as seen below
As seen with other latency benchmarks, we're on par when comparing L1 and L2 cache. The differences start popping up as we move from L2 cache to system memory. Once again my results are ~20% better (lower latency).
Wow, what a difference the memory seems to make!
I got 2 of the 6 new RAM sticks just now. Running HPL with N=50000, I see:
- Old Transcend RAM (2x16 GB): 279.22 Gflops
- New Samsung RAM (2x16 GB): 369.05 Gflops
Encouraging early result! The rest of the RAM is coming Monday...
And here are the new tinymembench results (NOTE: this is just for the 2x16 GB sticks; performance will differ once all the memory channels are filled...):
Click to expand tinymembench results
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 10261.8 MB/s
C copy backwards (32 byte blocks) : 10233.9 MB/s
C copy backwards (64 byte blocks) : 10238.1 MB/s
C copy : 10277.0 MB/s
C copy prefetched (32 bytes step) : 10403.7 MB/s
C copy prefetched (64 bytes step) : 10407.1 MB/s
C 2-pass copy : 7065.6 MB/s
C 2-pass copy prefetched (32 bytes step) : 8825.9 MB/s
C 2-pass copy prefetched (64 bytes step) : 9179.0 MB/s
C fill : 42770.6 MB/s (1.1%)
C fill (shuffle within 16 byte blocks) : 42675.3 MB/s
C fill (shuffle within 32 byte blocks) : 42755.8 MB/s (0.2%)
C fill (shuffle within 64 byte blocks) : 42587.5 MB/s
NEON 64x2 COPY : 10633.4 MB/s
NEON 64x2x4 COPY : 10679.9 MB/s
NEON 64x1x4_x2 COPY : 6380.2 MB/s (0.1%)
NEON 64x2 COPY prefetch x2 : 12576.1 MB/s
NEON 64x2x4 COPY prefetch x1 : 12767.1 MB/s
NEON 64x2 COPY prefetch x1 : 12462.2 MB/s
NEON 64x2x4 COPY prefetch x1 : 12763.3 MB/s
---
standard memcpy : 10582.3 MB/s
standard memset : 42988.5 MB/s (1.3%)
---
NEON LDP/STP copy : 10645.9 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 11909.5 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 11902.6 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 11816.3 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 11818.2 MB/s
NEON LD1/ST1 copy : 10690.8 MB/s
NEON STP fill : 43059.6 MB/s (1.2%)
NEON STNP fill : 43150.2 MB/s (0.3%)
ARM LDP/STP copy : 10711.8 MB/s
ARM STP fill : 43011.2 MB/s (1.1%)
ARM STNP fill : 43117.3 MB/s (0.2%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 2.4 ns / 2.9 ns
524288 : 3.4 ns / 3.9 ns
1048576 : 7.7 ns / 11.3 ns
2097152 : 20.5 ns / 29.5 ns
4194304 : 28.9 ns / 36.7 ns
8388608 : 35.7 ns / 41.7 ns
16777216 : 45.2 ns / 55.4 ns
33554432 : 74.5 ns / 95.5 ns
67108864 : 89.0 ns / 107.1 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 1.9 ns / 2.3 ns
524288 : 2.3 ns / 2.5 ns
1048576 : 2.6 ns / 2.8 ns
2097152 : 19.1 ns / 27.8 ns
4194304 : 27.6 ns / 35.0 ns
8388608 : 31.4 ns / 37.3 ns
16777216 : 33.6 ns / 38.6 ns
33554432 : 67.7 ns / 87.8 ns
67108864 : 80.6 ns / 97.5 ns
memcpy goes from 9894.0 to 10582.3 MB/s, a ~7% difference (again, with 2 sticks vs 6), while HPL goes from 279 to 369 Gflops, about a 32% improvement! Latency is vastly improved over the Transcend RAM as well.
Can't wait for the other sticks to arrive. I will finally pass the 'teraflop on a CPU' barrier :)
I have a Twitter (X?) thread going on about the memory differences. Going to also try to see if I can look up timing data in Linux via decode-dimms (CPU-Z under Windows on Arm isn't showing timing data).
Hmm...
$ sudo apt install -y i2c-tools
$ sudo modprobe eeprom
$ decode-dimms
# decode-dimms version 4.3
Memory Serial Presence Detect Decoder
By Philip Edelbrock, Christian Zuckschwerdt, Burkart Lingner,
Jean Delvare, Trent Piepho and others
Number of SDRAM DIMMs detected and decoded: 0
Hi Jeff,
I am not sure about the meaning of Run 1 and Run 2. I tried running tinymembench myself and saw [MADV_NOHUGEPAGE] for the first pass and then [MADV_HUGEPAGE] for the second, so did you run tinymembench twice with the same configuration, or did you run it once and get two results (HUGEPAGE and NOHUGEPAGE)?
Thanks
BR
ii-BOY
@ii-BOY Hi, this test was run just once.
Internally, tinymembench runs twice: once with THP disabled (MADV_NOHUGEPAGE) and once with THP enabled (MADV_HUGEPAGE).
You can find more information here : https://man7.org/linux/man-pages/man2/madvise.2.html
Thanks.
Note: Thanks for pointing out the omission in the descriptions for the Run 1 and Run 2 tables that were posted. I've edited them to reflect MADV_NOHUGEPAGE and MADV_HUGEPAGE respectively.
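For anyone curious what those two passes mean in practice, here's a minimal sketch (plain Python, assuming Linux and Python 3.8+ for mmap.madvise) of the kind of hints involved; it isn't tinymembench's actual code, just the same madvise flags applied to a test buffer:

```python
import mmap

SIZE = 64 * 1024 * 1024  # 64 MiB buffer, the largest block size in the tables above

buf = mmap.mmap(-1, SIZE)  # anonymous mapping we can hand hints to

# Ask the kernel NOT to back this region with transparent huge pages;
# this corresponds to the [MADV_NOHUGEPAGE] table.
buf.madvise(mmap.MADV_NOHUGEPAGE)

# The opposite hint corresponds to the [MADV_HUGEPAGE] table. With 2 MiB
# huge pages there are far fewer TLB entries to miss, which is why the
# large-block latencies are lower in the second table.
buf.madvise(mmap.MADV_HUGEPAGE)

buf.close()
```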
tinymembench run with all six sticks (96 GB total) of Samsung RAM:
Click to view tinymembench results
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 11416.8 MB/s
C copy backwards (32 byte blocks) : 11374.5 MB/s
C copy backwards (64 byte blocks) : 11380.7 MB/s
C copy : 11486.6 MB/s
C copy prefetched (32 bytes step) : 12074.4 MB/s
C copy prefetched (64 bytes step) : 12072.5 MB/s
C 2-pass copy : 7456.1 MB/s
C 2-pass copy prefetched (32 bytes step) : 8489.5 MB/s
C 2-pass copy prefetched (64 bytes step) : 8901.7 MB/s
C fill : 43888.0 MB/s
C fill (shuffle within 16 byte blocks) : 43888.0 MB/s
C fill (shuffle within 32 byte blocks) : 43888.3 MB/s
C fill (shuffle within 64 byte blocks) : 43882.9 MB/s
NEON 64x2 COPY : 12176.6 MB/s
NEON 64x2x4 COPY : 12229.0 MB/s
NEON 64x1x4_x2 COPY : 10022.1 MB/s
NEON 64x2 COPY prefetch x2 : 13542.4 MB/s
NEON 64x2x4 COPY prefetch x1 : 13902.0 MB/s
NEON 64x2 COPY prefetch x1 : 13579.6 MB/s
NEON 64x2x4 COPY prefetch x1 : 13903.2 MB/s
---
standard memcpy : 12107.0 MB/s
standard memset : 44746.4 MB/s
---
NEON LDP/STP copy : 12186.3 MB/s
NEON LDP/STP copy pldl2strm (32 bytes step) : 13778.2 MB/s
NEON LDP/STP copy pldl2strm (64 bytes step) : 13785.9 MB/s
NEON LDP/STP copy pldl1keep (32 bytes step) : 13847.4 MB/s
NEON LDP/STP copy pldl1keep (64 bytes step) : 13825.8 MB/s
NEON LD1/ST1 copy : 12242.3 MB/s
NEON STP fill : 44745.9 MB/s
NEON STNP fill : 44747.5 MB/s
ARM LDP/STP copy : 12298.1 MB/s
ARM STP fill : 44730.0 MB/s
ARM STNP fill : 44730.8 MB/s
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 2.4 ns / 2.9 ns
524288 : 3.4 ns / 3.9 ns
1048576 : 4.1 ns / 4.4 ns
2097152 : 23.2 ns / 33.2 ns
4194304 : 32.7 ns / 41.1 ns
8388608 : 39.7 ns / 46.2 ns
16777216 : 47.7 ns / 51.0 ns
33554432 : 81.6 ns / 103.5 ns
67108864 : 102.1 ns / 122.2 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 0.0 ns / 0.0 ns
131072 : 1.3 ns / 1.8 ns
262144 : 1.9 ns / 2.3 ns
524288 : 2.3 ns / 2.5 ns
1048576 : 2.6 ns / 2.8 ns
2097152 : 21.6 ns / 31.6 ns
4194304 : 31.4 ns / 39.4 ns
8388608 : 36.2 ns / 41.7 ns
16777216 : 38.5 ns / 43.0 ns
33554432 : 74.8 ns / 95.7 ns
67108864 : 93.6 ns / 112.0 ns
New result:
================================================================================
HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 105000
NB : 256
PMAP : Row-major process mapping
P : 8
Q : 12
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 105000 256 8 12 649.46 1.1883e+03
HPL_pdgesv() start time Mon Sep 11 20:21:22 2023
HPL_pdgesv() end time Mon Sep 11 20:32:11 2023
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 2.00850780e-03 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
1188.3 Gflops at 296W = 4.01 Gflops/W
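For anyone double-checking that arithmetic, here's a small sketch (plain Python; the values are copied from the run above, and the 296 W figure is the measured wall power) using the flop count HPL itself reports against:

```python
n = 105_000      # problem size (Ns)
time_s = 649.46  # wall time reported by HPL above
power_w = 296    # measured wall power during the run

# Flop count for LU factorization + solve, as HPL reports it: 2/3*N^3 + 3/2*N^2
flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
gflops = flops / time_s / 1e9
print(f"{gflops:.1f} Gflops")              # ~1188.3, matching the report above
print(f"{gflops / power_w:.2f} Gflops/W")  # ~4.01
print(f"{n**2 * 8 / 1e9:.1f} GB")          # ~88 GB for the N x N matrix of doubles
```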
It seems like my Samsung RAM still performs just a bit under whatever RAM @rbapat-ampere is using in his system, which would explain the delta!
I think this issue can be closed, as we've found the culprit.