geerlingguy/top500-benchmark

Benchmark Ampere Altra Developer Platform - 96 core 2.8 GHz ARM64

geerlingguy opened this issue · 13 comments

As the title says...

With Ps: 1 and Qs: 96:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   70717
NB     :     256
PMAP   : Row-major process mapping
P      :       1
Q      :      96
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       70717   256     1    96             625.64             3.7685e+02
HPL_pdgesv() start time Mon Apr 17 09:47:19 2023

HPL_pdgesv() end time   Mon Apr 17 09:57:44 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   7.57859825e-04 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Used 220W average (with a few spikes to 236W, but only briefly, and some dips down to 216W). 1.71 Gflops/W
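As a quick sanity check on the efficiency figure, the Gflops/W number is just the reported Gflops divided by the measured wall power. The sketch below uses only the values from this run; the function name `gflops_per_watt` is made up for illustration.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency: sustained Gflops divided by wall power in watts."""
    return gflops / watts


# Values from the run above: 3.7685e+02 Gflops, 220 W average,
# 236 W at the highest spike, 216 W at the lowest dip.
gflops = 376.85
for label, watts in [("average", 220), ("spike", 236), ("dip", 216)]:
    print(f"{label:>7}: {gflops_per_watt(gflops, watts):.2f} Gflops/W")
```

The average-power line reproduces the 1.71 Gflops/W quoted above; the spike and dip lines only bracket how much the figure would move if a different power reading were used.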

With Ps: 4 and Qs: 24 (since I believe the die is subdivided into 4 quadrants):

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   70717
NB     :     256
PMAP   : Row-major process mapping
P      :       4
Q      :      24
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       70717   256     4    24             586.68             4.0188e+02
HPL_pdgesv() start time Mon Apr 17 10:14:23 2023

HPL_pdgesv() end time   Mon Apr 17 10:24:10 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   6.70141896e-04 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Used 200W average (with a few spikes to 215W, and some dips down to 196W). 2.01 Gflops/W
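Both grids tried so far (1 x 96 and 4 x 24) multiply out to the 96 cores on this CPU. A minimal sketch for listing the other Ps/Qs combinations that also give 96, assuming one MPI process per core and keeping Ps no larger than Qs (an assumption of this sketch, not something the runs above require):

```python
def process_grids(nprocs: int) -> list[tuple[int, int]]:
    """Return every (P, Q) pair with P * Q == nprocs and P <= Q."""
    grids = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            grids.append((p, nprocs // p))
        p += 1
    return grids


for p, q in process_grids(96):
    print(f"Ps: {p:2d}  Qs: {q:2d}")
```

This lists six candidates for 96 processes, including the 1 x 96 and 4 x 24 layouts already benchmarked here.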

After some suggestions from @rbapat-ampere over in geerlingguy/sbc-reviews#19 (comment), I'm going to be re-testing with a few different parameters, to get a feel for how things change:

Test matrix:

| RAM | Ps / Qs | BLIS library | Benchmark Result | Power Consumption |
| --- | --- | --- | --- | --- |
| 64 GB (4x 16 GB) | 4 / 24 | default | 401.88 Gflops | 202W |
| 64 GB (4x 16 GB) | 8 / 12 | default | 394.15 Gflops | 200W |
| 96 GB (6x 16 GB) | 4 / 24 | default | 600.63 Gflops | 235W |
| 96 GB (6x 16 GB) | 8 / 12 | default | 582.90 Gflops | 232W |
| 96 GB (6x 16 GB) | 8 / 12 | ampere-optimized | 985.02 Gflops | 270W |

Note: For power consumption, I compared a Sonoff S31 power outlet adapter and a Kill-A-Watt power meter, and re-ran all tests on both. They were within 2W in spot measurements, and within 1W in averages over a 1 minute time period.
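To put the test matrix in relative terms, the ratios between rows can be computed directly. This is only arithmetic on the Gflops figures reported above; the dictionary keys and the `speedup` helper are illustrative names, not part of any tool used in this issue.

```python
# Gflops figures copied from the test matrix above, keyed by
# (RAM, "Ps/Qs", BLIS build). The key layout is just for this sketch.
results = {
    ("64 GB", "4/24", "default"): 401.88,
    ("64 GB", "8/12", "default"): 394.15,
    ("96 GB", "4/24", "default"): 600.63,
    ("96 GB", "8/12", "default"): 582.90,
    ("96 GB", "8/12", "ampere-optimized"): 985.02,
}


def speedup(new_key, old_key):
    """Ratio of two results from the table (values above 1.0 mean faster)."""
    return results[new_key] / results[old_key]


ram_gain_4x24 = speedup(("96 GB", "4/24", "default"), ("64 GB", "4/24", "default"))
ram_gain_8x12 = speedup(("96 GB", "8/12", "default"), ("64 GB", "8/12", "default"))
blis_gain = speedup(("96 GB", "8/12", "ampere-optimized"), ("96 GB", "8/12", "default"))

print(f"64 GB -> 96 GB at 4/24: {ram_gain_4x24:.2f}x")
print(f"64 GB -> 96 GB at 8/12: {ram_gain_8x12:.2f}x")
print(f"default -> ampere-optimized BLIS at 8/12: {blis_gain:.2f}x")
```

Read this way, going from 64 GB to 96 GB of RAM is worth roughly 1.5x on either grid, and the optimized BLIS adds roughly another 1.7x on top at 8/12.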

Trying to get the Ampere-optimized HPL run to work, but currently running into issues: AmpereComputing/HPL-on-Ampere-Altra#3

I was originally going to test things by swapping their library into my install, but decided to just follow the docs in that repo end to end.

Result with the ampere-optimized setup following these instructions:

root@adlink-ampere:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun -np 96 --bind-to core --map-by core --allow-run-as-root ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  100000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8    12             676.82             9.8502e+02
HPL_pdgesv() start time Thu Jun 15 16:18:54 2023

HPL_pdgesv() end time   Thu Jun 15 16:30:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.01180641e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Power averaged 270W (spiking to about 272W, dipping to 245W). Efficiency: 3.65 Gflops/W
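For anyone wanting to cross-check the Gflops column against N and Time, HPL's reported rate is consistent with the usual LU operation count of 2/3 N^3 + 3/2 N^2 floating-point operations divided by the wall time. A small sketch of that arithmetic using the N and Time values from the run above (the function name `hpl_gflops` is just for this example):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Gflops implied by solving an order-n system in the given wall time,
    using the 2/3 n^3 + 3/2 n^2 operation count."""
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


# N and Time taken from the result line above (N = 100000, 676.82 s).
print(f"{hpl_gflops(100000, 676.82):.2f} Gflops")
```

The result lands on the 9.8502e+02 figure in the log, within the rounding of the printed time.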

Just noting that my M1 Max results may also improve with a native BLAS library; see, for example: JuliaLang/julia#42312

@geerlingguy Glad to see the jump in scores. But we are still leaving some performance on the table.
On the Ampere Altra Developer Platform I was able to get 1253 GFlops for HPL using the optimized BLIS.
Here's a screen cap from the document:

image

@rbapat-ampere - Interesting... I ran with Ns set to 100000 and Ps/Qs of 8/12, and that is how I got the 985.02 Gflops. Can you think of anything else I might've missed? I followed the instructions from here exactly, and ran them all on a fresh Ubuntu 22.04 Server install.

@geerlingguy
I rebuilt and reran HPL + optimized BLIS from scratch (using the instructions) on an AADP with a fresh Ubuntu install, and got scores very similar to my previous results.
Here are my current scores:

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  105000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      105000   256     8    12             631.81             1.2215e+03
HPL_pdgesv() start time Fri Jun 16 14:10:59 2023

HPL_pdgesv() end time   Fri Jun 16 14:21:31 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.00850780e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Testbench details:
OS: 22.04.2 LTS (Jammy Jellyfish), Desktop image
Kernel: 5.19.0-42-generic
GCC toolchain: 12.3.0
openmpi: 4.1.4
Memory used during the test: 89 GB
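The memory figure lines up with the problem size: an N x N matrix of 8-byte doubles dominates HPL's footprint, so the size of the matrix alone can be estimated from N. A rough sketch, using decimal gigabytes (whether the figure reported above is in GB or GiB is not stated, so treat the comparison as approximate):

```python
def matrix_gb(n: int) -> float:
    """Size of an n x n double-precision matrix in decimal gigabytes."""
    return 8 * n * n / 1e9


# Problem sizes that appear in the runs in this thread.
for n in (70717, 100000, 105000):
    print(f"N = {n:6d}: about {matrix_gb(n):.1f} GB")
```

For N = 105000 this comes to about 88.2 GB, close to the memory use reported for this run, and it also shows why the N = 100000 and N = 105000 runs would not fit in a 64 GB configuration.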

@rbapat-ampere - I shall reformat and run it again :)

Can you also confirm what RAM layout you're using? Is it 6x 16 GB sticks, or something else? That seems to have an outsize effect on the results.

It seems like the RAM vendor is the only major difference—I'm running industrial-type Transcend RAM, and @rbapat-ampere is running Samsung... I may need to change vendors and see if that gets our numbers more in line (stranger things have happened!).

I'm also planning on re-testing on a 128 core CPU soon too...

New result is 1188.3 Gflops at 296W, for 4.01 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  105000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      12 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      105000   256     8    12             649.46             1.1883e+03
HPL_pdgesv() start time Mon Sep 11 20:21:22 2023

HPL_pdgesv() end time   Mon Sep 11 20:32:11 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   2.00850780e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Closing this as we have a result!