[website] Expected performance results on website are mostly wrong

Question

Closed this issue 2 months ago · 2 comments

https://mflowcode.github.io/documentation/md_expectedPerformance.html

Most of these numbers are incorrect. It's unclear where things went wrong. @wilfonba and I already confirmed that the A100 numbers are incorrect.

One comment: this page should include an example of exactly how to run the performance test locally. For example, the command ./mfc.sh run -n 8 -j 8 ./examples/3D_performance_test/case.py --case-optimization -t pre_process simulation (or some such) for CPU, with --gpu added for GPU cases.
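As a rough sketch of what such an example could look like, the snippet below only assembles the command quoted above and prints it, without running anything. The helper name perf_test_command and its parameters are hypothetical; the flags are copied from the comment as written, and appending --gpu at the end of the line is an assumption about where it belongs.

```python
import shlex


def perf_test_command(n: int = 8, j: int = 8, gpu: bool = False) -> list[str]:
    """Assemble the argv quoted in the issue (hypothetical helper, nothing is executed)."""
    cmd = [
        "./mfc.sh", "run",
        "-n", str(n),
        "-j", str(j),
        "./examples/3D_performance_test/case.py",
        "--case-optimization",
        "-t", "pre_process", "simulation",
    ]
    if gpu:
        # Assumption: --gpu is simply appended for GPU cases.
        cmd.append("--gpu")
    return cmd


cpu_cmd = perf_test_command()
gpu_cmd = perf_test_command(gpu=True)

# Print both variants so they can be copied into a shell.
print(shlex.join(cpu_cmd))
print(shlex.join(gpu_cmd))
```

Keeping the CPU and GPU variants in one place makes it harder for the two documented commands to drift apart.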

I ran the 3D_performance_test example with 1M, 4M, and 8M grid points on my M1 Max using 8 cores and gfortran 14.1.0, and got:

1M GPs (100^3): Performance: 74.107741811522786 ns/gp/eq/rhs
4M GPs (159^3): Performance: 70.347097355807136 ns/gp/eq/rhs
8M GPs (200^3): Performance: 71.969625308176333 ns/gp/eq/rhs

which is about 5x faster than what's on the website for the M2 chip. I know the M1 Max is probably faster than the M2 for this workload, but not 5x faster. Again, @wilfonba replicated this problem on NVIDIA A100s as well. These results should all be updated.

We can remove the Summit performance results in favor of generic V100 test results. We also don't need to have 1M, 4M, and 8M grid point cases, since the numbers are so similar regardless. I think we should just converge on 8M grid points (a 200^3 simulation) for all performance tests, which is big enough to be meaningful but not so big that it overwhelms the memory of any real device.
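To make the "numbers are so similar" point concrete, here is a small sketch that recomputes the grid sizes and the relative spread of the three M1 Max timings quoted above. The variable names are illustrative; the values are copied from the results in this post.

```python
# Cube edge length -> measured performance in ns/gp/eq/rhs (M1 Max, 8 cores).
runs = {
    100: 74.107741811522786,
    159: 70.347097355807136,
    200: 71.969625308176333,
}

# Show how many grid points each cube edge length corresponds to.
for edge, ns in runs.items():
    print(f"{edge}^3 = {edge**3:>9,d} grid points: {ns:.2f} ns/gp/eq/rhs")

# Relative spread: (max - min) divided by the mean of the three timings.
values = list(runs.values())
mean = sum(values) / len(values)
spread = (max(values) - min(values)) / mean
print(f"relative spread across grid sizes: {spread:.1%}")
```

The spread comes out to a few percent, which supports reporting a single 200^3 case.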

Open to other suggestions!

Answer 1 · 2024-07-13T15:24:55.000Z

I'm gathering some more info, all using 8M grid points. This is everything I have. I didn't run a test on Frontier, but we should also update that number.

Intel Xeon Gold 6226 CPU (Cascade Lake) @ 2.70GHz (on Phoenix), 12 core CPU, best performance using 12 cores, Intel oneAPI 2022.1.0

Performance: 151.599077472947 ns/gp/eq/rhs

AMD EPYC 7713 (Milan) 64-Core CPU, best performance using 32 cores. gcc12.1.0

Performance: 137.48353539352445 ns/gp/eq/rhs

M1 Max, 8 Cores. gcc14.1

Performance: 71.969625308176333 ns/gp/eq/rhs

RTX6000 (single-precision GPU upconverting to DP in software) @ Phoenix, NVHPC 22.11

Performance: 3.851041689413657 ns/gp/eq/rhs

A40 (single-precision GPU upconverting to DP in software) @ NCSA Delta, NVHPC 22.11

Performance: 3.316569112456631 ns/gp/eq/rhs

MI250X 1 GCD, CCE16.0.1

Performance: 1.0871197509246793 ns/gp/eq/rhs

A30 @ RG, NVHPC 24.1

Performance: 1.055906093866407 ns/gp/eq/rhs

V100-32GB @ Phoenix, NVHPC 24.5

Performance: 0.9892712201437496 ns/gp/eq/rhs

A100-80GB @ Phoenix, NVHPC 22.11

Performance: 0.6163026871295073 ns/gp/eq/rhs

H100 80GB PCIe @ Rogues Gallery, NVHPC 24.5

Performance: 0.4362547841810634 ns/gp/eq/rhs

GH200 @ Rogues Gallery, NVHPC 24.1, (only the GPU is used)

Performance: 0.3201266592472489 ns/gp/eq/rhs
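For anyone collecting these into the website table, here is a minimal sketch that parses the "Performance:" lines and ranks the devices, with a speedup column relative to the Xeon result. The device labels are shortened by hand and the helper parse_performance is hypothetical; the numbers are copied from the list above.

```python
import re

# Matches the output line format shown above, e.g. "Performance: 1.23 ns/gp/eq/rhs".
PATTERN = re.compile(r"Performance:\s*([0-9.]+)\s*ns/gp/eq/rhs")


def parse_performance(line: str) -> float:
    """Extract the ns/gp/eq/rhs value from one output line (hypothetical helper)."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"no performance value found in: {line!r}")
    return float(match.group(1))


# Shortened device labels -> output lines as reported in this thread.
results = {
    "Xeon Gold 6226, 12 cores": "Performance: 151.599077472947 ns/gp/eq/rhs",
    "EPYC 7713, 32 cores": "Performance: 137.48353539352445 ns/gp/eq/rhs",
    "M1 Max, 8 cores": "Performance: 71.969625308176333 ns/gp/eq/rhs",
    "RTX6000": "Performance: 3.851041689413657 ns/gp/eq/rhs",
    "A40": "Performance: 3.316569112456631 ns/gp/eq/rhs",
    "MI250X, 1 GCD": "Performance: 1.0871197509246793 ns/gp/eq/rhs",
    "A30": "Performance: 1.055906093866407 ns/gp/eq/rhs",
    "V100-32GB": "Performance: 0.9892712201437496 ns/gp/eq/rhs",
    "A100-80GB": "Performance: 0.6163026871295073 ns/gp/eq/rhs",
    "H100 80GB PCIe": "Performance: 0.4362547841810634 ns/gp/eq/rhs",
    "GH200": "Performance: 0.3201266592472489 ns/gp/eq/rhs",
}

parsed = {name: parse_performance(line) for name, line in results.items()}
baseline = parsed["Xeon Gold 6226, 12 cores"]

# Lower ns/gp/eq/rhs is faster, so sort ascending.
ranked = sorted(parsed.items(), key=lambda item: item[1])
for name, ns in ranked:
    print(f"{name:<26} {ns:9.3f} ns/gp/eq/rhs  {baseline / ns:7.1f}x vs Xeon")
```

Sorting by the raw metric keeps the table honest about ordering even when compiler versions differ between rows.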

Answer 2 · 2024-07-24T17:56:53.000Z

I want to add A40 and RTX____ to this list (single-precision GPUs that convert to DP in software).

Update: Added. I also want to add MI100 and MI210 if possible; working on it.