vetter/shoc

MIC MaxFlops Doesn't Match Peak Performance

jyoung3131 opened this issue · 3 comments

The MIC version of MaxFlops reports a value that seems to be too high by a factor of 10.

Running benchmark MaxFlops
result for maxspflops: 18065.5000 GFLOPS
result for maxdpflops: 13108.8000 GFLOPS

From Reza's blog, the max performance should be somewhere around the theoretical peaks of 2112 GFLOPS (SP) and 1056 GFLOPS (DP).
https://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture

This could possibly just be a reporting error.
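For reference, the two peak figures quoted above can be reproduced with the usual peak formula (cores × clock × SIMD lanes × 2 for fused multiply-add). This is only a sketch: the 60 cores and 1.1 GHz clock are assumptions chosen because they reproduce the 2112 and 1056 GFLOPS figures, and the function names are hypothetical, not part of SHOC.

```c
/* Theoretical peak in GFLOPS = cores * clock (GHz) * SIMD lanes * 2.
 * The factor 2 counts a fused multiply-add as two floating-point operations.
 * Assumed lane counts for a 512-bit vector unit: 16 (SP) or 8 (DP). */
double peak_gflops(int cores, double clock_ghz, int simd_lanes)
{
    return cores * clock_ghz * simd_lanes * 2.0;
}

/* Ratio of a measured result to the theoretical peak. A value well above
 * 1.0 points to a measurement or reporting problem. */
double ratio_to_peak(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

With these assumed inputs, `peak_gflops(60, 1.1, 16)` gives about 2112 and `peak_gflops(60, 1.1, 8)` gives about 1056. The reported 18065.5 and 13108.8 GFLOPS are then roughly 8.6 and 12.4 times the respective peaks.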

Hi Jeff, besides checking the reporting code, the next thing to do is to ensure that the compiler is not optimizing away any of the operations. We have run into overestimations of performance in maxflops because of this before.

This does indeed seem to be due to optimization of operations. MAdd8 is close but not over the theoretical max (2K GFLOPS), but all the Add operations seem to be affected by removal of ops. The driver is picking up these inaccurate results as the max for the device.

I tested with the -O1 and -O2 flags, and the overestimation occurs when -O2 or -O3 is used. Unfortunately, not using -O2 makes the code far too slow (possibly because other vector-related optimizations for the Phi are lost). I am currently testing some tricks to keep the compiler from optimizing out operations.
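One common trick of this kind, sketched here as a generic illustration and not as the change actually made in SHOC, is to make the inputs opaque to the compiler and to store the final result to a volatile variable so the arithmetic has an observable effect. The names `flops_sink` and `add_chain` are hypothetical. Note that this only guards against dead-code elimination and constant folding; under a relaxed floating-point model a compiler may still simplify the chain algebraically.

```c
/* Volatile sink: a store to it is an observable side effect, so the
 * compiler has to produce the value that is stored. */
volatile float flops_sink;

/* Performs `iters` dependent additions and returns the result.
 * The inputs are copied through volatile locals so the compiler cannot
 * treat them as compile-time constants and fold the loop away. */
float add_chain(float seed, float step, long iters)
{
    volatile float vseed = seed;
    volatile float vstep = step;
    float x = vseed;
    float b = vstep;

    for (long i = 0; i < iters; i++) {
        x = x + b;          /* loop-carried dependency on x */
    }

    flops_sink = x;         /* consume the result */
    return x;
}
```

A benchmark would time a call such as `add_chain(0.0f, 1.0f, n)` and count `n` additions, then check the returned value to confirm the work was done.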

5110P results for Phi (GFLOPS):
Add1-DP: 53.6607
Add1-SP: 60.8723
Add2-DP: 12542.6
Add2-SP: 16056.9
Add4-DP: 12516.6
Add4-SP: 17495.2
Add8-DP: 12820.2
Add8-SP: 18006.6
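As a stopgap on the reporting side, a driver could ignore any result that exceeds the theoretical peak when it picks the device maximum. The sketch below is hypothetical (it is not SHOC's driver code, and `max_plausible_gflops` is an invented name), and it only hides the symptom; it does not fix the kernels that are being optimized away.

```c
#include <stddef.h>

/* Returns the largest value in results[0..n-1] that does not exceed
 * peak_gflops, or -1.0 if no value qualifies. */
double max_plausible_gflops(const double *results, size_t n, double peak_gflops)
{
    double best = -1.0;

    for (size_t i = 0; i < n; i++) {
        if (results[i] <= peak_gflops && results[i] > best) {
            best = results[i];
        }
    }
    return best;
}
```

Applied to the four Add DP results listed above with a peak of 1056 GFLOPS, only Add1-DP (53.6607) survives the filter.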

I've got code from a book that is supposed to get over 2 TFLOPS on the Xeon Phi in single precision.

Are we locked into getting the max flops using a particular maxflops algorithm?

I've got the entire code, but the meat of it looks like this:

#pragma omp parallel for private(j,k)
for (i=0; i<numthreads; i++) {
    int offset = i*LOOP_COUNT;
    for(j=0; j<MAXFLOPS_ITERS; j++) {
        for(k=0; k<LOOP_COUNT; k++)
            fa[k+offset]=a*fa[k+offset]+fb[k+offset];
    }
}

This is what the author reports on 61 cores:

% export OMP_NUM_THREADS=122
% export KMP_AFFINITY=scatter
% ./helloflops3
GFlops = 3123.200, Secs = 1.530, GFlops per sec = 2041.090
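The reported totals can be cross-checked by hand. The sketch below assumes each innermost iteration counts as 2 floating-point operations (a multiply and an add), and assumes LOOP_COUNT = 128 and MAXFLOPS_ITERS = 100000000. Those two constants are not given in this thread; they are used here because, with 122 threads, they reproduce the reported 3123.200. The function names are hypothetical.

```c
/* Total work of the loop nest in units of 1e9 floating-point operations.
 * Assumes 2 operations (one multiply, one add) per innermost iteration. */
double total_gflop(int numthreads, long loop_count, long maxflops_iters)
{
    const double flops_per_calc = 2.0;

    return 1.0e-9 * numthreads * (double)loop_count
                  * (double)maxflops_iters * flops_per_calc;
}

/* Sustained rate in GFLOPS: total work divided by elapsed seconds. */
double gflops_rate(double gflop, double seconds)
{
    return gflop / seconds;
}
```

`total_gflop(122, 128, 100000000L)` gives 3123.2, and dividing by the rounded 1.530 seconds gives about 2041.3, which agrees with the reported 2041.090 to within the rounding of the printed time.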

Mitch

----- Original Message -----

From: "jyoung3131" notifications@github.com
To: "vetter/shoc" shoc@noreply.github.com
Sent: Tuesday, October 14, 2014 4:08:18 PM
Subject: Re: [shoc] MIC MaxFlops Doesn't Match Peak Performance (#37)
