About Performance Issues Using Accessors
AngryBear2 opened this issue · 13 comments
I found that using accessors is slower than using arrays directly. For example,
z[i][j] += x[i][k] * y[k][j];
is faster than
acc_out[Point<2>(j, i)] += acc_x[Point<2>(k, i)] * acc_y[Point<2>(j, k)];
This problem is particularly obvious when I try to write GEMM: if I create a two-dimensional array, read matrices A and B into it through the accessors, do the computation, and then write the result back to the output region, it is much faster than doing the matrix multiplication through the accessors directly. With the two-dimensional arrays, acc_A, acc_B, and acc_C each only need to be accessed N*N times, while when I use the accessors directly for the matrix multiplication, each accessor is accessed N*N*N times. For a 256 * 256 matrix multiplication this slows me down by more than ten times.
What's the reason for the slower accessors, and is it a good idea to read into an array and recalculate?
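For reference, the comparison can be written as a minimal standalone sketch (plain C++, no Legion; `Acc2D` here is a hypothetical stand-in for an affine accessor, not Legion's actual class):

```cpp
#include <cassert>

// Hypothetical stand-in for an affine accessor (NOT Legion's API): a 2-D
// index (i, j) maps to base[i*stride0 + j*stride1].
struct Acc2D {
    double* base;
    long stride0, stride1;  // strides in elements
    double& at(long i, long j) const { return base[i * stride0 + j * stride1]; }
};

// GEMM with direct indexing (row-major), mirroring z[i][j] += x[i][k] * y[k][j]
void gemm_direct(const double* x, const double* y, double* z, long n) {
    for (long i = 0; i < n; ++i)
        for (long j = 0; j < n; ++j)
            for (long k = 0; k < n; ++k)
                z[i * n + j] += x[i * n + k] * y[k * n + j];
}

// The same GEMM through the accessor-like wrapper; with optimizations on,
// the wrapper should inline away and both loops compile to similar code.
void gemm_acc(Acc2D x, Acc2D y, Acc2D z, long n) {
    for (long i = 0; i < n; ++i)
        for (long j = 0; j < n; ++j)
            for (long k = 0; k < n; ++k)
                z.at(i, j) += x.at(i, k) * y.at(k, j);
}
```

Both functions produce identical results; any timing difference comes purely from how well the compiler optimizes the wrapper away.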
By the way, why is the performance of
const FieldAccessor<READ_ONLY,double,1,coord_t,Realm::AffineAccessor<double,1,coord_t>>acc_x(regions[0], FID_X);
so much better than that of
const FieldAccessor<READ_ONLY,double,1> acc_x(regions[0], FID_X);
> Because if I use a two-dimensional matrix, acc_A, acc_B, and acc_C each only needs to be accessed N*N times, and when I use accessors directly for matrix multiplication, each accessor needs to be accessed N*N*N times. When I try a matrix multiplication of 256 * 256, it slows me down by more than ten times.
I don't understand this. Can you explain again? Using an accessor in the way that you've described should compile down to the same code as the direct array access.
One difference I see with your accessor statement is that you are accessing the arrays with different strides than in the non-accessor case, which greatly affects the locality that you are getting on your accesses (i.e. acc_out is being accessed (j, i), while z is being accessed (i, j), and so on).
AffineAccessor is much faster than a generic accessor because it is heavily templated to make sure that affine accessor accesses compile down into direct memory accesses.
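A toy illustration of that difference (hypothetical types, not Legion's actual implementation): a generic accessor must go through an indirect lookup per element, while an affine accessor is plain index arithmetic the optimizer can see through:

```cpp
#include <cassert>
#include <functional>

// Generic-style accessor (hypothetical): every access goes through an
// indirect call, which the compiler cannot inline or vectorize.
struct GenericAcc {
    std::function<double*(long)> lookup;
    double& operator[](long i) const { return *lookup(i); }
};

// Affine-style accessor (hypothetical): the access is plain arithmetic,
// so with optimizations on it compiles to a direct memory access.
struct AffineAcc {
    double* base;
    long stride;  // stride in elements
    double& operator[](long i) const { return base[i * stride]; }
};
```

With -O2, `AffineAcc::operator[]` typically inlines to a single load/store; the `std::function` dispatch in `GenericAcc` cannot.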
@rohany
Here's the thing, sir.
It seems that you can get better performance by taking the data out of the Accessor, calculating it, and then saving it back, than by using more Accessors.
I want to know what overhead the accessor adds compared to direct array access.
Even though the AffineAccessor is heavily templated, I am curious whether the compiler can vectorize the accesses; with a raw pointer, the compiler can apply vectorization more easily.
May I see the entire code you’re comparing, in particular the instantiation of the accessors?
There could be a few things wrong: you might have bounds checks turned on for accessors, or you might be mapping an instance that isn't laid out the way you expect (the default mapper allocates instances in column-major order).
The best thing to do here, though, is to look at the assembly of your loops. If everything is working as expected, the assembly should look nearly the same with and without accessors.
@eddy16112 I don't think the compiler is vectorizing the raw pointer; I'm using g++ with -O0 (no optimization).
The code looks like this, by contrast, just fetching the data from the accessor in gemm_task and calculating it.
What about the performance with -O2 or -O3? Accessors only get optimized away if you turn on optimizations.
As I mentioned before, these two codes aren't exactly comparable either: you're making accesses with different strides / in different index orders between the accessor version and the non-accessor version.
I'm going to try -O2 or -O3. Also, I think acc_out[Point<2>(j, i)] corresponds to out[i][j], because when I walk a raw pointer (ptr++) over the two-dimensional array, it moves from acc_out[Point<2>(j, i)] to acc_out[Point<2>(j+1, i)].
@rohany Yes! With -O2 I already get a huge performance boost, very close to direct array computation, although still a little slower. Thank you very much, you've helped me a lot.
> I think acc_out[Point<2>(j, i)] corresponds to out[i][j], because when I walk a raw pointer (ptr++) over the two-dimensional array, it moves from acc_out[Point<2>(j, i)] to acc_out[Point<2>(j+1, i)].
As I mentioned above, the default mapper (which I assume you're using) allocates instances in Fortran order, i.e. column-major order. If you change the line https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/mappers/default_mapper.cc?ref_type=heads#L2253 to be reversed, like https://github.com/rohany/taco/blob/DISTAL/DISTAL/runtime/taco_mapper.cpp#L391, then the instance will be allocated in row-major order, where acc_out[Point<2>(i, j)] will give you the memory location you expect.
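A small sketch of the two orderings (names and extents are illustrative, not Legion code):

```cpp
#include <cassert>

// Column-major (Fortran) order: the FIRST coordinate is contiguous, so
// Point<2>(j, i) and Point<2>(j+1, i) land in adjacent memory locations.
long offset_col_major(long c0, long c1, long extent0) {
    return c0 + c1 * extent0;
}

// Row-major (C) order: the LAST coordinate is contiguous, so with this
// layout acc[Point<2>(i, j)] matches the usual out[i][j] expectation.
long offset_row_major(long c0, long c1, long extent1) {
    return c0 * extent1 + c1;
}
```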
@rohany Well, I know this thing, my benefactor, I wish you health and happiness every day.
A couple of comments about (affine) accessor performance:
- With optimizations off, accessors will be much slower than using pointers. There are several layers of anonymous objects that get created by accessors to support the array indexing syntax. With optimizations on then they all get optimized away, but with optimizations off, you'll be paying for several object constructors and destructors on the stack and several function calls for each access.
- If you're going to compare accessor performance to using raw pointers, make sure your strides are actually dynamic values in the non-accessor code. If the compiler can see constants it will constant fold them away and then you're not really comparing apples-to-apples because the strides will still look dynamic in an accessor. If you know your strides are compile-time constants you should probably get a raw pointer and generate loops with constant strides as the compiler will generate faster code. We don't have a way today to specify static strides in our accessors since they are designed to be portable across different layouts. If you want to bake in assumptions about layouts into the code that is what the raw pointer support is for. I wouldn't complain if someone wants to go build an accessor that supports compile-time constant strides and make a merge request, but it's not at the top of the priority list for Legion right now.
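To illustrate the point about strides (a plain-C++ sketch, not Legion code; the leading dimension 256 is an arbitrary assumption): a compile-time constant stride lets the compiler fold the index math, while a run-time stride, which is what an accessor's stored strides look like, cannot be folded:

```cpp
#include <cassert>

constexpr long LD = 256;  // compile-time leading dimension (illustrative)

// Constant stride: the compiler sees LD and can constant-fold / vectorize.
void addone_const(double (*a)[LD], long rows) {
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < LD; ++j)
            a[i][j] += 1.0;
}

// Dynamic stride: ld is only known at run time, so the generated code must
// keep it in a register, just like an accessor's member strides.
void addone_dyn(double* a, long rows, long ld) {
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < ld; ++j)
            a[i * ld + j] += 1.0;
}
```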
- For multi-dimensional accessors, I've noticed that even with optimizations on, compilers usually will not automatically apply the strength-reduction optimization to the pointer arithmetic (despite the fact that the accessor object, and therefore its member strides, are marked const), and therefore it ends up doing a full inner product of the indexing point with the strides on each iteration of a (nested) loop. It's quite unfortunate that this occurs. I've noticed the same behavior on gcc, clang, and icc, so something about our code is preventing the strength-reduction optimization, but I have no idea what it is.
- With optimizations on (-O3), I have seen the compiler automatically vectorize code that uses accessors, although I personally have only seen this when using 1-D accessors. (To confirm vectorization you have to look at the generated assembly code.) I think the multi-dimensional accessors have problems with the strength-reduction optimization mentioned above, which is a precursor to vectorization (e.g. knowing that you can do vectorized loads and stores).
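The missed strength reduction can be written out by hand (a plain-C++ sketch of the transformation, not Legion code):

```cpp
#include <cassert>

// Naive form: a full inner product of the index with the strides on every
// iteration, which is roughly what compilers emit for multi-D accessors.
void fill_naive(double* base, long s0, long s1, long n0, long n1) {
    for (long i = 0; i < n0; ++i)
        for (long j = 0; j < n1; ++j)
            base[i * s0 + j * s1] = 1.0;
}

// Strength-reduced by hand: hoist the row pointer out of the inner loop and
// bump it by s1 each iteration, replacing the multiplies with additions.
void fill_reduced(double* base, long s0, long s1, long n0, long n1) {
    for (long i = 0; i < n0; ++i) {
        double* p = base + i * s0;
        for (long j = 0; j < n1; ++j, p += s1)
            *p = 1.0;
    }
}
```

The reduced form is the code shape the compilers above fail to produce automatically; writing loops this way by hand (via a raw pointer) recovers the performance.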