Discussion: Proposal for a major redesign of PyOP2
The following is a summary of my thoughts thus far regarding how we can redesign PyOP2 to enable new features/better performance. I think this should be seen as a first step along the way to having cool new features like:
- More generic abstractions for semi-structured meshes (e.g. adding support for cube sphere meshes)
- Iteration set composition (e.g. loop tiling, patch iterations)
## Key proposal
We should make PyOP2 aware of the mesh topology by introducing a new `Mesh` class, as well as informing `Dat` objects of their data layout via a PETSc `Section`.
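A very rough sketch of what these two objects might carry, written in Python. The class and attribute names are purely illustrative (nothing here is an existing PyOP2 or PETSc API), and the PETSc Section is only referenced in a comment:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class Mesh:
    """Hypothetical topology object: records the arity of each entity relation,
    e.g. {("cell", "edge"): 3, ("cell", "vertex"): 3, ("edge", "vertex"): 2}
    for a triangle mesh."""
    arities: Dict[Tuple[str, str], int] = field(default_factory=dict)

@dataclass
class Dat:
    """Hypothetical data object: its layout would come from a PETSc Section
    (e.g. a petsc4py PETSc.Section) rather than from explicit maps."""
    data: Any
    section: Any
```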
## Justification
This change allows us to make a number of improvements to PyOP2 including:
- It gives us a much cleaner interface as we do not need to pass maps around since these can be automatically derived from the mesh object and sections.
- It lets us get away with fewer indirections every time we pack data (described below).
## Implementation
### Parloops
The interface for executing parloops should change from:
```python
parloop(kernel, iterset, dat0(map0), dat1(map0), ...)
```
To something like:
```python
parloop(mesh, kernel, iterset, dat0, dat1, ...)
```
Note that all of the maps have been removed and we instead have a new `mesh` argument.
Edit: This is not quite right since there are cases where, for example, we have a dat defined over the closure of a cell but only want to access the data stored on the cell itself and not on the edges/vertices. An application of this would be zeroing a dat, where we want to touch every entity exactly once.
This means that the API would need to look something like:
```python
parloop(mesh, kernel, CELLS, dat0(CELL_CLOSURE), ...)
```
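To make the cell-only versus closure access distinction concrete, here is a minimal runnable Python sketch. Everything in it (the `Access` enum, the `parloop` stub, the string placeholders) is invented purely for illustration and is not an existing or agreed PyOP2 interface:

```python
from enum import Enum, auto

class Access(Enum):
    """Hypothetical access descriptors for a dat inside a parloop."""
    POINT = auto()    # only the data stored on the iteration entity itself
    CLOSURE = auto()  # the data stored on the full closure of the entity

def parloop(mesh, kernel, iterset, *args):
    """Stub standing in for the proposed interface: a real implementation would
    derive the maps from `mesh` and each dat's Section before packing."""
    for dat, access in args:
        print(f"{kernel}: access {dat!r} via {access.name} while iterating over {iterset}")

# Assembly-style loop over cells, gathering dat0 through each cell's closure:
parloop("mesh", "assemble_kernel", "cells", ("dat0", Access.CLOSURE))

# Zeroing dat0: iterate over every entity it stores data on, touching each once:
parloop("mesh", "zero_kernel", "all points", ("dat0", Access.POINT))
```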
### The mesh
The new mesh object has a number of responsibilities:
- It should know the arities of the mappings between the different entity sets in the topology. For example, it should know that, for a triangle mesh, there are 3 edges and 3 vertices for every cell and 2 vertices for every edge. This is required for code generation.
- It should be able to take a PETSc `Section` (attached to a dat) and generate a map between entities in the iteration set and array offsets. This would be done by composing the iteration set -> dat base set and dat base set -> dat entity set maps from the plex with the dat entity set -> offsets map coming from the `Section`.
Note: Here we define two different types of set relating to the dat: the base set, which describes the 'entry point' to the dat, and the entity set, which describes the set of entities for which data is stored. To provide an example, a dat with data defined over the closure of a cell would have its base set as the cells of the mesh, and its entity set as all mesh entities for which it stores data (e.g. edges + vertices). We would therefore need to compose a map from the iteration set -> cells with a map from cells -> vertices + edges.
Note: When composing two plex maps (e.g. interior facets -> cells -> dat entities), we would simply take a multiplicative approach and multiply the arities. For the interior facet case on triangles with data kept on vertices this would result in a map from interior facets -> vertices with arity 2x3=6.
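As a concrete illustration of these notes, here is a small self-contained Python sketch of composing two maps and of the multiplicative arity rule. The plain lists merely stand in for the plex maps; none of the names refer to real PyOP2 or PETSc objects.

```python
# Interior facets -> cells: each interior facet has 2 incident cells (arity 2)
facet_to_cell = [[0, 1], [1, 2]]

# Cells -> vertices: each triangle has 3 vertices (arity 3)
cell_to_vert = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

def compose(outer, inner):
    """Compose two maps; the arity of the result is the product of the arities."""
    return [[tgt for mid in row for tgt in inner[mid]] for row in outer]

facet_to_vert = compose(facet_to_cell, cell_to_vert)
print(facet_to_vert)          # [[0, 1, 2, 1, 2, 3], [1, 2, 3, 2, 3, 4]]
print(len(facet_to_vert[0]))  # arity 2 * 3 = 6, as in the note above
```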
### Dats
For code generation, each `Dat` object should now store a section-like object describing the DoF count per entity. For example, a dat storing data for the triangle below would record `{CELL: 1, EDGE: 2, VERT: 1}`.
```
+
|\
+ +
| \
+ + +
|  \
+-+-+-+
```
These values correspond to the number of contiguous degrees of freedom stored on each entity (if we had any tensor shape then these numbers would just be multiplied by a scalar) and therefore they can all be packed/unpacked using a single indirection.
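As a small worked example in plain Python (the entity names and the tensor size are just for illustration), the scaling by a tensor shape looks like this:

```python
# DoFs stored per entity for the triangle above (the section-like object)
dofs_per_entity = {"CELL": 1, "EDGE": 2, "VERT": 1}

# A tensor shape, say 3 components per point, simply scales every count
tensor_size = 3
scaled = {entity: n * tensor_size for entity, n in dofs_per_entity.items()}
print(scaled)  # {'CELL': 3, 'EDGE': 6, 'VERT': 3}
```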
### Code generation
Since dats are now described by multiple `dim` values, one per entity type, we require a different loop for each entity type in order to pack/unpack the data.
Here is an example of the sort of packing code that would get generated for a computation over cells where the dat stores data over the closure of the cell:
```c
// Sketch of the generated packing code. `dat`, `map`, `start`, `end` and the
// entity/DoF counts are assumed to be in scope; the data is assumed to be doubles.
int offset, poffset, n, i, j;
int arity = ncells + nedges + nverts;  // closure entities per iteration point
double packed[ncells*ncelldofs + nedges*nedgedofs + nverts*nvertdofs];

for (n = start; n < end; n++) {          // overall iteration
    offset = 0;                          // offset counter for the map
    poffset = 0;                         // offset counter for packing

    for (i = 0; i < ncells; i++)         // pack for arity (N.B. ncells == 1)
        for (j = 0; j < ncelldofs; j++)  // pack for dim
            packed[poffset + i*ncelldofs + j] = dat[map[n*arity + offset + i] + j];
    offset += ncells;
    poffset += ncells * ncelldofs;

    for (i = 0; i < nedges; i++)         // same again for the edges ...
        for (j = 0; j < nedgedofs; j++)
            packed[poffset + i*nedgedofs + j] = dat[map[n*arity + offset + i] + j];
    offset += nedges;
    poffset += nedges * nedgedofs;

    for (i = 0; i < nverts; i++)         // ... and the vertices
        for (j = 0; j < nvertdofs; j++)
            packed[poffset + i*nvertdofs + j] = dat[map[n*arity + offset + i] + j];
}
```
It is significant to note that, with this arrangement, fewer indirections are required to load all of the data since more of it is loaded as a contiguous chunk (the `arity` is smaller and the `dim` is larger). For the example layout given above this would reduce the number of indirections from 10 to 7.
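To spell out where the 10 and the 7 come from for the triangle layout above, here is a back-of-the-envelope check in Python (assuming the current scheme needs one map entry per DoF here, since the DoF count per point is not uniform):

```python
closure_counts = {"CELL": 1, "EDGE": 3, "VERT": 3}   # entities in one cell closure
dofs_per_entity = {"CELL": 1, "EDGE": 2, "VERT": 1}  # contiguous DoFs per entity

# Flat map with dim 1: one indirection per DoF in the closure
flat_indirections = sum(closure_counts[e] * dofs_per_entity[e] for e in closure_counts)

# Entity-aware map: one indirection per closure entity, its DoFs read contiguously
entity_indirections = sum(closure_counts.values())

print(flat_indirections, entity_indirections)  # 10 7
```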
Thanks for reading. Please let me know your thoughts.
Note that your pack/unpack code is `DMPlexVecGet/SetClosure`
> Note that your pack/unpack code is `DMPlexVecGet/SetClosure`

Semantically you're right, but here we memoize the lookups as a map since calling `DMPlexVecGet/SetClosure` is a non-trivial thing to do (e.g. https://petsc.org/release/src/dm/impls/plex/plex.c.html#DMPlexVecGetClosure). I can also imagine that operations like inter-element vectorisation would not be possible if we called PETSc routines.
Interelement vectorisation is orthogonal to the calls to do the packing. The packing just needs to put stuff in the right place. We can also consider extending the PETSc API to support chunking.
I contend (admittedly with no evidence) that having compile-time known loop bounds for pack/unpack is not actually important for the speed of this code.
> memoize the lookups

This is what `PetscSectionSetClosureIndex` does.

I realise that inlined code is likely still faster but ideally it would be good to see some numbers on how much.
> Interelement vectorisation is orthogonal to the calls to do the packing. The packing just needs to put stuff in the right place. We can also consider extending the PETSc API to support chunking.

OK yeah that makes sense.
> > memoize the lookups
>
> This is what `PetscSectionSetClosureIndex` does.

Oh brilliant. I didn't realise that this was a thing.
> I realise that inlined code is likely still faster but ideally it would be good to see some numbers on how much.

I'll try to write some code that investigates this. If just calling PETSc routines is sufficiently fast then this would dramatically simplify the code generation done by PyOP2.
So I have done some experimentation on how efficient the different types of packing are:
## Experiment
To investigate the packing speeds, I took a PETSc `Vec` storing data with 1 point per cell, 1 per vertex and 3 per edge (shown below). I also gave the data a tensor structure so there were 2 DoF stored per point. This means, for example, that each edge has 6 DoF.
```
+
|\
+ +
+  +
+ + +
|   \
+-+++-+
```
For the experiment, I looped over the cells of the mesh and incremented all of the data points of the `Vec` defined over the closure of each cell. This simulated the situation in PyOP2 where we read (pack) and write (unpack) a `Dat` while the local kernel does minimal work.
I made the mesh sufficiently big so the data structures being iterated over would not fit in the cache.
## Results
My results are shown in the figures below (the right one is just a zoomed in version of the left):
The four columns correspond to:
- `petsc uncached`: This is the naive implementation where at every iteration we call `DMPlexVecGet/SetClosure`.
- `petsc cached`: This is the performance I got using `PetscSectionGetClosureIndex`.
- `hardcoded1`: This was my attempt to implement the packing code that I wrote above.
- `hardcoded2`: This tries to simulate the existing packing code that PyOP2 currently uses.

For me the key observations are:
- `petsc uncached` is catastrophically slow. Memoization of the maps is essential for performance.
- `petsc cached` takes over twice as long as either `hardcoded` option. Since packing data is the dominant cost for memory-bound codes and this cost scales with DoF count, I argue that this is a significant difference and we should not replace our packing code with calls to `PetscSectionGetClosureIndex`.
- `hardcoded1` is slightly faster than `hardcoded2`. I think that this proves that the 'entity-aware' approach I described above is worth doing.

## Other thoughts
Having finally gotten my hands dirty writing proper PETSc code, I have gained more of an appreciation for how well put together the PETSc data model is and how well it would fit with PyOP2. I therefore think that I should rephrase the overall goal of this proposed redesign from:
> We should make PyOP2 aware of the mesh topology by introducing a new `Mesh` class, as well as informing `Dat` objects of their data layout via a PETSc `Section`.

to:
> We should make PyOP2 use the same data model as PETSc.

(@dham deserves all the credit for this idea since he had it a long time ago and I am only now realising what he meant by it)
This is quite an out-of-date discussion and my pyop3 work supersedes this. Closing.