Discussion: Proposal for a major redesign of PyOP2
The following is a summary of my thoughts thus far regarding how we can redesign PyOP2 to enable new features/better performance. I think this should be seen as a first step along the way to having cool new features like:
- More generic abstractions for semi-structured meshes (e.g. adding support for cube sphere meshes)
- Iteration set composition (e.g. loop tiling, patch iterations)
## Key proposal
We should make PyOP2 aware of the mesh topology by introducing a new `Mesh` class, as well as informing `Dat` objects of their data layout via a PETSc `Section`.
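A very rough sketch of what these two objects might carry, written in Python. The class and attribute names are purely illustrative (nothing here is an existing PyOP2 or PETSc API), and the PETSc Section is only referenced in a comment:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

@dataclass
class Mesh:
    """Hypothetical topology object: records the arity of each entity relation,
    e.g. {("cell", "edge"): 3, ("cell", "vertex"): 3, ("edge", "vertex"): 2}
    for a triangle mesh."""
    arities: Dict[Tuple[str, str], int] = field(default_factory=dict)

@dataclass
class Dat:
    """Hypothetical data object: its layout would come from a PETSc Section
    (e.g. a petsc4py PETSc.Section) rather than from explicit maps."""
    data: Any
    section: Any
```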
## Justification
This change allows us to make a number of improvements to PyOP2 including:
- It gives us a much cleaner interface as we do not need to pass maps around since these can be automatically derived from the mesh object and sections.
- It lets us get away with fewer indirections every time we pack data (described below).
## Implementation
### Parloops
The interface for executing parloops should change from:
```python
parloop(kernel, iterset, dat0(map0), dat1(map0), ...)
```
To something like:
```python
parloop(mesh, kernel, iterset, dat0, dat1, ...)
```
Note that all of the maps have been removed and we instead have a new `mesh` argument.
Edit: This is not quite right since there are cases where, for example, we have a dat defined over the closure of a cell but only want to access the data stored on the cell itself and not on the edges/vertices. An application of this would be zeroing a dat, where we want to touch every entity exactly once.
This means that the API would need to look something like:
```python
parloop(mesh, kernel, CELLS, dat0(CELL_CLOSURE), ...)
```
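To make the cell-only versus closure access distinction concrete, here is a minimal runnable Python sketch. Everything in it (the `Access` enum, the `parloop` stub, the string placeholders) is invented purely for illustration and is not an existing or agreed PyOP2 interface:

```python
from enum import Enum, auto

class Access(Enum):
    """Hypothetical access descriptors for a dat inside a parloop."""
    POINT = auto()    # only the data stored on the iteration entity itself
    CLOSURE = auto()  # the data stored on the full closure of the entity

def parloop(mesh, kernel, iterset, *args):
    """Stub standing in for the proposed interface: a real implementation would
    derive the maps from `mesh` and each dat's Section before packing."""
    for dat, access in args:
        print(f"{kernel}: access {dat!r} via {access.name} while iterating over {iterset}")

# Assembly-style loop over cells, gathering dat0 through each cell's closure:
parloop("mesh", "assemble_kernel", "cells", ("dat0", Access.CLOSURE))

# Zeroing dat0: iterate over every entity it stores data on, touching each once:
parloop("mesh", "zero_kernel", "all points", ("dat0", Access.POINT))
```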
### The mesh
The new mesh object has a number of responsibilities:
- It should know the arities of the mappings between the different entity sets in the topology. For example, it should know that, for a triangle mesh, there are 3 edges and 3 vertices for every cell and 2 vertices for every edge. This is required for code generation.
- It should be able to take a PETSc `Section` (attached to a dat) and generate a map between entities in the iteration set and array offsets. This would be done by composing the iteration set -> dat base set and dat base set -> dat entity set maps from the plex with the dat entity set -> offsets map coming from the `Section`.
Note: Here we define two different types of set relating to the dat: the base set, which describes the 'entry point' to the dat, and the entity set, which describes the set of entities for which data is stored. To provide an example, a dat with data defined over the closure of a cell would have its base set as the cells of the mesh, and its entity set as all mesh entities for which it stores data (e.g. edges + vertices). We would therefore need to compose a map from the iteration set -> cells with a map from cells -> vertices + edges.
Note: When composing two plex maps (e.g. interior facets -> cells -> dat entities), we would simply take a multiplicative approach and multiply the arities. For the interior facet case on triangles with data kept on vertices this would result in a map from interior facets -> vertices with arity 2x3=6.
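As a concrete illustration of these notes, here is a small self-contained Python sketch of composing two maps and of the multiplicative arity rule. The plain lists merely stand in for the plex maps; none of the names refer to real PyOP2 or PETSc objects.

```python
# Interior facets -> cells: each interior facet has 2 incident cells (arity 2)
facet_to_cell = [[0, 1], [1, 2]]

# Cells -> vertices: each triangle has 3 vertices (arity 3)
cell_to_vert = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

def compose(outer, inner):
    """Compose two maps; the arity of the result is the product of the arities."""
    return [[tgt for mid in row for tgt in inner[mid]] for row in outer]

facet_to_vert = compose(facet_to_cell, cell_to_vert)
print(facet_to_vert)          # [[0, 1, 2, 1, 2, 3], [1, 2, 3, 2, 3, 4]]
print(len(facet_to_vert[0]))  # arity 2 * 3 = 6, as in the note above
```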
### Dats
For code generation, each `Dat` object should now store a section-like object describing the DoF count per entity. For example, a dat storing data for the triangle below would record `{CELL: 1, EDGE: 2, VERT: 1}`.
```
+
|\
+ +
| \
+ + +
|  \
+-+-+-+
```
These values correspond to the number of contiguous degrees of freedom stored on each entity (if we had any tensor shape then these numbers would just be multiplied by a scalar) and therefore they can all be packed/unpacked using a single indirection.
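As a small worked example in plain Python (the entity names and the tensor size are just for illustration), the scaling by a tensor shape looks like this:

```python
# DoFs stored per entity for the triangle above (the section-like object)
dofs_per_entity = {"CELL": 1, "EDGE": 2, "VERT": 1}

# A tensor shape, say 3 components per point, simply scales every count
tensor_size = 3
scaled = {entity: n * tensor_size for entity, n in dofs_per_entity.items()}
print(scaled)  # {'CELL': 3, 'EDGE': 6, 'VERT': 3}
```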
### Code generation
Since dats are now described by multiple `dim` values, one per entity type, we require a different loop for each entity type in order to pack/unpack the data.
Here is an example of the sort of packing code that would get generated for a computation over cells where the dat stores data over the closure of the cell:
```c
// Sketch of the generated packing code. `dat`, `map`, `start`, `end` and the
// entity/DoF counts are assumed to be in scope; the data is assumed to be doubles.
int offset, poffset, n, i, j;
int arity = ncells + nedges + nverts;  // closure entities per iteration point
double packed[ncells*ncelldofs + nedges*nedgedofs + nverts*nvertdofs];

for (n = start; n < end; n++) {          // overall iteration
    offset = 0;                          // offset counter for the map
    poffset = 0;                         // offset counter for packing

    for (i = 0; i < ncells; i++)         // pack for arity (N.B. ncells == 1)
        for (j = 0; j < ncelldofs; j++)  // pack for dim
            packed[poffset + i*ncelldofs + j] = dat[map[n*arity + offset + i] + j];
    offset += ncells;
    poffset += ncells * ncelldofs;

    for (i = 0; i < nedges; i++)         // same again for the edges ...
        for (j = 0; j < nedgedofs; j++)
            packed[poffset + i*nedgedofs + j] = dat[map[n*arity + offset + i] + j];
    offset += nedges;
    poffset += nedges * nedgedofs;

    for (i = 0; i < nverts; i++)         // ... and the vertices
        for (j = 0; j < nvertdofs; j++)
            packed[poffset + i*nvertdofs + j] = dat[map[n*arity + offset + i] + j];
}
```
It is significant to note that, with this arrangement, fewer indirections are required to load all of the data since more of it is loaded as a contiguous chunk (the `arity` is smaller and the `dim` is larger). For the example layout given above this would reduce the number of indirections from 10 to 7.
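To spell out where the 10 and the 7 come from for the triangle layout above, here is a back-of-the-envelope check in Python (assuming the current scheme needs one map entry per DoF here, since the DoF count per point is not uniform):

```python
closure_counts = {"CELL": 1, "EDGE": 3, "VERT": 3}   # entities in one cell closure
dofs_per_entity = {"CELL": 1, "EDGE": 2, "VERT": 1}  # contiguous DoFs per entity

# Flat map with dim 1: one indirection per DoF in the closure
flat_indirections = sum(closure_counts[e] * dofs_per_entity[e] for e in closure_counts)

# Entity-aware map: one indirection per closure entity, its DoFs read contiguously
entity_indirections = sum(closure_counts.values())

print(flat_indirections, entity_indirections)  # 10 7
```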
Thanks for reading. Please let me know your thoughts.
Note that your pack/unpack code is `DMPlexVecGet/SetClosure`
> Note that your pack/unpack code is `DMPlexVecGet/SetClosure`

Semantically you're right, but here we memoize the lookups as a map since calling `DMPlexVecGet/SetClosure` is a non-trivial thing to do (e.g. https://petsc.org/release/src/dm/impls/plex/plex.c.html#DMPlexVecGetClosure). I can also imagine that operations like inter-element vectorisation would not be possible if we called PETSc routines.
Interelement vectorisation is orthogonal to the calls to do the packing. The packing just needs to put stuff in the right place. We can also consider extending the PETSc API to support chunking.
I contend (admittedly with no evidence) that having compile-time known loop bounds for pack/unpack is not actually important for the speed of this code.
> memoize the lookups

This is what `PetscSectionSetClosureIndex` does.

I realise that inlined code is likely still faster but ideally it would be good to see some numbers on how much.
> Interelement vectorisation is orthogonal to the calls to do the packing. The packing just needs to put stuff in the right place. We can also consider extending the PETSc API to support chunking.

OK yeah that makes sense.
> > memoize the lookups
>
> This is what `PetscSectionSetClosureIndex` does.

Oh brilliant. I didn't realise that this was a thing.
> I realise that inlined code is likely still faster but ideally it would be good to see some numbers on how much.

I'll try to write some code that investigates this. If just calling PETSc routines is sufficiently fast then this would dramatically simplify the code generation done by PyOP2.
So I have done some experimentation on how efficient the different types of packing are:
## Experiment
To investigate the packing speeds, I took a PETSc `Vec` storing data with 1 point per cell, 1 per vertex and 3 per edge (shown below). I also gave the data a tensor structure so there were 2 DoF stored per point. This means, for example, that each edge has 6 DoF.
```
+
|\
+ +
+  +
+ + +
|   \
+-+++-+
```
For the experiment, I looped over the cells of the mesh and incremented all of the data points of the `Vec` defined over the closure of each cell. This simulated the situation in PyOP2 where we read (pack) and write (unpack) a `Dat` while the local kernel does minimal work.
I made the mesh sufficiently big so the data structures being iterated over would not fit in the cache.
## Results
My results are shown in the figures below (the right one is just a zoomed in version of the left):
The four columns correspond to:
- `petsc uncached`: This is the naive implementation where at every iteration we call `DMPlexVecGet/SetClosure`.
- `petsc cached`: This is the performance I got using `PetscSectionGetClosureIndex`.
- `hardcoded1`: This was my attempt to implement the packing code that I wrote above.
- `hardcoded2`: This tries to simulate the existing packing code that PyOP2 currently uses.

For me the key observations are:
- `petsc uncached` is catastrophically slow. Memoization of the maps is essential for performance.
- `petsc cached` takes over twice as long as either `hardcoded` option. Since packing data is the dominant cost for memory-bound codes and this cost scales with DoF count, I argue that this is a significant difference and we should not replace our packing code with calls to `PetscSectionGetClosureIndex`.
- `hardcoded1` is slightly faster than `hardcoded2`. I think that this proves that the 'entity-aware' approach I described above is worth doing.

## Other thoughts
Having finally gotten my hands dirty writing proper PETSc code, I have gained more of an appreciation for how well put together the PETSc data model is and how well it would fit with PyOP2. I therefore think that I should rephrase the overall goal of this proposed redesign from:
> We should make PyOP2 aware of the mesh topology by introducing a new `Mesh` class, as well as informing `Dat` objects of their data layout via a PETSc `Section`.

to:
> We should make PyOP2 use the same data model as PETSc.

(@dham deserves all the credit for this idea since he had it a long time ago and I am only now realising what he meant by it)
This is quite an out-of-date discussion and my pyop3 work supersedes this. Closing.