isciences/exactextract

Python: Perf thoughts

Opened this issue ยท 3 comments

๐Ÿ‘‹ Feel free to ignore me but was just reading through the code

You're currently using the OGR bindings or Fiona to load a GeoJSON FeatureCollection to Python. Have you looked at pyogrio at all? For loading entire files it can be easily 5-10x faster than fiona because it's vectorized instead of a python loop.

In my experience, I've found that serializing GeoJSON between Python objects and native code can be really slow. What would you think about an API that (optionally) returned indices into the passed-in objects instead of the features themselves so that users can skip the overhead of serializing the data back to Python?

It would be great to release the GIL while in pure C++ code, especially if you're able to pass a whole chunk of rasters/vectors at once to C++, so you aren't acquiring and releasing the GIL on every iteration of the loop. Looking at the pybind docs it says

pybind11 will never implicitly release the GIL

so I currently your code always holds the GIL?

If you had interest, I'd love to discuss how exactextract could use GeoArrow. It's an exact candidate for why I think GeoArrow has so much potential: existing binary data in GEOS objects in GeoPandas or Shapely has to serialize to GeoJSON and the GEOS -> GeoJSON -> exactextract -> GeoJSON -> GEOS conversion is really slow. There's ongoing work to handle GeoArrow <--> GEOS interop in C/C++, which would simplify getting geometries back into GEOS for use in core exactextract.

You're currently using the OGR bindings or Fiona to load a GeoJSON FeatureCollection to Python

If the OGR bindings are available, I'm exporting the geometry as WKB

def geometry(self):
return bytes(self.feature.GetGeometryRef().ExportToWkb())

which is then read by GEOS:

if (py::isinstance<py::bytes>(geom)) {
py::buffer_info info = py::buffer(geom).request();
m_geom = GEOSGeomFromWKB_buf_r(m_context,
(const unsigned char*)info.ptr,
info.shape[0]);

Have you looked at pyogrio at all? For loading entire files it can be easily 5-10x faster than fiona because it's vectorized instead of a python loop.

I hadn't looked at it, but it appears that you can use it like this:

>>> from pyogrio import read_dataframe
>>> from exactextract import exact_extract
>>> polys = read_dataframe('ne_10m_admin_0_countries.shp')
>>> country_pop = exact_extract('gpw_v4_population_count_rev11_2005_2pt5_min.tif', polys, 'sum')

It would be great to release the GIL while in pure C++ code, especially if you're able to pass a whole chunk of rasters/vectors at once to C++, so you aren't acquiring and releasing the GIL on every iteration of the loop

This isn't something I know much about, but I can l look into it. Having a benchmark would be helpful.

If you had interest, I'd love to discuss how exactextract could use GeoArrow. It's an exact candidate for why I think GeoArrow has so much potential: existing binary data in GEOS objects in GeoPandas or Shapely has to serialize to GeoJSON and the GEOS -> GeoJSON -> exactextract -> GeoJSON -> GEOS conversion is really slow.

I'd be happy to discuss it. To be clear, the conversion happening in this package would only be OGR Geometry -> WKB -> GEOS, unless the user has decided not to use OGR. I didn't spend much time looking at conversion options for fiona and GeoPandas because it hasn't been a significant part of runtime in my testing. I'm guessing it's somehow unsafe to just grab the GEOS geometry pointers directly from a GeoPandas dataframe?

I'm guessing it's somehow unsafe to just grab the GEOS geometry pointers directly from a GeoPandas dataframe?

I think this works on conda but maybe not pypi...I think PyGEOS used to do it with Shapely before the two merged. Shapely has a C API for this but I don't know the details of how/when to use it from another package.

I'm guessing it's somehow unsafe to just grab the GEOS geometry pointers directly from a GeoPandas dataframe?

I believe this is only valid when you can ensure that the underlying GEOS libraries are exactly the same, but in general you can't know what version the other library is using.

PyGEOS and Shapely used to have a warning when they checked that the versions of GEOS were not the same that conversions between the two would be slower. Presumably that meant serializing to WKB and deserializing.