perrygeo/python-rasterstats

Using rasterstats with multigeometries on very large datasets

yyu1 opened this issue · 9 comments

yyu1 commented

Hi perrygeo,
I have used rasterstats with some very large raster data (100 m resolution global data) using zonal_stats, and I have run into issues. I was able to find the problem and made some modifications on my fork of rasterstats that seem to be working.

However, I haven't considered whether it might break other intended functions of rasterstats, so I wanted to open an issue here to discuss your thoughts and the intended usage patterns for python-rasterstats.

Typically, when I want to do some kind of zonal statistics on very large rasters, I write my own C++ code using boost::accumulators. I think that is really the best way to calculate statistics on such large datasets, and it runs circles around Python in terms of speed. However, I thought I would give rasterstats a try since I have used it on some smaller datasets.

I ran into the following problem:
Running zonal_stats causes Python's memory usage to balloon until the kernel kills the process. Activity Monitor was showing memory usage > 100 GB.

Digging into the code further, I found that the problem is due to MultiPolygon features in the vector file. It happens in this portion:

geom = shape(feat['geometry'])        # build a shapely geometry from the GeoJSON-like feature
if 'Point' in geom.type:
    geom = boxify_points(geom, rast)  # promote point geometries to pixel-sized boxes
geom_bounds = tuple(geom.bounds)      # bounding box of the *entire* (multi)geometry
fsrc = rast.read(bounds=geom_bounds)  # read that whole bounding window into memory

The problem is, let's say I have a very large raster (mine is about 130 GB), and there is a MultiPolygon feature whose polygons are spread out across almost the entire raster.
In this case geom.bounds gives a giant rectangle that encloses all the polygons, even though the actual area covered by those polygons is not that big.
The code then tries to read all of that into memory, and later on it gets even worse as the code builds nodata and NaN masks of the same size. This approach is never going to work with large rasters and MultiPolygons that are spread out.
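To make the failure mode concrete, here is a toy shapely illustration (the coordinates are invented): two tiny polygons at opposite corners of a global grid share one enormous bounding box, so a single bounds-based read pulls in nearly the whole raster even though the polygons themselves cover almost nothing.

from shapely.geometry import MultiPolygon, box

# two 1-degree x 1-degree patches at opposite corners of the globe
multi = MultiPolygon([box(-180, -90, -179, -89),
                      box(179, 89, 180, 90)])

print(multi.area)    # 2.0 -> the polygons themselves cover almost nothing
print(multi.bounds)  # (-180.0, -90.0, 180.0, 90.0) -> the read window spans the whole raster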

The way I fixed this was to iterate through each polygon in a MultiPolygon, read the raster for that polygon into a numpy array, flatten that array to 1D, go to the next polygon, do the same, and append its 1D array to the end. The flattening makes the append operation possible, and we don't really need the relative positions of the pixels in 2D space anyway. The same is done for the nodata and NaN arrays.
At the end, just go back to the same code to calculate the statistics on the 1D array.
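A minimal sketch of this flatten-and-append idea, written against the public rasterio/shapely/numpy APIs rather than the actual rasterstats internals (the function name, band index, and nodata handling here are illustrative, not the code in my fork):

import numpy as np
from rasterio.windows import from_bounds
from rasterio.features import geometry_mask

def flat_values_per_part(src, multi, nodata):
    """Read each part of a multi-part geometry through its own small window
    and return the valid pixel values as one flat 1D array."""
    pieces = []
    for part in multi.geoms:
        # window covering only this polygon, snapped to whole pixels
        window = from_bounds(*part.bounds, transform=src.transform)
        window = window.round_offsets().round_lengths()
        data = src.read(1, window=window)
        win_transform = src.window_transform(window)
        # True for pixels that fall inside this particular polygon
        inside = geometry_mask([part], out_shape=data.shape,
                               transform=win_transform, invert=True)
        valid = inside & (data != nodata)
        pieces.append(data[valid])  # boolean indexing already yields a 1D array
    return np.concatenate(pieces) if pieces else np.empty(0, dtype=src.dtypes[0])

The statistics (count, min, max, mean, etc.) can then be computed on the concatenated array exactly as before.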

The questions now are:

  1. Does the flattening break other intended uses of zonal_stats?
  2. Does it even make sense to do this for python-rasterstats? Perhaps this use case is better suited for a C/C++ program.

Hope you can provide some feedback regarding this approach and your intended usage for python-rasterstats.

yyu1 commented

I think this is related to #206. The slowness and hanging there are very likely due to the MultiPolygon type using large amounts of memory.

perrygeo commented

Excellent observations on the memory usage with geographically dispersed multigeometries - thanks @yyu1!

I'd say the "flattening" approach could work very well. In more general terms, the data & mask arrays for any Multi* geometry could be calculated by looping over its single-part geometries.

I'm concerned that it might be an optimization for this one use case but hurt performance for other cases. And code complexity is always a concern. But I'd need to see the implementation to understand the implications.

If we could implement it as an optional keyword argument, zonal_stats(..., split_multi_geom=False), that would be ideal, as it would allow us to evaluate both approaches side by side for a while until we were sure about its accuracy and performance. Would this be possible @yyu1 ?
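For illustration, usage with the proposed flag might look like this (the file names are placeholders, and split_multi_geom does not exist in rasterstats yet):

from rasterstats import zonal_stats

# current behaviour: one read per feature, covering the feature's full bounding box
baseline = zonal_stats("zones.shp", "big_raster.tif", stats=["count", "mean"])

# proposed opt-in behaviour: one read per single-part geometry
candidate = zonal_stats("zones.shp", "big_raster.tif", stats=["count", "mean"],
                        split_multi_geom=True)

# spot-check that both code paths gather the same pixels before trusting the optimization
for old, new in zip(baseline, candidate):
    assert old["count"] == new["count"]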

yyu1 commented

@perrygeo I agree that an optional keyword is the best way to test this out without affecting the default processing.

It should be fairly straightforward, because the way I implemented it was to add an if block that checks whether the geometry is a MultiPolygon. It should just be a matter of checking if MultiPolygon and split_multi_geom, as sketched below. My fork is some commits behind the main branch; I'll try to catch it up, add the optional keyword, and make a pull request.
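Roughly, the branch would look like this (read_parts_flattened is a hypothetical helper standing in for the per-polygon loop described above, not the actual code in my fork):

# inside the per-feature loop, after geom has been built from the feature
if split_multi_geom and geom.geom_type == 'MultiPolygon':
    flat_data = read_parts_flattened(rast, geom)   # per-part reads, flattened and appended
else:
    fsrc = rast.read(bounds=tuple(geom.bounds))    # existing single-window path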

daaugusto commented

Hi @yyu1 and @perrygeo, what is the status of this feature? I just ran into this memory issue when trying to calculate the zonal stats for multiple polygons. In my case I have a vector file with 5564 polygons and a 145152x123854 raster file, and I cannot compute the zonal stats of the entire vector on a 128 GB machine due to insufficient memory. However, if I take only part of the vector file, which has a much smaller bounding box, then zonal stats runs successfully. So I think the proposed strategy would be very useful.

yyu1 commented

@daaugusto Thanks for bringing this up again. I had actually forgotten that I had agreed to add this feature. I haven't used rasterstats in a couple of years so it had escaped my mind. I'll see if the code I wrote is still compatible with any changes that might have occurred during this time.

@yyu1 - I'm facing the very same issue. Excellent discussion, and thanks a ton for bringing this issue up.
If the suggested enhancement gets accepted, it would of course be very helpful.

Hi @yyu1, thanks in advance for your effort! I believe this is a great contribution to rasterstats and its community!

yyu1 commented

@perrygeo I have submitted a pull request for this feature #255

Thanks! I will take a look this weekend.