perrygeo/python-rasterstats

Poor performance for boundless reads

AsgerPetersen opened this issue · 3 comments

I would like to propose exposing a way to toggle boundless reading for two reasons:

  1. There are use cases where a feature falling outside the raster extent is an error. For example, in my job I am provided with countrywide rasters, and I collect statistics from these rasters for buildings and roads. If a feature is outside the raster extent, something is wrong with either the feature or the raster.

  2. Enabling boundless reading in rasterio seriously degrades performance in some cases. It looks like the dataset is opened for each feature and the block cache is effectively disabled when using boundless reading. This gist https://gist.github.com/AsgerPetersen/6f9c8120b85e462ccbc26191a2117b3a demonstrates a performance improvement of about 50x when disabling boundless reading. On my real-world data the performance improvement is on the order of 200x.

I implemented it for my own usage here: AsgerPetersen@c375094.
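To illustrate the first use case, a strict extent check could look roughly like the sketch below. The function names and the `(minx, miny, maxx, maxy)` tuple layout are assumptions made for illustration; this is not the code from the linked commit.

```python
def bounds_within(feature_bounds, raster_bounds):
    """Return True if the feature bounds lie entirely inside the raster bounds.

    Both arguments are (minx, miny, maxx, maxy) tuples in the same CRS
    (hypothetical layout chosen for this sketch).
    """
    fminx, fminy, fmaxx, fmaxy = feature_bounds
    rminx, rminy, rmaxx, rmaxy = raster_bounds
    return (
        fminx >= rminx
        and fminy >= rminy
        and fmaxx <= rmaxx
        and fmaxy <= rmaxy
    )


def require_within(feature_bounds, raster_bounds):
    """Raise ValueError instead of silently padding when a feature is out of extent."""
    if not bounds_within(feature_bounds, raster_bounds):
        raise ValueError(
            f"Feature bounds {feature_bounds} extend outside raster bounds {raster_bounds}"
        )
```

With a check like this, a feature outside the raster surfaces as an error rather than as statistics computed over nodata padding.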

@AsgerPetersen 👍 looks like a good option for both providing more flexible edge handling and potentially a performance boost. Can you submit a PR for this? Looks ready to go. I can get this in the next release after some testing.

I'm curious about the performance degradation with boundless reads. I'll look into that as well.

Thanks!

Sure. PR in #228
I wonder if it could be an idea to implement boundless reading from rasterio datasets the same way as for numpy arrays: https://github.com/perrygeo/python-rasterstats/blob/master/src/rasterstats/io.py#L165. It seems that the strategy used by rasterio for boundless reading isn't very performance friendly in the case of repeated reads.
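As a rough sketch of the idea (not necessarily how the linked `io.py` code does it), a boundless read can be emulated by reading only the part of the window that overlaps the data and pasting it into a nodata-filled output array. The function name and parameters below are hypothetical, and the window is assumed to have a positive height and width.

```python
import numpy as np


def read_window_boundless(arr, row_off, col_off, height, width, nodata):
    """Read a (height, width) window from a 2D array, padding with nodata.

    row_off and col_off may be negative, and the window may extend past
    the array edges; cells with no underlying data are set to nodata.
    """
    out = np.full((height, width), nodata, dtype=arr.dtype)

    # Intersection of the requested window with the array, in array coordinates.
    r0 = max(row_off, 0)
    c0 = max(col_off, 0)
    r1 = min(row_off + height, arr.shape[0])
    c1 = min(col_off + width, arr.shape[1])

    # Copy only when the window actually overlaps the array.
    if r0 < r1 and c0 < c1:
        out[r0 - row_off:r1 - row_off, c0 - col_off:c1 - col_off] = arr[r0:r1, c0:c1]
    return out
```

Applied to a rasterio dataset, the same approach would mean doing a regular bounded read of the overlapping window and padding the result on the rasterstats side, which would avoid going through rasterio's boundless code path for every feature.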

The boundless=False option is now in master. I'm going to repurpose this issue as a place to discuss the rasterio boundless read performance. As you suggested, there might be some workarounds to implement on the rasterstats side.