perrygeo/python-rasterstats

Potential refactor for large speed boost

sdtaylor opened this issue · 5 comments

Hi, I had to do some optimizations recently for zonal stats on a very large number of polygons, and I thought the method might fit well here.
Essentially, instead of iterating through each feature one at a time, I use the rasterio.features.rasterize function to label every pixel of the raster with the ID of the feature that covers it. Some array reshaping then produces a sorted masked array, and all stats can be generated in a vectorized fashion. This offers huge performance gains as feature counts grow (see the figure below). As a proof of concept I set up just a few stats (['mean','max','min','sum','std']) to compare against the original zonal method.

[Figure: runtime comparison of the original and experimental methods across polygon counts]
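For readers who don't want to dig into the branch, here is a minimal sketch of the rasterize-all-at-once idea. It is not the POC code (which builds a sorted masked array); the helper name and the bincount-based mean shown here are illustrative only.

import numpy as np
import rasterio
from rasterio import features

def vectorized_zonal_means(geoms, raster_path):
    # Illustrative sketch only: label every pixel with a 1-based feature ID
    # in one rasterize call, then aggregate per label in a vectorized pass.
    with rasterio.open(raster_path) as src:
        data = src.read(1, masked=True)
        transform = src.transform

    labels = features.rasterize(
        ((geom, i + 1) for i, geom in enumerate(geoms)),
        out_shape=data.shape,
        transform=transform,
        fill=0,              # 0 = background / no feature
        all_touched=False,
        dtype='int32',
    )

    # Keep pixels that are valid data and fall inside some feature.
    valid = (~data.mask) & (labels > 0)
    ids = labels[valid]
    vals = data.data[valid].astype('float64')

    # Per-zone sums and counts in a single pass via bincount.
    n = len(geoms) + 1
    counts = np.bincount(ids, minlength=n)
    sums = np.bincount(ids, weights=vals, minlength=n)
    with np.errstate(invalid='ignore', divide='ignore'):
        means = sums / counts
    return means[1:]         # one mean per geom, in order; NaN if no pixels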

Some downsides of this:

  • It creates a copy of the original raster array, so will be more memory intensive with larger rasters.
  • It will load a larger chunk of the raster, also using more memory than 1 feature at a time.
  • It won't work with overlapping polygons as is.
  • Rasterizing all features at once means a pixel can only be assigned to a single feature, whereas before a pixel split by two polygons could be "shared". I suspect this is the cause of some slight differences in the final stats, but in my comparisons the differences are very tiny.

You can see the code here https://github.com/sdtaylor/python-rasterstats/tree/zonal_refactor_poc
A testing/comparison script is included below.
@perrygeo any interest in having this in rasterstats somehow?

Test code

import numpy as np
import rasterstats
import rasterio

from shapely.geometry import Point, MultiPoint, box, MultiPolygon
from shapely.ops import voronoi_diagram

import pandas as pd

from timeit import default_timer as timer

test_tif = './tests/data/slope.tif'


def generate_random_polygons(n, bounds, seed=10):
    """
    Use random points and Voronoi diagrams to create
    numerous polygons that completely cover the given bounds.

    Parameters
    ----------
    n : int
        number of polygons.
    bounds : tuple
        (minx, miny, maxx, maxy)
    seed   : numeric
        random seed

    Returns
    -------
    [Polygon]
        List of polygons

    """
    minx, miny, maxx, maxy = bounds
    
    rng = np.random.default_rng(seed)
    
    points_x = rng.uniform(minx, maxx, n)
    points_y = rng.uniform(miny, maxy, n)
    
    bounding_box = box(*bounds)
    point_geom = MultiPoint([Point(x,y) for x,y in zip(points_x, points_y)])
    geoms = [g.intersection(bounding_box) for g in voronoi_diagram(point_geom).geoms]
    return geoms

with rasterio.open(test_tif) as src:
    tif_profile = src.profile
    tif_bounds  = src.bounds

#----------------
# Time both methods
stats = ['mean','max','min','sum','std']

timing_results = []
comparison_results = {s:[] for s in stats}


for n_polygons in [10,50,100,500]:
    print(f'testing with {n_polygons} polygons')
    geoms = generate_random_polygons(n=n_polygons, bounds=tif_bounds, seed=None)
    
    # Drop some geoms so the raster is not 100% covered
    geoms = geoms[5:]
        
    original_start = timer()
    original_method = list(rasterstats.main.gen_zonal_stats(geoms, test_tif, stats=stats, all_touched=False))
    original_end   = timer()
    original_total = round(original_end - original_start,5)
    
    
    exp_start = timer()
    exp_method = rasterstats.main.gen_zonal_stats_experment(geoms, test_tif, stats=stats, all_touched=False)
    exp_end  = timer()
    exp_total = round(exp_end - exp_start, 5)
    
    timing_results.append({
        'n_polygons': n_polygons,
        'original_time': original_total,
        'experimental_time': exp_total
    })

    # Compare the difference in final numbers between the original method
    # and new method.
    for o, e in zip(original_method, exp_method):
        for s in stats:
            if o[s] is None and e[s] is None:
                comparison_results[s].append(0)
            else:
                comparison_results[s].append(o[s] - e[s])


pd.DataFrame(timing_results).plot(x='n_polygons', y=['original_time', 'experimental_time'], ylabel='seconds', kind='bar')

for stat, differences in comparison_results.items():
    print(f'{stat} largest abs difference: {np.abs(differences).max()}')

@sdtaylor, that was a good idea...
From an old fellow 😉

@sdtaylor that's a great result! By constraining the problem a bit, you've found some solid speedups here. I have some technical questions about the benchmarks, and some thoughts on including in python-rasterstats...

benchmark questions

I haven't run the code yet so I'm going on the assumption that the speedups can be credited to two things:

  1. a single rasterio.rasterize call (1 numpy allocation) vs one call per feature.
  2. less raster IO, a single big read vs one read per feature

I'd love to see a breakdown of the timing. In my experience, 2) is less of an issue with real-world datasets because GDAL's caching behavior and efficient block IO mean that, in aggregate, there is little advantage to loading rasters fully into memory vs relying on GDAL's RasterIO mechanism. It largely depends on how well the raster format is laid out on disk and on the IO bus or network in between. slope.tif is a toy test dataset and does not exercise block IO well. Which is to say, it would be good to test this on more realistic raster datasets, including COGs on S3, to see if the speedup holds.

The rasterize-all-at-once optimization is really a good one. Again, I'm interested in seeing more granular timing so we could find out whether it's the rasterization algorithm itself or the numpy array allocations (and subsequent garbage collection) that slow the per-feature approach down. But that's mostly curiosity - I think it's clearly a speed win in certain cases.
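Not rasterstats code, but a crude way to get that breakdown would be to time the read and the rasterize call separately, e.g.:

import time
import rasterio
from rasterio import features

def profile_steps(geoms, raster_path):
    # Hypothetical helper: separates raster IO from rasterization time so we
    # can see which of the two is actually responsible for the speedup.
    t0 = time.perf_counter()
    with rasterio.open(raster_path) as src:
        data = src.read(1)            # single big read
        transform = src.transform
    t1 = time.perf_counter()

    labels = features.rasterize(      # single rasterize call over all features
        ((g, i + 1) for i, g in enumerate(geoms)),
        out_shape=data.shape,
        transform=transform,
        fill=0,
        dtype='int32',
    )
    t2 = time.perf_counter()

    return {'read_s': t1 - t0, 'rasterize_s': t2 - t1}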

including it in python-rasterstats

This is a great optimization, but it does change the behavior so I don't think it belongs in the current zonal_stats function. I think we should add it as a separate function with a distinct name (zonal_stats_full_coverage?) and we can factor out common pieces later if needed.

The additional constraints are acceptable in some circumstances, but users should be aware that the results will never be the same as the other algorithms'. Specifically, not handling overlap of any kind (neither at the geometry nor the pixel level) runs the risk of "losing" features - IOW we can no longer guarantee that the set of input features corresponds 1:1 to the output stats. Worse, it means that any overlap could hide parts of other valid polygons, making the statistics invalid. For many applications of rasterstats, this would be a non-starter.

This also means that polygon order affects the result. Another caveat to document: the geometries need to be sorted in priority order if you suspect overlap. This could bias the results, since geometries painted last get slightly larger areas (they "win" the edge pixels).
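If sorting becomes the documented workaround, a tiny (hypothetical) helper could make the convention explicit - burn lower-priority geometries first so higher-priority ones win any shared pixels:

def order_for_rasterize(geoms, priorities):
    # Hypothetical pre-sort for overlapping inputs: the last geometry painted
    # "wins" shared pixels, so put the highest-priority features at the end.
    # `priorities` is caller-supplied (e.g. smaller area = higher priority);
    # IDs stay tied to the original feature index so stats can be matched back.
    order = sorted(range(len(geoms)), key=lambda i: priorities[i])
    return [(geoms[i], i + 1) for i in order]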

It's not clear that every use case would see an improvement either; reading the entire raster is not a good move unless the polygons cover close to 100% of the area. Sparse vector datasets without full coverage will waste IO by fetching the entire raster. Another caveat to document: applying this optimization depends on knowing both your raster and your vector dataset well.

The unconstrained memory usage is pretty concerning. Most production use cases of rasterstats that I've personally worked on involve some sort of container infrastructure like k8s, where you're likely to be memory constrained and dealing with arbitrary inputs - pure speed is less important than staying within some memory bound, since that characteristic allows you to scale economically. Example: if you can run 10 containers with 1 GB each for 6 minutes vs 1 container with 64 GB for 1 minute, the horizontally-scalable-but-6x-slower solution still wins economically (assuming cost scales with GB × minutes). As I'm often dealing with arbitrary user-input rasters and geometries, having the ability to keep memory reasonably constrained is critical.

This optimization is firmly in the "one raster band, many geometries" camp. There is also much interest in the opposite problem, "a few geometries, thousands of raster bands", for which this approach would clearly not be optimal. Another caveat for the docs.

So: it has different (potentially incorrect) behavior, is optimized for polygon datasets with nearly full coverage, and uses more resources... It's not going to replace zonal_stats for every use case. To be completely honest, I don't see myself using this in its current form. Speed is important, but correctness and scalability are more so for the type of work I do, which is mostly backend web services. If you're on a mondo workstation trying to crank through a big analysis in a notebook, and you're manually inspecting your inputs and outputs, this could be perfect. Different operational environments have different tradeoffs.

ideas

We could potentially modify the algorithm to solve the correctness issues somehow. Or at least produce a warning along the lines of:

Input: 345 features, area 1000000 sq units
Output: 326 features, area 889000 sq units (ignoring 19 features and losing 111000 sq units of coverage)

If we could track the features that were hidden or whose footprint was altered by overlap, we could run those geometries separately. Tracking the feature shapes and running those overlapping checks might be rather expensive and negate all the optimizations though. 🤔
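A cheap check along those lines (illustrative only, assuming feature i is burned with ID i + 1) might look like:

import numpy as np

def find_hidden_features(labels, n_features):
    # Sketch: find features whose ID never appears in the rasterized label
    # array -- these were completely hidden by overlap (or fell between pixel
    # centers) and could be re-run through the existing per-feature code path.
    present = np.unique(labels)
    missing = np.setdiff1d(np.arange(1, n_features + 1), present)
    return missing - 1   # 0-based indices of the hidden features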

Another thought: this doesn't have to be an all-or-none proposition. We could split the space into quadrants or smaller tiles and work in chunks (with some overhead to track the edges). That would help keep memory bounded.
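A rough sketch of that tiled variant, ignoring edge bookkeeping and non-additive stats (median, etc.), might look like:

import numpy as np
import rasterio
from rasterio import features, windows
from shapely.geometry import box

def chunked_sums(geoms, raster_path, tile=1024):
    # Sketch of tiled processing to bound memory: read one window at a time,
    # rasterize only the geometries intersecting it, and accumulate
    # per-feature sums/counts across windows.
    n = len(geoms)
    sums = np.zeros(n + 1)
    counts = np.zeros(n + 1, dtype='int64')

    with rasterio.open(raster_path) as src:
        for row in range(0, src.height, tile):
            for col in range(0, src.width, tile):
                win = windows.Window(col, row,
                                     min(tile, src.width - col),
                                     min(tile, src.height - row))
                win_box = box(*windows.bounds(win, src.transform))
                shapes = [(g, i + 1) for i, g in enumerate(geoms)
                          if g.intersects(win_box)]
                if not shapes:
                    continue

                data = src.read(1, window=win, masked=True)
                labels = features.rasterize(
                    shapes, out_shape=data.shape,
                    transform=src.window_transform(win),
                    fill=0, dtype='int32')

                valid = (~data.mask) & (labels > 0)
                ids = labels[valid]
                vals = data.data[valid].astype('float64')
                sums += np.bincount(ids, weights=vals, minlength=n + 1)
                counts += np.bincount(ids, minlength=n + 1)

    return sums[1:], counts[1:]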


In any case, thanks @sdtaylor for this work. I'm happy to include it as an alternative zonal_stats function under a new function name!

I'm interested in this. Our use case involves many geometries analyzed against several single-band rasters.

If it's possible to cache the rasterized version of the polygons between runs, it would be even better in our scenario.
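For illustration only (nothing like this exists in rasterstats today), caching could be as simple as keying the rasterized label array on the raster grid, so any raster sharing the same shape and transform reuses it:

import numpy as np
import rasterio
from rasterio import features

# Hypothetical caching sketch: rasterize the polygons once per grid and reuse
# the label array for every raster on that grid. Assumes one fixed polygon set.
_label_cache = {}

def labels_for_grid(geoms, shape, transform):
    key = (shape, transform.a, transform.b, transform.c,
           transform.d, transform.e, transform.f)
    if key not in _label_cache:
        _label_cache[key] = features.rasterize(
            ((g, i + 1) for i, g in enumerate(geoms)),
            out_shape=shape, transform=transform, fill=0, dtype='int32')
    return _label_cache[key]

def mean_per_feature(geoms, raster_path):
    with rasterio.open(raster_path) as src:
        data = src.read(1, masked=True)
        labels = labels_for_grid(geoms, data.shape, src.transform)
    valid = (~data.mask) & (labels > 0)
    n = len(geoms) + 1
    sums = np.bincount(labels[valid],
                       weights=data.data[valid].astype('float64'), minlength=n)
    counts = np.bincount(labels[valid], minlength=n)
    with np.errstate(invalid='ignore', divide='ignore'):
        return (sums / counts)[1:]   # NaN where a feature got no pixels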

Happy to help!

Very interested in the performance improvements mentioned here! Are there any updates since last year?

In particular, @hardcore-code and @george-silva's point about caching the rasterized polygons would offer significant improvements for my use case where I'm recalculating zonal stats in the same locations across a time series.