perrygeo/python-rasterstats

zonal_stats does not include expected raster cells

Tinkaa opened this issue · 4 comments

I created a toy xr.DataArray and gpd.GeoDataFrame. When applying zonal_stats to this, the results are not as expected. It identifies 2 cells within one polygon and 1 within the other. I would expect the result to have 4 cells in the one polygon, and 0 in the other. I have tried to dig into the code what causes this but haven't been able to understand it. I am guessing it has something to do with the affine or how the polygons are defined.
When I apply rio.clip, it does work as expected.
Do you know what causes this unexpected behaviour of zonal_stats?

All code is shown below

import xarray as xr
from shapely.geometry import Polygon
import geopandas as gpd
import matplotlib.pyplot as plt
from rasterstats import zonal_stats
import rioxarray

da = xr.DataArray(
[[1,2,3],
       [4,5,6]],
dims=("x", "y"),
coords={"x": [0.5,1.5], "y": [2.5,1.5,0.5]},
)

d = {'geometry': [Polygon([(0, 0), (0,2), (2, 2), (2, 0)]),Polygon([(2,0),(2,2),(3,2),(3,0)])]}
gdf = gpd.GeoDataFrame(d)

zonal_stats(
    vectors=gdf,
    raster=da.values,
    affine=da.rio.transform(),
    stats=["mean","count"],
    all_touched=True,
)

Outputs: [{'mean': 4.5, 'count': 2}, {'mean': 6.0, 'count': 1}]

#plotting to illustrate the data
fig, ax = plt.subplots()
da.plot(ax=ax,x="x",y="y")
gdf.boundary.plot(ax=ax, color='r', lw=1)
ax.set_xlim(0,4)
ax.set_ylim(0,4)

image

Hi Tinka, I am experiencing similar behavior with the zonal_stats function. Have you figured out what the issue was in the meantime?

I'm a colleague of Tinka's, answering on her behalf since she no longer works at our org -- unfortunately no, we have not looked into the issue any further.

Thanks @Tinkaa - excellent bug report and I'm sorry I didn't get to it sooner (I don't use rasterstats professionally anymore so it's challenging for me to keep up with it in my spare time)

I'm able to reproduce this bug; I'll take a look and add this to the test suite!

Figured it out. Rasterio requires arrays in (row, column) order.

The rioxarray DataArray returns its values in (column, row) order. I have no idea why it behaves this way, but it does. IMO it's a bug - every GIS program for decades has used column-row order and switching this is a very bad practice. For now, the only solution is to create your raster arrays directly or transpose the values to get the proper order.

raster=da.values.transpose(),

With that modification, your script prints the correct values

[{'mean': 4.0, 'count': 4}, {'mean': None, 'count': 0}]