perrygeo/python-rasterstats

Performance drop due to platform type check

albantor30 opened this issue · 4 comments

Since version 0.16.0, zonal_stats is very slow on large datasets. This is due to the use of platform.architecture() in gen_zonal_stats (main.py - line 184) to check if we are on a 64 bit platform. The previous method using sysinfo.platform_bits was much faster.

Here is a code snippet illustrating the huge timing difference:

$ python -m timeit -n 5 "import numpy.distutils.system_info as sysinfo; [sysinfo.platform_bits is None for _ in range(2000)];"
5 loops, best of 3: 109 usec per loop
$ python -m timeit -n 5 "import platform; [platform.architecture()[0] is None for _ in range(2000)];"
5 loops, best of 3: 5.39 sec per loop

This has been observed on Linux (Ubuntu 18.04, 20.04, 21.10, CentOS 7)

If using platform is important, platform.machine() seems to be much faster.

# If we're on 64 bit platform and the array is an integer type
# make sure we cast to 64 bit to avoid overflow.
# workaround for https://github.com/numpy/numpy/issues/8433
if masked.dtype != np.int64 and \
        issubclass(masked.dtype.type, np.integer) and \
        platform.machine().endswith('64'):
    masked = masked.astype(np.int64)
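For context, the cast that this check guards against exists because statistics on small integer arrays can silently wrap via modular arithmetic (the numpy issue linked in the comment). A minimal illustration, with arbitrary example values:

```python
import numpy as np

# Two int32 values whose true sum (2**31) exceeds the int32 range.
small = np.array([2**31 - 1, 1], dtype=np.int32)

# Forcing an int32 accumulator demonstrates the wraparound.
wrapped = int(small.sum(dtype=np.int32))   # wraps to -2**31
safe = int(small.astype(np.int64).sum())   # correct: 2**31
```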

However, according to https://docs.python.org/3/library/platform.html, it is more reliable to use

sys.maxsize > 2**32
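A drop-in version of the snippet above using the `sys.maxsize` test might look like this (the `ensure_64bit` helper name is hypothetical; in rasterstats the check is inline in `gen_zonal_stats`):

```python
import sys

import numpy as np


def ensure_64bit(masked):
    """Upcast small integer arrays to int64 on 64-bit platforms.

    Sketch of the guard in gen_zonal_stats, with the platform check
    replaced by sys.maxsize > 2**32, a cheap constant comparison.
    """
    if (masked.dtype != np.int64
            and issubclass(masked.dtype.type, np.integer)
            and sys.maxsize > 2**32):
        masked = masked.astype(np.int64)
    return masked
```

Since `sys.maxsize` never changes at runtime, the comparison could also be hoisted to a module-level constant, though the per-call cost is already negligible.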

As for the timings:

$ python -m timeit -n 100 "import numpy.distutils.system_info as sysinfo; [sysinfo.platform_bits == 64 for _ in range(10000)];"
100 loops, best of 3: 492 usec per loop
$ python -m timeit -n 100 "import platform; [platform.machine().endswith('64') for _ in range(10000)];"
100 loops, best of 3: 2.55 msec per loop
$ python -m timeit -n 100 "import sys; [sys.maxsize > 2**32 for _ in range(10000)];"
100 loops, best of 3: 470 usec per loop
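A plausible explanation for the slowness: per the Python docs, `platform.architecture()` may invoke the external `file` command to inspect the interpreter binary, paying subprocess overhead on every call, whereas `sys.maxsize` is a plain attribute lookup. The cheap check can be measured from within Python using the `timeit` module (the 100,000-iteration count is arbitrary):

```python
import sys
import timeit

# Time 100k evaluations of the recommended check; totals should be
# on the order of milliseconds, since it is a single comparison.
elapsed = timeit.timeit("sys.maxsize > 2**32",
                        globals={"sys": sys},
                        number=100_000)
print(f"100k checks: {elapsed:.4f} s")
```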

I can confirm the performance drop in 0.16.0. I recently upgraded all dependencies of a project that had been running well for years, and it suffered a roughly 25x slowdown: processing that used to take ~10 hours had been running for days and still wasn't finished. After looking into it in more detail, it narrowed down to zonal_stats(), and rolling back to 0.15.0 resolved the issue.

I didn't narrow it down further to the specific cause above, but it is most likely the same issue.

I created a pull request that implements the sys.maxsize > 2**32 check.

For completeness, I did some further testing, and this issue can indeed explain the full performance degradation I am seeing:

  1. The way I use rasterstats (reading the raster file into an ndarray before calling zonal_stats), 0.16.0 gives a 20x performance hit due to the new 64-bit check (0.15.0: 1 s, 0.16.0: 20 s for 20,000 polygons).
  2. If you let zonal_stats re-read the file for every geometry, the hit is a factor of ~4 (0.15.0: 6.8 s, 0.16.0: 29 s for 20,000 polygons).

Finally, rasterio 1.2.10 + shapely 1.8.2 is about 50% slower than rasterio 1.2.8 + shapely 1.8.0 for the second case, so there appears to be a performance regression in newer versions of those libraries as well.

@theroggy @albantor30 Thanks for the analysis. It's very surprising that platform.architecture is such a resource hog; I didn't expect that. I agree with the sys.maxsize check and will take a look at the PR shortly. Moving the conversation over to #258