xarray-contrib/xbatcher

Performance benchmarks


Xbatcher is meant to make it easy to generate batches from Xarray datasets and feed them into machine learning libraries. As we wrote in its roadmap, we are also considering various options to improve batch generation performance. I think it's clear to everyone that naively looping through arbitrary xarray datasets will not be sufficiently performant for most applications (see #37 for examples / discussion). We need tools/models/etc. to handle things like caching, shuffling, and parallel loading, and we need a framework to evaluate the performance benefits of added features.

Proposal

Before we start optimizing xbatcher, we should develop a framework for evaluating performance benefits. I propose we set up ASV and develop a handful of basic batch generation benchmarks. ASV is used by Xarray and a bunch of other related projects. It allows writing custom benchmarks like this:

example 1:

import os

import numpy as np
import xarray as xr


class HugeAxisSmallSliceIndexing:
    # https://github.com/pydata/xarray/pull/4560
    def setup(self):
        self.filepath = "test_indexing_huge_axis_small_slice.nc"
        if not os.path.isfile(self.filepath):
            xr.Dataset(
                {"a": ("x", np.arange(10_000_000))},
                coords={"x": np.arange(10_000_000)},
            ).to_netcdf(self.filepath, format="NETCDF4")

        self.ds = xr.open_dataset(self.filepath)

    def time_indexing(self):
        self.ds.isel(x=slice(100))

    def cleanup(self):
        self.ds.close()

We could do the same here, but with a focus on batch generation. As we talk about adding performance optimizations, I think this is the only way we can begin to evaluate their benefits.
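
For xbatcher, a first batch-generation benchmark could simply time a full pass over a BatchGenerator. Here's a rough sketch in the same ASV style (the class name, dataset shape, and batch size below are placeholders, not a settled design):

import numpy as np
import xarray as xr

import xbatcher


class TimeBatchGeneration:
    def setup(self):
        # Synthetic dataset; real benchmarks would also vary size, chunking, etc.
        self.ds = xr.Dataset(
            {"foo": (("time", "x"), np.random.rand(10_000, 100))},
            coords={"time": np.arange(10_000), "x": np.arange(100)},
        )

    def time_batch_generation(self):
        # Time one full pass through the generated batches.
        for batch in xbatcher.BatchGenerator(self.ds, input_dims={"time": 10}):
            pass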

Is there a way to have a public record of the benchmarks? I'm thinking of something like what https://codecov.io is to pytest-cov. I found airspeed-velocity/asv#796, which is a GitHub Actions solution, but was wondering if there's a nicer way to track performance over time on a line chart with each merged PR.

There's no current public record. I didn't prioritize publishing results because it seemed like the lack of dedicated, consistent hardware would be a barrier to producing useful records. But https://labs.quansight.org/blog/2021/08/github-actions-benchmarks suggests that GitHub Actions could be sufficient to identify performance changes >50%.

That's a really nice blog post, thanks for sharing! The GitHub Actions setup doesn't look trivial though 😅 I did find https://github.com/benchmark-action/github-action-benchmark, but it doesn't support asv (yet). Maybe we should find a way to piggyback onto https://pandas.pydata.org/speed/xarray?

After #168 we'll have a pretty good suite of benchmarks.

The following two tasks remain for closing out this issue:

  • Periodically run benchmarks in CI to identify any issues with the asv setup or performance regressions
  • Configure asv to compare performance across new Xarray releases, since xbatcher's performance is so closely tied to Xarray's

We're starting to experiment with using pytest-codspeed at PyGMT for benchmarking (see GenericMappingTools/pygmt#2910 and GenericMappingTools/pygmt#2908). CodSpeed seems to solve the hardware-inconsistency problem by measuring CPU cycles and memory accesses instead of execution time, though this can be less intuitive in some cases, since more CPU cycles doesn't always mean slower execution time.

If there's interest, I can help with setting up the CI infrastructure for CodSpeed this year. This would require some refactoring of the current benchmarks from ASV to pytest-benchmark, but it would allow us to track performance benchmarks publicly, much like Codecov does for coverage (see https://codspeed.io/explore), rather than having to compare runs locally. Thoughts, anyone?
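
For reference, a refactored benchmark might look roughly like the sketch below, which uses the pytest-benchmark benchmark fixture (pytest-codspeed is compatible with it); the test name, dataset shape, and batch size are placeholders:

import numpy as np
import xarray as xr

import xbatcher


def test_batch_generation(benchmark):
    # pytest-benchmark (and pytest-codspeed) provide a `benchmark` fixture
    # that measures the wrapped callable.
    ds = xr.Dataset(
        {"foo": (("time", "x"), np.random.rand(10_000, 100))},
        coords={"time": np.arange(10_000), "x": np.arange(100)},
    )

    def generate_batches():
        for batch in xbatcher.BatchGenerator(ds, input_dims={"time": 10}):
            pass

    benchmark(generate_batches)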

I recently started using CodSpeed for ndpyramid after reading your comment and it seems really neat! I agree that it could work well for xbatcher, since needing to run the benchmarks locally is a barrier to use. It's also nice that the same code can be used for tests and benchmarks. Fully support you trying it for xbatcher!