pik-primap/primap2

sparse data


We need more efficient handling of sparse data. Having to think about sparseness for every data set is worse than just having algorithms which handle sparse data.

In xarray (https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html?highlight=sparse#integrating-with-duck-arrays and https://github.com/pydata/xarray/blob/main/xarray/tests/test_sparse.py) this functionality is still experimental, but it could be a pretty drop-in change. Maybe we should try making a primap2 array sparse and check whether our tests still pass; that would be great!
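
A minimal sketch of such a test, assuming the sparse package is installed (the coordinate names and values here are made up, not primap2 conventions):

import numpy as np
import sparse
import xarray as xr

# A mostly-empty array: only two of the twelve (time, area) combinations are filled.
dense = np.full((3, 4), np.nan)
dense[0, 1] = 1.0
dense[2, 3] = 2.0

# Wrap a sparse.COO duck array in a DataArray; xarray keeps the data sparse.
da = xr.DataArray(
    sparse.COO.from_numpy(dense, fill_value=np.nan),
    dims=("time", "area"),
    coords={"time": [2000, 2001, 2002], "area": ["DEU", "FRA", "ITA", "ESP"]},
)

print(type(da.data))   # a sparse COO array, not a numpy array
print(da.data.nbytes)  # only the two stored values and their indices

# Many xarray operations work transparently on the duck array; where they
# don't, densify explicitly:
dense_values = da.data.todense()

Running the primap2 test suite against arrays built like this would tell us quickly how experimental the integration really is.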

I set high priority for the test described above. If it turns out to be more work, the priority is medium, as I can currently usually avoid memory issues by reducing dimensionality early.

Because sparse data was not originally a design goal for primap2, I expect it to be quite some work. If the xarray drop-in I highlighted above works, we still get the maintenance burden of using experimental functionality in production. If the drop-in doesn't work, we probably have to fix things in xarray itself.

I think it would be reasonable to test the experimental functionality a bit, get a feel for how experimental it truly is, and then decide what we will do.

Another strategy for solving this problem is a “divide and conquer” approach. We do this already when processing e.g. by country, so that dimensions don't blow up. There is now an interesting project within the xarray ecosystem that provides tools to do exactly that: https://github.com/xarray-contrib/datatree . One of their examples is CMIP6, where each model has slightly different variables, resolutions, etc. Using datatree, the data is grouped into one datatree consisting of multiple xarray datasets, where each dataset uses its own dimensions, but operations on the whole datatree are still possible.
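
A rough sketch of what that could look like, assuming the datatree package (the dataset names and contents here are invented):

import numpy as np
import xarray as xr
from datatree import DataTree

# Two sources with different dimensionality: one resolved by category, one not.
ds_crf = xr.Dataset(
    {"CO2": (("time", "category"), np.random.rand(3, 2))},
    coords={"time": [2000, 2001, 2002], "category": ["1.A", "1.B"]},
)
ds_edgar = xr.Dataset(
    {"CO2": (("time",), np.random.rand(3))},
    coords={"time": [2000, 2001, 2002]},
)

# Group them into one tree; each node keeps its own dimensions, so no
# NaN padding across sources is needed.
dt = DataTree.from_dict({"CRF": ds_crf, "EDGAR": ds_edgar})

# Operations are mapped over all nodes, each with its own dimensionality.
annual_means = dt.mean(dim="time")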

Worth checking out whether this is a viable solution for us.

Definitely worth looking at. It would introduce a further distinction between types of metadata (nodes such as source, scenario, and model; variables, i.e. entity; and dimensions like category, time, ...), making some functionality more complicated to implement, but keeping large datasets as one object without having to split them manually might make up for that.

At least for the "primap database" with all bells and whistles, with data from CRF, EDGAR, etc. all in the database, we need something which is higher-level than an xarray dataset anyway, I think. But that is not a current problem (while the sparseness problem is a current problem).

Another option to consider: copy/vendor something out of scmdata.

These docs in scmdata show how we deal with the sparsity issue in a slightly opaque way (search for 'sparse' to find the most relevant section): https://scmdata.readthedocs.io/en/latest/notebooks/netcdf.html#splitting-your-data

A better example is probably the one below. The key output is that the array is 10 times smaller and has 11 times fewer NaNs when you use the extras trick (the non_sparse_xr version below) instead of naively using all metadata as dimensions. I think we should consider how we can do the same trick with primap data to avoid the memory issues we've discussed. The scmdata implementation code is here (lots of steps, but not terribly written): https://github.com/openscm/scmdata/blob/7813392bf974ce9454975423518c9249200cd1f9/src/scmdata/_xarray.py#L11C45-L11C51

We'd have to test how it actually works, but it might 'just work' as xarray does some smart things with collapsing dimensions and multi-indexing in my experience.

Example code with scmdata: something like this shows the sparsity handling

import numpy as np
from scmdata import ScmRun, run_append


generator = np.random.default_rng(0)


def new_timeseries(  # noqa: PLR0913
    n=100,
    count=1,
    model="example",
    scenario="ssp119",
    variable="Surface Temperature",
    unit="K",
    region="World",
    cls=ScmRun,
    **kwargs,
):
    """
    Create an example timeseries
    """
    data = generator.random((n, count)) * np.arange(n)[:, np.newaxis]
    index = 2000 + np.arange(n)
    return cls(
        data,
        columns={
            "model": model,
            "scenario": scenario,
            "variable": variable,
            "region": region,
            "unit": unit,
            **kwargs,
        },
        index=index,
    )


large_run = []
# 10 runs for each scenario
for sce in ["ssp119", "ssp370", "ssp585"]:
    large_run.extend(
        [
            new_timeseries(
                count=3,
                scenario=sce,
                variable=[
                    "Surface Temperature",
                    "Atmospheric Concentrations|CO2",
                    "Radiative Forcing",
                ],
                unit=["K", "ppm", "W/m^2"],
                paraset_id=paraset_id,
            )
            for paraset_id in range(10)
        ]
    )

large_run = run_append(large_run)

# also set a run_id (often we'd have paraset_id and run_id,
# one which keeps track of the parameter set we've run and
# the other which keeps track of the run in a large ensemble)
large_run["run_id"] = large_run.meta.index.values

# This is super sparse because run_id and paraset_id have lots of non-overlapping
# combinations
sparse_xr = large_run.to_xarray(dimensions=["scenario", "run_id", "paraset_id"])
print(f"{sparse_xr=}")
print(f"{sparse_xr.sizes=}")
print(f"{sparse_xr.isnull().sum()=}")

# The extras argument allows you to get around this by effectively saying which
# dimensions can be inferred from others. With this information, we can avoid
# the sparsity issue
non_sparse_xr = large_run.to_xarray(dimensions=["scenario", "run_id"], extras=["paraset_id"])
print(f"{non_sparse_xr=}")
print(f"{non_sparse_xr.sizes=}")
print(f"{non_sparse_xr.isnull().sum()=}")

print(
    "Ratio of nulls in sparse vs. non-sparse: "
    f"{sparse_xr.isnull().sum() / non_sparse_xr.isnull().sum()}"
)

@znicholls Nice, thanks

Just to understand the strategy, I'll try to summarize it, and you can tell me if I misunderstood:

  1. We identify the sparse dimensions. Generally, a sparse dimension is one where, if you project it out, the relative sparsity of the array decreases significantly.
  2. Then, we identify all combinations of the sparse dimensions which are actually filled with data.
  3. We assign a primary key to these data points; the only important thing is that the values are unique, otherwise we don't care about the contents of the primary key. Small integers are probably a good choice for readability, but UUIDs would also be fine technically.
  4. We now use this primary key as the dimension in the xarray.DataArray, and demote the other dimensions it replaces to mere coordinates which use the new primary key as their dimension.
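
In code, I imagine roughly something like this (purely illustrative and hand-rolled, not scmdata's actual implementation):

import numpy as np
import pandas as pd
import xarray as xr

# Long format: only the (scenario, paraset_id) combinations that actually exist.
records = pd.DataFrame(
    {
        "scenario": ["ssp119", "ssp119", "ssp370"],
        "paraset_id": [0, 1, 0],
        "value": [1.0, 2.0, 3.0],
    }
)

# Steps 2 + 3: enumerate the filled combinations and assign a small-integer primary key.
records["run_id"] = np.arange(len(records))

# Step 4: run_id is the dimension; the sparse metadata become non-dimension
# coordinates defined along it.
da = xr.DataArray(
    records["value"].values,
    dims=("run_id",),
    coords={
        "run_id": records["run_id"].values,
        "scenario": ("run_id", records["scenario"].values),
        "paraset_id": ("run_id", records["paraset_id"].values),
    },
)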

Right?

That's quite nifty. If you use set_xindex() to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates:

In [16]: non_sparse_xr = non_sparse_xr.set_xindex("paraset_id")

In [17]: non_sparse_xr.loc[{"paraset_id": 8}]
Out[17]: 
<xarray.Dataset>
Dimensions:                         (time: 100, run_id: 9, scenario: 3)
Coordinates:
  * time                            (time) object 2000-01-01 00:00:00 ... 209...
  * run_id                          (run_id) int64 24 25 26 54 55 56 84 85 86
  * scenario                        (scenario) object 'ssp119' 'ssp370' 'ssp585'
  * paraset_id                      (run_id) int64 8 8 8 8 8 8 8 8 8
Data variables:
    Atmospheric Concentrations|CO2  (scenario, run_id, time) float64 nan ... nan
    Radiative Forcing               (scenario, run_id, time) float64 nan ... ...
    Surface Temperature             (scenario, run_id, time) float64 0.0 ... nan
Attributes:
    scmdata_metadata_region:  World
    scmdata_metadata_model:   example

This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage.

@JGuetschow do you have a nice sparse dataset I can play with? I can of course easily build an artificial sparse dataset, but if you happen to just have something around which is real data and sparse, I could use that and be closer to real usage than with something artificially generated.

Any of the output/BUR_NIR_combined_all_XXXX datasets in the UNFCCC_data repo (it's private but you should have access)

Right?

You got it (and whether that primary key is another dimension or something we add automatically doesn't really matter, but from memory it was a bit of mucking around to automate).

If you use set_xindex() to add an index on the primary key, you can even select efficiently and easily on the sparse coordinates

That's cool. Maybe xarray has thought about this problem too and has an even better solution; I don't think that API existed when we first wrote our implementation, and it seems quite specific to this kind of handling...

This could honestly be a low-effort route that we can drop in with only 1 or 2 days of work and see how it works out in daily usage

Nice

So I played around a little with this while waiting for Excel things in ndcs land.

It works in principle, but there are definitely some things to work out still:

  • The auxiliary dimension has to remain visible as a coordinate, otherwise alignment will not work at all. Also it has to be a pandas MultiIndex if I understand things correctly.
  • This means we have to standardize the naming of the auxiliary dimension and allow it in the data format. Not difficult, but work to do.
  • This also needs special-casing in conversions to interchange format - hardly surprising, but still work to do.
  • Problematic: for alignment during arithmetic and similar operations, the auxiliary dimension is used, not the coordinates defined on it. In practice that means combining multiple datasets which both have auxiliary dimensions has to be done with care to make sure the auxiliary dimensions do not overlap. Maybe we can define a convention which works surprisingly often (something like encoding the values of the real coordinates into the auxiliary dimension, see the sketch after this list), but in general I fear there might be surprises. Sure, non-dimension coordinates are a standard thing in xarray, but I'm not sure they are supported as well as proper dimension coordinates.
  • We also need to think a bit about the point when we define auxiliary coordinates - do it too early, and you have to re-define them when combining multiple datasets (see prior point). Do it too late and you are already wasting lots of memory before conversion.
  • A central API where we would need to check that it can be made to work is [da.pr.set](https://primap2.readthedocs.io/en/stable/usage.html#Setting-data).
  • Stuff like da.pr.loc does not work yet, because it works on dimensions, not coordinates. Probably rather easy to fix.
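
For the overlap problem in the alignment point above, a minimal sketch of encoding the real coordinate values into the auxiliary dimension (make_aux_index is a hypothetical helper, not existing primap2 API):

import numpy as np
import xarray as xr


def make_aux_index(ds, sparse_coords, aux_dim="aux_id"):
    """Re-label the auxiliary dimension with the sparse coordinate values themselves,
    so independently built datasets get labels that only collide where they should."""
    labels = [
        "|".join(str(ds[c].values[i]) for c in sparse_coords)
        for i in range(ds.sizes[aux_dim])
    ]
    return ds.assign_coords({aux_dim: labels})


# Two datasets with the auxiliary-dimension layout discussed above:
ds_a = xr.Dataset(
    {"CO2": (("aux_id",), np.array([1.0, 2.0]))},
    coords={
        "aux_id": [0, 1],
        "scenario": ("aux_id", ["ssp119", "ssp370"]),
        "category": ("aux_id", ["1.A", "1.A"]),
    },
)
ds_b = xr.Dataset(
    {"CH4": (("aux_id",), np.array([3.0]))},
    coords={
        "aux_id": [0],  # would silently align with ds_a's 0 if left as-is
        "scenario": ("aux_id", ["ssp119"]),
        "category": ("aux_id", ["1.A"]),
    },
)

ds_a = make_aux_index(ds_a, ["scenario", "category"])
ds_b = make_aux_index(ds_b, ["scenario", "category"])

# Merging now aligns on the encoded labels instead of arbitrary integers.
merged = xr.merge([ds_a, ds_b])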

The alignment of auxiliary coordinates between datasets is a critical point, because merging two datasets is currently the main memory intensive task.

Yes, it would be good to understand whether the workflow is "200 country-specific dense datasets are combined in one step into one sparse dataset" or whether we need to efficiently combine two sparse datasets. There will always be a compromise between alignment speed, memory usage, and merging speed (e.g. if merging two sparse datasets is just "appending more rows", as in pandas long format, merging speed is good, memory usage is okay-ish, and alignment speed is terrible).
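
To make that tradeoff concrete with a toy example (not primap2 code): appending is a cheap concat, while alignment needs a pivot back into a dense layout.

import pandas as pd

# Long format: merging two sparse datasets is just appending rows, which is cheap.
long_a = pd.DataFrame({"area": ["DEU", "DEU"], "year": [2000, 2001], "value": [1.0, 2.0]})
long_b = pd.DataFrame({"area": ["FRA"], "year": [2000], "value": [3.0]})
merged = pd.concat([long_a, long_b], ignore_index=True)

# But anything that needs alignment first has to pivot into a wide/dense layout,
# which is slow and re-introduces the NaN padding.
aligned = merged.pivot(index="area", columns="year", values="value")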

Both example workflows are used already. And often there is both "appending" and alignment going on in one workflow (though usually in different steps, so one could convert back and forth between formats). But if usage starts to be a hassle because of converting back and forth it's also a problem.

Yeah, that just means we need to find a reasonable compromise, and can't optimize for alignment or merging alone. So, neither pandas long format is good for us (we knew that from our initial assessment of different data formats in 2020) nor dense xarray (we learned this the hard way after deciding data formats). And I hope we can find a solution where the user doesn't have to convert between data formats a lot.