ecmwf/polytope

TileDB Support

itsgifnotjiff opened this issue · 8 comments

Is your feature request related to a problem? Please describe.

I wonder if we can use something else as the data source Polytope queries. Zarrs, Kerchunked files are fine, but TileDB seems to be a pretty strong alternative.

Describe the solution you'd like

Ideally I would be able to query TileDB Arrays using Polytope's API.

Describe alternatives you've considered

I know you can use GDAl, Rasterio, Fiona, PDAL, Geopandas, Xarray and others to query data extracted from a TileDB Array, but I would love to be able to query the Array itself using Polytope.

Additional context

The GRIB Jump base is awesome but almost every Meteorological Center around the world has their own format that they need to either cast to GRIB or implement the logic you've implemented. Also from what I understand Polytope works on the GRIBs but TileDB can host Raster, Tabular, Vector, Imagery and Observational dense and sparse data in the same Array so I imagine being able to ask for the Model Outputs Raster, Warning Zones Vector, Station and Satelite Observational and Satelite Imagery data within a 4D Polygon (Flight path) and get all relevant information.

Organisation

Environment and Climate Change Canada

Hi @itsgifnotjiff, thanks for your feature request! I'm not familiar with TileDB, @mathleur will check out the docs and see if its doable.

We already have an xarray backend for polytope, is it possible to use xarray as a layer to read from a TileDB? I've seen this https://github.com/TileDB-Inc/TileDB-CF-Py but no experience with it.

Thank you for your work and consideration Mr. @jameshawkes.

Yes they have integrated with xarray by specifying xr.open_dataset(uri, engine="tiledb"). But I am really curious to see if the query can not be pushed down to their C++ backend itself.

I think Mrs. Julia Dark @jp-dark , Mr. Isaiah Norton @ihnorton or Mr. Nick Kules can confirm if it is doable.

Hi @itsgifnotjiff ,

@mathleur has made a minimal example showing use of tileDB though the xarray engine:
https://gist.github.com/mathleur/7314e497d4fdd0224a3a3a8d92ba58d2

Would be good to understand if there are disadvantages to this xarray engine.

Mr. @jameshawkes and Mrs. @mathleur , this is very promissing and I will try and run some benchmarks on a 4Tb NWP Model Outputs TileDB Array. Is there any special querries you would suggest to stress test Polytope?

I would start simple with some point-profiles (e.g. time-series, vertical profiles) and some area extractions (e.g. boxes, polygons). We know that the xarray backend is not highly optimised, so I expect we will see some scaling issues at first, but we are actively working on this.

Hi @itsgifnotjiff and @jameshawkes, happy to help answer any questions regarding TileDB. You may see some decreases in performance compared to accessing TileDB directly since there are a few more steps going through the xarray backend but it should be completely doable workflow wise. The example from @mathleur looks good.

Yordan if you're using some of the datasets you've already ingested into TileDB it would just be using everything after opening the dataset in xarray (line 44 onward in the example). If you do notice that performance is outside what you'd find acceptable we're happy to help investigate to figure out where the bottleneck might be.

@jameshawkes and @mathleur sorry to bother you but i have a TileDB Array with a couple of TB of data in RottatedPole that looks like the image below and I am trying to construct some querries but understandably Polytope documentation does not give a lot of examples of how to build different queries.

If it is not too much to ask can you help me build a

  • flight path
  • Canada
  • bounding box
  • point,
  • vertical profile

and any other query you think would be pertinent to test please?

image

or even better let's use this cube

image

slicer = HullSlicer()
polytope_API = Polytope(datacube=ds.TT, engine=slicer)

canada_bbox = [-141, -50, 39, 83],

box1 = Box(["lon", "lat"], [-141, -50], [39, 83])
request = Request(box1)
result = polytope_API.retrieve(request)
print([leaf.result for leaf in result.leaves])

but I get

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 2
      1 slicer = HullSlicer()
----> 2 polytope_API = Polytope(datacube=ds.TT, engine=slicer)
      4 canada_bbox = [-141, -50, 39, 83],
      6 box1 = Box(["lon", "lat"], [-141, -50], [39, 83])

File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/polytope.py:49, in Polytope.__init__(self, datacube, engine, axis_options, datacube_options)
     46 if datacube_options is None:
     47     datacube_options = {}
---> 49 self.datacube = Datacube.create(datacube, axis_options)
     50 self.engine = engine if engine is not None else Engine.default()

File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/backends/datacube.py:156, in Datacube.create(datacube, axis_options, datacube_options)
    153 if isinstance(datacube, (xr.core.dataarray.DataArray, xr.core.dataset.Dataset)):
    154     from .xarray import XArrayDatacube
--> 156     xadatacube = XArrayDatacube(datacube, axis_options, datacube_options)
    157     return xadatacube
    158 else:

File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/backends/xarray.py:37, in XArrayDatacube.__init__(self, dataarray, axis_options, datacube_options)
     35         if self.dataarray[name].dims == ():
     36             options = axis_options.get(name, None)
---> 37             self._check_and_add_axes(options, name, values)
     38             treated_axes.append(name)
     39 for name in dataarray.dims:

File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/backends/datacube.py:76, in Datacube._check_and_add_axes(self, options, name, values)
     74     DatacubeAxis.create_standard(name, values, self)
     75 elif name not in self._axes.keys():
---> 76     DatacubeAxis.create_standard(name, values, self)

File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/datacube_axis.py:108, in DatacubeAxis.create_standard(name, values, datacube)
    105 @staticmethod
    106 def create_standard(name, values, datacube):
    107     values = np.array(values)
--> 108     DatacubeAxis.check_axis_type(name, values)
    109     if datacube._axes is None:
    110         datacube._axes = {name: deepcopy(_type_to_axis_lookup[values.dtype.type])}

File ~/.conda/envs/polytope_exp/lib/python3.11/site-packages/polytope/datacube/datacube_axis.py:120, in DatacubeAxis.check_axis_type(name, values)
    116 @staticmethod
    117 def check_axis_type(name, values):
    118     # NOTE: The values here need to be a numpy array which has a dtype attribute
    119     if values.dtype.type not in _type_to_axis_lookup:
--> 120         raise ValueError(f"Could not create a mapper for index type {values.dtype.type} for axis {name}")

ValueError: Could not create a mapper for index type <class 'numpy.float32'> for axis pres

Hi @itsgifnotjiff, the main point for Polytope shapes is to define shapes on all dimensions of the datacube. In your second datacube, this would be forecast, lat and lon. So to define:

  • a flight path: Request( Path([“lat”, “lon”], Box[“lat”, “lon”], [0,0], [1,1]), flight_points), Select(“forecast”, [0])) (Note that this will be static in time and altitude because of the provided datacube, but when there are time and altitude dimensions available, a flight path should be a 4D path, which is a sweep of a 4D box along 4D flight points)

  • Canada: Request(Polygon([“lat”, “lon”], canada_points), Select(“forecast”, [0]))

  • a bounding box: Request(Box([“lat”, “lon”], [0,0], [1,1]), Select(“forecast”, [0]))

  • a point: Request(Point([“lat”, “lon”], [[0,0]], method=“surrounding”), Select(“forecast”, [0)) (Note that this will take all the surrounding points to the requested point available on the datacube)

  • a vertical profile: This is difficult to achieve with the provided datacube, but it would probably be a Point in the lat/lon dimensions “augmented” by a Select in the “forecast” dimension (like in all the other examples above) and a Span in an altitude dimension.

Note also that the datacube needs to be defined as an Xarray DataArray instead of a Dataset currently.

Finally, to solve the shown ValueError, we've updated the develop branch, so it should now work! For any similar ValueErrors, each type of datacube dimension value needs to have an associated DatacubeAxis defined in the _type_to_axis_lookup dictionary (at the bottom of the /polytope/datacube/datacube_axis.py file). In this case, for example, the type np.float32 had be added as a FloatDatacubeAxis().

I hope this helps but I would be happy to answer more questions and will update the documentation to make these steps clearer!