ioos/ioos-code-sprint

XPublish: Platform standardization improvements


Project Description

Discuss and decide on standard configurations for xpublish deployments. XPublish is currently being hosted by a few groups as prototypes. To increase adoption, a recommended "standard" deployment option should be documented, along with a standard Dockerfile. We want to continue to enable xpublish to run on many cloud platforms, but it would be nice to make it easier for people to try it without having to create a Python environment.

One opinionated deployment we have been using is XREDS.

Expected Outcomes

  • Dockerfile for running Xpublish
  • Documentation for deploying Xpublish
  • Documentation for running Xpublish locally using Docker
  • Simple way to point to a cloud dataset and host the data
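As a concrete starting point for the Dockerfile outcome, a minimal sketch might look like the following. The `app.py` entrypoint, the package list, and the port are assumptions for illustration, not an official layout; `app.py` would build an `xpublish.Rest` instance around the target dataset and call its `serve()` method.

```dockerfile
# Minimal sketch of a container for serving data with xpublish.
FROM python:3.11-slim

# zarr/fsspec/s3fs are assumed here so the app can open a cloud-hosted store.
RUN pip install --no-cache-dir xpublish zarr fsspec s3fs

# app.py (hypothetical) builds xpublish.Rest({...}) and calls .serve()
COPY app.py /app/app.py
WORKDIR /app

EXPOSE 9000
CMD ["python", "app.py"]
```

Running locally would then cover the "try it without a Python environment" goal:

```shell
docker build -t xpublish-demo .
docker run -p 9000:9000 xpublish-demo
```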

Skills required

Python, Docker

Expertise

Intermediate

Topic Lead(s)

Jonathan Joyce jonmjoyce, Matt Iannucci mpiannucci

Relevant links

https://github.com/xpublish-community/xpublish
https://github.com/asascience-open/xreds

I'd be interested in contributing here (likely remotely).

@cjolson64 I'd like to be involved in contributing (remotely) to this.

Thank you for taking the time to propose this topic! From the Code Sprint topic survey, this has garnered a lot of interest.

Following the contributing guidelines on selecting a code sprint topic, I have assigned this topic to @jonmjoyce . Unless indicated otherwise, the assignee will be responsible for identifying a plan for the code sprint topic, establishing a team, and taking the lead on executing said plan. The first action for the lead is to:

@jonmjoyce, the description in this issue is a little different from the one on the project website. There, you mentioned the following goal:

Initially, we will discuss the long-term vision for the project to help determine the milestones and architecture.

I'd be very interested to hear the group's thoughts on how this development might alter the following schematic:

[Image: DMAC 101 schematic]

That image is pretty dated, but it's still in use. @srstsavage had a hand in the old 52N server back in the day.

As Xpublish + plugins become better tested and more easily deployable, we need a better description (both text and graphical) of how they integrate into and possibly replace part of the tech stack in that picture.

It's beyond the scope of the code sprint, but as more data moves into the cloud—or perhaps into all the clouds—we'll need to think more about data architecture.

  • Is the data architecture the same when considering two use cases:
    • Serving subsets of data via Xpublish and friends or
    • Computing on the data in the cloud through cloud-hosted Jupyter instances?
  • How many copies do we need?
  • What is the tradeoff between cost and performance when replicating data in multiple regions within and across many cloud providers?
  • What can/should be done by NOAA instead of replicated across RAs? (I'm mostly thinking about creating catalogs and owning the workflows to make the data sets more ARCO-ified.)
  • And, of course, my favorite question: are we considering the differences between massive gridded data sets from models and satellites and small but complicated in situ data sets (buoys, grab samples, shore stations, ship-based measurements, gliders, etc.)?

Thanks for your efforts in leading this development, and I look forward to hearing about the group's progress!

Is the data architecture the same when considering two use cases:
Serving subsets of data via Xpublish and friends or
Computing on the data in the cloud through cloud-hosted Jupyter instances?

The at-rest data architecture (the data lake) should be the same. Both workflows can access the cloud-optimized data directly, ideally through catalogs like intake. In addition, Jupyter instances can pull from Xpublish (for example, to get a subset), but more complex calculations should still access the ARCO data directly.
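To make the subsetting path concrete, here is a stdlib-only sketch of how a notebook might address an Xpublish deployment. The host, dataset id, and the `/datasets/{id}/zarr` route are assumptions about one particular deployment; the actual routes depend on which plugins are mounted.

```python
# Sketch of a notebook pulling a subset from an Xpublish server rather than
# reading the full ARCO store. Host, dataset id, and route are assumptions.
BASE = "https://xpublish.example.org"

def zarr_url(dataset_id: str) -> str:
    """URL a notebook could hand to xarray's zarr engine, assuming the
    community zarr plugin is mounted under /datasets/{id}/zarr."""
    return f"{BASE}/datasets/{dataset_id}/zarr"

if __name__ == "__main__":
    # In a Jupyter session (requires xarray and network access):
    # import xarray as xr
    # ds = xr.open_dataset(zarr_url("gfs"), engine="zarr")
    print(zarr_url("gfs"))
```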

How many copies do we need?

Technically, one. What are NOAA's requirements? We can keep one copy and apply policies through the cloud provider to implement cross-region backups, disaster recovery, high availability, etc.
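For example, on AWS S3 a single-copy bucket can be given a replication policy like the fragment below (the role ARN and bucket names are placeholders, and other providers have equivalent mechanisms):

```json
{
  "Role": "arn:aws:iam::123456789012:role/replication-role",
  "Rules": [
    {
      "ID": "dr-copy",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Destination": {
        "Bucket": "arn:aws:s3:::ioos-data-lake-replica",
        "StorageClass": "STANDARD_IA"
      },
      "DeleteMarkerReplication": { "Status": "Disabled" }
    }
  ]
}
```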

What is the tradeoff between cost and performance when replicating data in multiple regions within and across many cloud providers?

  • Storage costs are more or less fixed at $0.01/GB per copy; two copies cost twice as much to store.
  • The big variable is data transfer volume. For example, if everyone starts using XPublish to pull entire data files, egress costs will make that a non-starter. Among trusted partners, we can configure shared networks so that data does not go out over the Internet and does not incur egress costs. The same applies to the raw files, though we currently benefit from the NODD exception.
  • Cross-cloud-provider transfers are also expensive; it's in each vendor's interest to lock you into their network.
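A back-of-envelope model of the bullets above, using the $0.01/GB figure from this thread and an assumed $0.09/GB public-internet egress rate (actual rates vary by provider and tier):

```python
# Storage scales linearly with copies; egress dominates if users pull whole
# files over the public internet. The egress rate is an assumption.
STORAGE_PER_GB = 0.01   # $ per GB per copy (from the discussion)
EGRESS_PER_GB = 0.09    # $ per GB out to the internet (assumed typical rate)

def storage_cost(size_gb: float, copies: int) -> float:
    return size_gb * STORAGE_PER_GB * copies

def egress_cost(transfer_gb: float, shared_network: bool = False) -> float:
    # Shared/peered networks between trusted partners avoid egress charges.
    return 0.0 if shared_network else transfer_gb * EGRESS_PER_GB

# A 10 TB dataset stored twice costs twice as much at rest...
print(storage_cost(10_000, 1))  # 100.0
print(storage_cost(10_000, 2))  # 200.0
# ...but a single full pull over the public internet dwarfs storage:
print(egress_cost(10_000))                       # 900.0
print(egress_cost(10_000, shared_network=True))  # 0.0
```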

What can/should be done by NOAA instead of replicated across RAs? (I'm mostly thinking about creating catalogs and owning the workflows to make the data sets more ARCO-ified.)

Regional data replication in the cloud is one option for heavily used data. But since most data usage is real-time (latest forecasts and obs), it might make sense to cache only the recent products regionally for use by xpublish, and treat the archive as a separate (and cheaper) storage solution. This suggests a tiered approach, with IOOS defining requirements for hot, warm, and cold data.
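One way to express such a tiering policy is an S3 lifecycle configuration like the sketch below; the prefix and day thresholds are placeholders for whatever hot/warm/cold requirements IOOS defines:

```json
{
  "Rules": [
    {
      "ID": "tier-forecasts",
      "Status": "Enabled",
      "Filter": { "Prefix": "forecasts/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```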

And, of course, my favorite question: are we considering the differences between massive gridded data sets from models and satellites and small but complicated in situ data sets (buoys, grab samples, shore stations, ship-based measurements, gliders, etc.)?

We have a good handle on the model workflows now, and I think integrating these other datasets will be a key prototype to explore. We can more or less follow similar patterns, adjusting the underlying ARCO data model to the data (rather than using kerchunk). The workflow is notify -> ARCO -> data lake. Services then pull from the lake; these could be more xpublish plugins supporting those data types, though they don't necessarily need to live in the same project, since the access methods differ from grids.
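The notify -> ARCO -> data lake workflow above can be sketched as three stages. Every function and path here is a hypothetical stub; a real pipeline would consume object-store notifications and write Zarr/Parquet rather than pass dicts around.

```python
# Sketch of notify -> ARCO -> data lake; all names are illustrative stubs.
def on_notify(key: str) -> dict:
    """React to a 'new object' notification (e.g. a queue message)."""
    return {"source_key": key}

def to_arco(raw: dict) -> dict:
    """Convert to an analysis-ready, cloud-optimized representation.
    For grids this might be kerchunk/Zarr; for in-situ data a different
    ARCO model (e.g. Parquet) with the same surrounding workflow."""
    return {**raw, "format": "arco"}

def write_to_lake(arco: dict) -> str:
    """Land the ARCO product in the data lake and return its path."""
    return f"s3://data-lake/{arco['source_key']}.arco"

path = write_to_lake(to_arco(on_notify("gliders/sg123/dive42.nc")))
print(path)  # s3://data-lake/gliders/sg123/dive42.nc.arco
```

Services (xpublish plugins or otherwise) then read from the lake path, keeping serving decoupled from ingestion.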