ioos/ioos-code-sprint

[Project Proposal]: ERDDAP data harvesters/translators (aka Getting data from ERDDAP into NOAA Operational systems)

Opened this issue · 24 comments

Project Description

Problem Statement: RAs are leaning hard on ERDDAP to standardize data access and archiving and to make data human- and machine-readable. Some federal entities (USGS, NWS, CO-OPS) are interested in some of our datasets (maybe for operational integration, but often for other purposes, such as feeding other products they create, e.g. the CO-OPS coastal inundation dashboard), but they have strict requirements for how they ingest data.

Code Need: Harvesters or translators that are built to grab RA data from their ERDDAPs and transform it to meet the federal entity's specifications.

Expected Outcomes

1-2 packages that pull data from ERDDAP and transform them into a desired format.

Skills required

experience with python

Difficulty

Easy

Topic Lead(s)

unknown

Relevant links

No response

what are the "desired format"s?

@MathewBiddle ditto - what are the "desired" formats? Echoing so I am notified of reply. Thanks

The "desired" formats are TBD. This issue came up at the IOOS Spring Meeting. Essentially, all the RAs are using ERDDAP, and while some federal programs do pull from there (e.g., NDBC), there are others who want or would want our data but have been unwilling to harvest from ERDDAP (the ones mentioned at the meeting, though not fully verified, are NWS and USGS). I have not had those conversations directly myself. But I could do some outreach to those who made this plea at the meeting and try to get a meeting with the data folks from 2-3 of those programs to define why they don't harvest from ERDDAP, so the code sprint could create harvesters/translators that work for each group.

@RJCarini - Thanks. Besides making the sprint more efficient, knowing the formats may clarify if they are just not understanding how to use ERDDAP, if it is something that ERDDAP can't do, or if it is something that a client they are using can't do. For example, they may be using a python client and the client does not provide options for format download.
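To illustrate the point about format options: in ERDDAP's tabledap protocol, the response format is chosen by the file-type suffix on the dataset ID in the request URL, so a client that hard-codes one suffix hides the others. The following is a minimal sketch of building such URLs by hand. The server address, dataset ID, and variable names are hypothetical placeholders, not real endpoints.

```python
from urllib.parse import quote


def tabledap_url(server, dataset_id, file_type, variables, constraints=()):
    """Build an ERDDAP tabledap request URL.

    The file-type suffix (e.g. "csv", "nc", "json") selects the response format.
    """
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode each constraint, leaving "=" readable.
        query += "&" + quote(constraint, safe="=")
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.{file_type}?{query}"


# Hypothetical server and dataset, for illustration only.
SERVER = "https://erddap.example.org/erddap"
DATASET = "hypothetical_buoy"
VARIABLES = ["time", "sea_water_temperature"]
CONSTRAINTS = ["time>=2024-01-01T00:00:00Z"]

csv_url = tabledap_url(SERVER, DATASET, "csv", VARIABLES, CONSTRAINTS)
nc_url = tabledap_url(SERVER, DATASET, "nc", VARIABLES, CONSTRAINTS)
print(csv_url)
print(nc_url)
```

The same query returns CSV or netCDF depending only on the suffix, which may help separate "ERDDAP can't produce this format" from "our client never asks for it".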

No criticism here, I would just like to get a better understanding of the issues. Thanks.

100% agree. I threw this idea down so that I wouldn't forget about it in the chaos of writing the IRA proposals. This thread has reminded me to reach out to other RA people and ask who they were talking to so we can figure out specifics. If we can't get to the specifics, it's not a useful project.

@RJCarini I think I misunderstood our original conversation. I thought you were talking about a difficulty pulling from various sources to integrate into the NANOOS ERDDAP. Is that a different issue or a non-issue?

@dpsnowden That's a non-issue. Troy's team is great at writing harvesters that pull data from wherever we need to. What I was hearing at the meeting was some concern around how successful the IOOS RAs can be at getting federal programs to absorb the RA-produced data. I don't have a specific example in my head anymore, but I remember stories from San Diego where one RA said they tried to get a federal program to ingest data from the RA ERDDAP and they either didn't want to or decided they couldn't interact with ERDDAP. Again, I need to go do some homework on this to make it a worthwhile project idea. Without specifics, this isn't an actionable project.

Rather than gathering the specifics before the project, perhaps we could identify RAs (or others) willing to commit to participating in this project as co-designers, partnered with the technical team members?

@RJCarini I know there has been some work on this topic in other venues, so perhaps the topic has evolved. I agree with @7yl4r that having a list of the specific data sets and interested staff is an excellent first step. The general goal of getting all relevant observational data into the operational systems is very important to us (i.e., assimilated into numerical models or visible in the NWS AWIPS system). The solution is usually half technical and half knowing the right people to ask. Let's talk about this more at the DMAC workshop.

You're also right that there are two parts to the problem. 1) Getting the data into the operational systems, and 2) getting the modelers to use them.

I am following up on some of yesterday's breakout session conversations. Much of the conversation focused on getting data from the RA-hosted ERDDAPs through NDBC and the operational (or developmental) NOAA modeling systems. We discussed two ways to make data available to NOAA operations at NCEP.

  1. Get observations on the Global Telecommunications System. This makes data available to any National operational center (in the US, that's NWS and Navy). We will focus on NWS.
  2. Additionally, a more direct route from NDBC to the NCEP data tanks doesn't require going to the GTS to get to NCEP. While this might be available, we all agreed that going through GTS was better since it published data to a broader audience. FWIW, some of the HFR data is pulled directly from the NDBC THREDDS server into the NCEP data tanks for assimilation into the WCOFS model, bypassing the NWS Telecommunications Gateway.

Below is a general schematic showing the data flow. (No idea why that figure is obnoxiously large!).

The step between NDBC --> NWSTG involves translating from an ERDDAP output (e.g., CF-netCDF) to a WMO-compatible message type (Traditional Alphanumeric Codes or BUFR). This might be the focus of this code sprint. Can we create a community package that translates from CF-netCDF to BUFR (TAC is being deprecated and isn't an option), one that we can all contribute to and that NDBC can use?
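As a rough sketch of what the first half of such a translator might look like, the snippet below parses ERDDAP `.csv` output (which carries a column-name row followed by a units row) and converts temperature to kelvin, the unit BUFR Table B uses for temperature. The sample text, column names, and values are made up for illustration, and the function names are hypothetical. The BUFR encoding itself is deliberately left out and would be handed to an existing library.

```python
import csv
import io


def read_erddap_csv(text):
    """Parse ERDDAP '.csv' output: row 1 is column names, row 2 is units."""
    rows = list(csv.reader(io.StringIO(text)))
    names, units, data = rows[0], rows[1], rows[2:]
    return dict(zip(names, units)), [dict(zip(names, row)) for row in data]


def to_kelvin(value, unit):
    """Convert a temperature to kelvin; only a few common unit spellings are handled."""
    if unit in ("degree_C", "degrees_C", "degC", "Celsius", "celsius"):
        return round(float(value) + 273.15, 2)
    if unit in ("K", "kelvin"):
        return float(value)
    raise ValueError(f"unsupported temperature unit: {unit!r}")


# Made-up sample in the shape of an ERDDAP .csv response (illustrative values only).
SAMPLE = """time,latitude,longitude,sea_water_temperature
UTC,degrees_north,degrees_east,degree_C
2024-01-01T00:00:00Z,47.5,-124.7,9.8
"""

units, records = read_erddap_csv(SAMPLE)
for record in records:
    # Normalize units here; a BUFR library would take over from this point.
    record["sea_water_temperature_K"] = to_kelvin(
        record["sea_water_temperature"], units["sea_water_temperature"]
    )
print(records[0])
```

Keeping unit normalization separate from encoding would let the same front end feed more than one output format.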

Another use case for this project comes from another IRA observing goal. In addition to Spotter Buoys, RAs will deploy land-based water level sensors. The WMO standard for water level data is not BUFR. Unfortunately, it's SHEF, but the concepts are similar. CO-OPS has experience with this and may be interested in helping. SECOORA has been leading the planning for this, and there is an active NWS customer who needs the data, so this is a high priority. The data flow into the Hydrology side of NWS is different, which may argue for considering these two separate projects. Regardless of whether we choose to tackle the wave buoy and/or water level data flow problem as a code sprint project or not, it's important to IRA and we're going to need to figure it out.
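For the water level case, a SHEF `.A` record (single station, multiple parameters) is a short text line, so the formatting step could be sketched roughly as below. This assumes a UTC time zone code of `Z` and an `DHhhmm` observation time. The station ID is a hypothetical placeholder, and `HG` is used only as a stand-in physical element code; the correct code for coastal water level is an assumption that would need to be confirmed with NWS or CO-OPS.

```python
from datetime import datetime, timezone


def shef_a_line(location_id, obs_time, pe_code, value):
    """Format one observation as a SHEF '.A' record in UTC.

    obs_time must be timezone-aware; it is converted to UTC before formatting.
    """
    if obs_time.tzinfo is None:
        raise ValueError("obs_time must be timezone-aware")
    t = obs_time.astimezone(timezone.utc)
    return f".A {location_id} {t:%Y%m%d} Z DH{t:%H%M}/{pe_code} {value:.2f}"


# Hypothetical station ID and placeholder PE code, for illustration only.
line = shef_a_line(
    "XXXX1",
    datetime(2024, 1, 15, 12, 30, tzinfo=timezone.utc),
    "HG",
    3.45,
)
print(line)
```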

Projects to build on:

  • BUFR Topic on GitHub
  • Toolbox (bufrtools) developed by Axiom Data Science (@kwilcox, @srstsavage, @kthyng) to encode Animal Telemetry tracks into BUFR
  • ERDDAP - We've talked about having ERDDAP output bufr/grib/shef directly but I don't think it got very far (@rmendels does this ring a bell?)
  • Is this the future for operational data assimilation? IODA
  • ECMWF Pandas to bufr tool

[Figure: spotter-to-GTS data flow schematic]

To further what I mentioned in the breakout session, any buoy or coastal land-based station data (excluding water level data) can and should be on the RA ERDDAP servers for NDBC to pull. From there, NDBC will run the data through our real-time system to create the BUFR messages. Those BUFR messages will then be pushed to the NWSTG, which pushes the data to the GTS. So there's a split that happens at NWSTG that sends the data to the NCEP data tanks as well as to the GTS. I've included the data flow diagram from my slides during the breakout session.

[Figure: NDBC_IOOS_dataflow, the data flow diagram from the breakout session slides]

We're actually ingesting a few GLOS spotter buoys that were deployed in the lakes over the winter season. What that means is that we have an established pathway to get spotter buoy data through NDBC to NWSTG/GTS. That particular path will most likely change from SFTP to the GLOS ERDDAP (Seagull) sometime this summer/fall, once we are able to finalize that transition.

My hesitation with having the BUFR data already created by ERDDAP is that we can't then ingest that into our real-time system. For new datasets that aren't buoy or coastal land-based stations (like ATN), that would be a great option because that data does not go through our real-time system.

My other concern is with the newer BUFR templates and their compatibility with AWIPS. We found out after the fact that AWIPS is very behind on being able to ingest BUFR messages when we started pushing out our west coast Saildrone in BUFR only. We're actually having to work with the AWIPS team on a work-around by creating TAC messages that will stay within NWSTG only, just so AWIPS can actually see the data. Lesson learned that even though NDBC is ready to release BUFR messages, other parts of NWS are not ready yet. I heard that AWIPS might be updated with the new BUFR templates maybe by the end of 2024.

I would add only two comments to the one above from @dpetraitis-noaa. The first concerns this statement:

"FWIW, some of the HFR data is pulled directly from the NDBC THREDDS server into the NCEP data tanks for assimilation into the WCOFS model, bypassing the NWS Telecommunications Gateway."

Note that this is done using OPeNDAP, and any ERDDAP can serve as an OPeNDAP endpoint, so if they can pull from TDS they can pull from an ERDDAP. We know of other modelers who pull from ERDDAP. My hunch is that it would take a lot less effort to work with the modeling centers to pull from ERDDAP than to implement the other suggestions.

The second is the Navy is working on some new readers and writers for ERDDAP having to do with BUFR and GRIB, but I am not certain which formats are for which end. I also don't know which parts they can make public, you would have to ask them.

Just contributing a use case that arose in a meeting I had yesterday: the FDA is building a food safety data portal that would seek to harvest from the SCCOOS ERDDAP.

@ianbrunjes Do we know which data formats they'd be looking for? Any that ERDDAP doesn't already offer?

I don't know yet, but I would expect ERDDAP suite to be sufficient/adaptable enough. I can follow up when I've met with their programmer.

I highly doubt the FDA is looking for either BUFR or GRIB, and there are any number of ways to harvest from an ERDDAP. If people at the FDA need help, let us know. Also clarify with them what they mean by harvest. If it is for use in a web page, they can embed the access within the page. Either way, if there are any questions or problems, let us know.

@dpetraitis-noaa From your recent post, you don't need another BUFR converter tool for this specific purpose (Spotter wave buoys to the NWSTG, GTS, and AWIPS). NDBC currently has the capability, and there is no need for a new development effort. Is that true?

Is there anything else you'd like the RAs to know about NDBC's ability to scale up quickly? If 50 new buoys were deployed in the next year, would NDBC be able to process the increased volume? Are there differences in the metadata format of different Spotter buoys? Is there anything related to the process of getting new WMO IDs? I want to write a short HOWTO doc for our website so the RAs know how to do this.

@RJCarini If we don't need new software, then this may not be a code sprint topic. We do have the outstanding task to put you and others in touch with the wave modelers at NCEP/EMC to discuss the sampling mode issue you raised. That probably shouldn't be a Github issue here, I'll introduce that topic over email.

We still need to figure out how to encode and distribute water level data.

@dpsnowden Agree. I don't think the issue of getting the Backyard Buoys Spotter wave data to NDBC and then to GTS needs a Code Sprint. Seems like the components are all in place, but there is some person-to-person legwork we should pursue, like you mentioned.

@dpetraitis-noaa I am curious about Derrick's question re: scale and Backyard Buoys. And Derrick, I think the howto doc you mentioned would be much appreciated.

@dpsnowden We can work together on a howto doc for the RAs regarding the spotter buoy data. Bottom line is that the data and metadata needed is no different than any other moored buoy. As for scaling up, that's something NDBC would have to look at to make sure we can handle that type of increase. Offhand, I don't think it would be that much of an issue but our IT personnel would know better than me.

For this year's Code Sprint, does someone want to take the lead on this topic? Will this be a topic to execute during the sprint or somewhere else?

Expectations for topic leads: Leads are expected to identify a plan for the code sprint topic, establish a team, and take the lead on executing said plan.

For more information on how topics will be selected see the contributing guide

@MathewBiddle After the DMAC annual meeting, we decided we don't need a Sprint for this. It is already taken care of; we just need to communicate with NDBC about new assets. @dpsnowden @dpetraitis-noaa Am I right?

Thanks, @RJCarini. I will mark this as not executing for this sprint.

@RJCarini Pretty sure the answer is yes: we don't need this, and we need to communicate more with NDBC. That said, we haven't really progressed on how to do this. I'm not sure if this is something we need to do soon, or if we can wait until the IRA grants are actually funded and the working groups start making plans.

@dpsnowden The Backyard Buoys DMAC team has been making good progress converging to its ERDDAP dataset structure and metadata fields. Once we have a few sample sets for buoys that are deployed this spring with the intention of being maintained annually, we will reach out to @dpetraitis-noaa and Bill to assign WMO Platform Codes and to test ingestion. I'm not sure of the timing of all of this, but we will keep chugging along with the funds we have and keep our fingers crossed that IRA will come through and bring the full dream to life.