ioos/compliance-checker

Requests to ERDDAP TableDAP and Desired Response Format

Closed this issue · 3 comments

Per the discussion in the ERDDAP Google Group,

https://groups.google.com/d/msg/erddap/kyYF8wcNeME/p5TSngqPAwAJ

The default ERDDAP response to a format-agnostic TableDAP request is essentially a raw OPeNDAP sequence. The resulting dataset has s. prepended to all of the variable names:

<xarray.Dataset>
Dimensions:                                                  (s: 27617)
Dimensions without coordinates: s
Data variables:
    s.time                                                   (s) datetime64[ns] ...
    s.latitude                                               (s) float64 ...
    s.longitude                                              (s) float64 ...
    s.z                                                      (s) float64 ...
    s.backscatter_intensity_2651_a                           (s) float64 ...
    s.sea_water_velocity_to_direction_2651ds_a               (s) float64 ...
    s.sea_water_speed_2651ds_a                               (s) float64 ...
    s.eastward_sea_water_velocity_2651ds_a                   (s) float64 ...
    s.northward_sea_water_velocity_2651ds_a                  (s) float64 ...
    s.sea_water_pressure_cm_time__standard_deviation_2651_a  (s) float64 ...
    s.sea_water_pressure_2651_a                              (s) float64 ...
    s.sea_water_temperature_2651_a                           (s) float64 ...

(dataset used in Google group example, courtesy R. Signell)

This can cause several checks to fail for at least two reasons:

  1. If a variable s.a has an attribute that refers to another variable by name (say b), and a check must look up b, that check fails: b no longer exists, having been replaced by s.b

  2. The dimensions of the OPeNDAP sequence are almost guaranteed not to be CF DSG-compliant
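The first problem could in principle be patched by renaming, though the second would remain. A minimal sketch (the helper name and variable names are illustrative, not part of the checker):

```python
# Minimal sketch of un-prefixing: build a rename mapping from the
# sequence-prefixed names back to the originals. The mapping could then
# be passed to, e.g., xarray's Dataset.rename().
def build_rename_map(names, prefix="s."):
    """Map each prefixed variable name (e.g. 's.time') to its unprefixed form."""
    return {name: name[len(prefix):] for name in names if name.startswith(prefix)}
```

This would only repair the name-based attribute references in point 1; the sequence dimension itself would still fail the DSG checks in point 2.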

This raises the all-important question: What format should we request when supplying an ERDDAP URL to the Compliance Checker?

ERDDAP can provide the same dataset in 44 different formats. Typically, when running the checker against a dataset, the .ncCF (NetCDF CF Discrete Sampling Geometries, Contiguous Ragged Array) or .ncCFMA (NetCDF CF DSG, Multidimensional Array) format works best, since the binary data is encoded in such a way that the features have a single instance dimension. Other formats, such as .nc (a plain, table-like NetCDF-3 format), align the variables using different dimensions (this does not alter the geospatial location of the data in any way), which may cause the dataset to fall out of compliance.

I am wary of saying, "Let's just request the .ncCF or .ncCFMA," because I am not yet convinced that changing the format doesn't equate to changing the data. If it did, could compliance simply be reached by reshuffling the dimensions? I think it's an important point to discuss.

Furthermore, requesting the .ncCF or .ncCFMA format isn't as easy as simply appending the extension to the ERDDAP URL -- the file will probably have to be created locally on each run of the checker. See this preliminary result:

$ python cchecker.py -t ioos "https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCF"          
Running Compliance Checker on the datasets from: ['https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCF']
syntax error, unexpected $end, expecting ';'
context: Error { code=404; message="Not Found: Currently unknown datasetID=WQB-04.ncCF";}^
Traceback (most recent call last):
  File "cchecker.py", line 231, in <module>
    sys.exit(main())
  File "cchecker.py", line 205, in main
    options=options_dict)
  File "/home/dalton/compliance-checker/compliance_checker/runner.py", line 65, in run_checker
    ds = cs.load_dataset(loc)
  File "/home/dalton/compliance-checker/compliance_checker/suite.py", line 716, in load_dataset
    return self.load_remote_dataset(ds_str)
  File "/home/dalton/compliance-checker/compliance_checker/suite.py", line 727, in load_remote_dataset
    return Dataset(ds_str)
  File "netCDF4/_netCDF4.pyx", line 2321, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1885, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -90] NetCDF: file not found: b'https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCF'
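One way around this would be to download the .ncCF response to a temporary local file first and then open that with netCDF4, since the extension URL is a plain file download rather than an OPeNDAP endpoint. A rough sketch only (error handling and cleanup omitted; this is not how the checker currently works):

```python
import tempfile
import urllib.request

def fetch_erddap_file(url):
    """Download an ERDDAP file response (e.g. .ncCF) to a local temp file
    and return its path, suitable for opening with netCDF4.Dataset()."""
    suffix = url[url.rfind("."):]  # keep the extension, e.g. ".ncCF"
    with urllib.request.urlopen(url) as resp, \
            tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(resp.read())
        return tmp.name

# path = fetch_erddap_file("https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCF")
# ds = netCDF4.Dataset(path)  # now opens as an ordinary local NetCDF file
```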

@daltonkell Good description of the situation.

I am wary of saying, "Let's just request the .ncCF or .ncCFMA," because I am not yet convinced that changing the format doesn't equate to changing the data. If it did, could compliance simply be reached by reshuffling the dimensions? I think it's an important point to discuss.

For the case of the 'Single Platform' rule in question, I don't think working with any of ERDDAP's various formats should change the data for a non-compliant dataset such that it could be made to be in compliance.

Take a DSG timeSeries dataset with one 'station dimension' (call it station):

In the .nc version of the file, ERDDAP simply expands the variables that vary along the station dimension, repeating them for each step of the time-varying dimension (call it time). So if you have a dataset with variables:

double station(station=1);
double z(station=1);
double latitude(station=1);
double longitude(station=1);
double sea_water_temperature(station=1, time=2621);

The result is that the same value is repeated 2621 times for each of station, z, latitude, and longitude in the .nc output. Because this dataset represents one DSG timeSeries feature (aka station), it doesn't (or shouldn't) change position. If you had a file with different values for each of station, z, latitude, and longitude, it would logically have to have a station dimension > 1 and would therefore fail our 'Single Platform' dimensionality test (with the one exception we make for timeSeries, mentioned below).
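To illustrate with made-up numbers: the repeated station-level values in the expanded .nc layout collapse back to a single unique row, so the expansion can't manufacture a second platform.

```python
# Made-up values for a single-station dataset in the expanded .nc layout:
# the one (latitude, longitude, z) tuple is repeated once per time step.
n_times = 2621
station_rows = [(41.5, -70.7, 10.0)] * n_times

# Collapsing to unique rows recovers the true station dimension of 1.
# Genuinely different positions would yield more than one unique row,
# forcing a station dimension > 1 and failing the 'Single Platform' test.
unique_rows = set(station_rows)
assert len(unique_rows) == 1
```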

TimeSeries example:

Here's a test dataset I created that also shows this (although in this case the 'station dimension' is 2, because it represents two different sensor depths on the same platform). The principle is the same, though: in the .nc output format of the file (shown by just the ERDDAP header response), the station variable (with cf_role=timeseries_id) has a dimension length of 5242:

http://testing.erddap.axds.co/erddap/tabledap/sun2wave_timeseries_micah.ncHeader

In the .ncCFMA output format, however, it has a dimension length of 2 (one for each sensor height):

http://testing.erddap.axds.co/erddap/tabledap/sun2wave_timeseries_micah.ncCFMAHeader

This dataset therefore has 'station dimension' = 2, and only meets the 'Single Platform' rule in our profile because of the special exception we make for timeSeries datasets to accommodate sensors at different heights on the same platform. The lat/lon values could technically vary as well, but they shouldn't, since it's the same physical 'platform'.

The htmlTable output from the dataset shows what I mean (note the repeated values for those variables):

http://testing.erddap.axds.co/erddap/tabledap/sun2wave_timeseries_micah.htmlTable?time,station,z,latitude,longitude,sea_water_velocity_to_direction,sea_water_speed,sea_water_temperature,sea_surface_wave_significant_height,peak_wave_period&time%3E=2018-11-01T17%3A52%3A00Z&time%3C=2018-11-10T17%3A52%3A00Z

So, as long as you test either .ncCF or .ncCFMA, you're getting the proper 'station' dimension representation of the dataset from ERDDAP, however many stations there are.

Internally, ERDDAP looks for whichever variable is labeled with cf_role and uses its unique values to determine the length of the station dimension(s) to include in the .ncCF and .ncCFMA outputs.
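That lookup can be sketched in a few lines. The mapping-of-variables interface below is a stand-in for a real netCDF4.Dataset (whose variables expose attributes like cf_role as Python attributes and values via slicing), not ERDDAP's or the checker's actual code:

```python
def station_dimension_length(variables):
    """Find the variable carrying a cf_role attribute and count its unique
    values, mirroring how ERDDAP sizes the station dimension for the
    .ncCF/.ncCFMA outputs. `variables` maps names to variable objects with
    an optional .cf_role attribute and values retrievable via [:]."""
    for var in variables.values():
        if getattr(var, "cf_role", None) is not None:
            return len(set(var[:]))
    raise ValueError("no variable with a cf_role attribute found")
```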

So, that's a long way of saying I don't see any other options. Hopefully this helps make the point, though.

Also, regarding:

If a variable s.a has an attribute which refers to another variable by name, b, and must use b in a check, b no longer exists but instead is replaced by s.b

Using one of these two types should solve the s.variable_name problem as well, shouldn't it? Otherwise, not sure what can be done to handle that.

@mwengren This is a great explanation, thank you!

Using one of these two types should solve the s.variable_name problem as well, shouldn't it? Otherwise, not sure what can be done to handle that.

Yes, for sure. This takes the raw OPeNDAP output and turns it into the conventional format we're used to. I'm going to look into how we might facilitate something like this; we may have to get creative. Pinging @benjwadams and @Bobfrat to keep them in the loop.

Closing this issue as it relates to #801