JCSDA-internal/eva

Time series in EVA?

Closed this issue · 7 comments

Dooruk commented

I would like to start a discussion regarding EVA handling timeseries and creating plots with time axes. Creating time series plots is something I started doing with JEDI outputs on our end but I would like to do it in a generic way if there is demand/need. I had some brief discussions here with @danholdaway and @asewnath (she suggested that I create an issue) and I would really appreciate EMC's input on this so I'm tagging all who might be interested: @CoryMartin-NOAA @EdwardSafford-NOAA @kevindougherty-noaa @guillaumevernieres @ADCollard.

A simple example plot is the mean (obs - bkgr) over multiple time steps in observation space. The IodaObsSpace class already handles JEDI outputs (location and channel dimensions) in observation space, so this would only be a matter of adding a dateTime dimension. The issue is that EVA has no time series handling, and an improvement in this regard would benefit everyone in terms of DA monitoring. EVA currently reads data, applies the necessary transforms, and makes the plots inside a certain folder (geos_ocean in our case) for a single cycle.

I have constraints in terms of cycling and file storage, so I have to erase JEDI outputs frequently for high-resolution simulations that span a couple of months. Hence, ideally EVA needs to handle files during the active cycle, before they get erased. This may or may not be relevant for the other developers.

Below is what I'm suggesting (based on our workflow):

For the Swell workflow, our run directory currently looks like this (for ocean-only DA cycling):

├── run
│   ├── 20210601T120000Z
│   │   └── geos_ocean
│   ├── 20210601T180000Z
│   │   └── geos_ocean
│   ├── 20210602T000000Z
│   │   ├── forecast
│   │   └── geos_ocean

The forecast directory contains GEOS-related files/outputs, whereas geos_ocean has JEDI-related configs/outputs. EVA runs inside geos_ocean at each cycle and produces fabulous plots.

My suggestion is to have an extra folder (call it diagnostics or holding) at the same level as the cycle-time directories:

├── run
│   ├── diagnostics
│   ├── 20210601T120000Z
│   │   └── geos_ocean
│   ├── 20210601T180000Z
│   │   └── geos_ocean
│   ├── 20210602T000000Z
│   │   ├── forecast
│   │   └── geos_ocean

So EVA would process IODA outputs and create netCDF file(s) within the diagnostics folder with a time dimension (plus channels etc. if needed). Afterwards, it could append to and update these files, or create a new one every so often; you get the idea. This would require some capabilities within EVA in addition to datasets, transforms, and graphics, such as writing output (which may be a time sink unless handled in parallel). Together with the EVA interactive tool, it would be a great addition to DA monitoring.
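To make the layout above concrete, here is a minimal sketch of discovering the cycle directories and appending one summary row per cycle to a file under diagnostics. Everything in it is an assumption for illustration: the function names, the omb_timeseries.csv file name, and the use of CSV instead of netCDF (swapped in only to keep the sketch dependency-free). It is not existing EVA functionality.

```python
import csv
from datetime import datetime
from pathlib import Path

# Assumed naming convention, matching directories like 20210601T120000Z.
CYCLE_FORMAT = "%Y%m%dT%H%M%SZ"


def list_cycles(run_dir):
    """Return sorted (datetime, path) pairs for directories named like a cycle."""
    cycles = []
    for path in Path(run_dir).iterdir():
        if not path.is_dir():
            continue
        try:
            cycles.append((datetime.strptime(path.name, CYCLE_FORMAT), path))
        except ValueError:
            continue  # not a cycle directory, e.g. the diagnostics folder
    return sorted(cycles)


def append_summary(run_dir, cycle_time, mean_omb, count):
    """Append one row for this cycle to a hypothetical diagnostics CSV file."""
    diag_dir = Path(run_dir) / "diagnostics"
    diag_dir.mkdir(exist_ok=True)
    out_file = diag_dir / "omb_timeseries.csv"  # hypothetical file name
    write_header = not out_file.exists()
    with out_file.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(["cycle", "mean_omb", "count"])
        writer.writerow([cycle_time.strftime(CYCLE_FORMAT), mean_omb, count])
    return out_file
```

Because each cycle only appends its own row, the full JEDI outputs for that cycle can be erased as soon as the row is written.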

I'm open to suggestions, thoughts, criticisms..

We're interested (in a parasitic sort of way) for sure, @Dooruk . We run EVA every cycle as well, but I can't say our plots are fabulous yet!

@Dooruk There actually is some capability for time series plotting in eva. The transform in transforms/select_time.py provides a means of selecting a requested variable either at a single time or over a time slice. I've not worked with IodaObsSpace, but MonDataSpace does a lot of what you describe by adding a datetime dimension, which then allows the select_time transform to create selected subsets of data by cycle time(s).

As far as data output from eva, the DA monitoring effort will require something like that eventually. We have been focused so far on using eva to create plots from the legacy DA monitor data, leaving the existing data extraction mechanisms in place for now. Replacing the data extraction will be our last step and we haven't yet charted that out, but would certainly be interested in collaboration if possible.

Thanks @Dooruk . I've thought about this in the past. I was thinking something like the following:

  • OOPS application to read in IODA diagnostics, compute statistics, and produce a much smaller IODA file with just summary information
  • Use EVA to concatenate the above outputs and produce time series from them

Basically, as part of a workflow, the IODA diagnostic files would go into an IODA reader, which would compute counts, mean O-F, std dev, and whatever else is needed, and write out a file whose dimension is "cycle" instead of nlocs. This can either be 1 file/cycle or appended with ncrcat or something like that. Then EVA plots variables from this file on a line plot with minimal changes needed in EVA (maybe only the reader, since IODA requires "nlocs", although I guess we can still use that dimension and then have a cycle/time variable?)
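As a rough illustration of that reduction step only, here is a minimal Python sketch that collapses per-location O-F values into one record per cycle. The function names and the dictionary layout are hypothetical, and the reduction proposed above would live in a compiled OOPS application reading IODA files, which this sketch does not attempt to reproduce.

```python
import math
from statistics import fmean, pstdev


def summarize_cycle(cycle, omf_values):
    """Reduce one cycle's O-F values (dimension nlocs) to summary statistics."""
    valid = [v for v in omf_values if v is not None and not math.isnan(v)]
    if not valid:
        return {"cycle": cycle, "count": 0, "mean": math.nan, "std": math.nan}
    return {
        "cycle": cycle,
        "count": len(valid),
        "mean": fmean(valid),
        "std": pstdev(valid),  # population standard deviation
    }


def build_time_series(per_cycle_omf):
    """Turn {cycle: [omf, ...]} into columns indexed by cycle instead of nlocs."""
    rows = [summarize_cycle(c, v) for c, v in sorted(per_cycle_omf.items())]
    keys = ("cycle", "count", "mean", "std")
    return {key: [row[key] for row in rows] for key in keys}
```

The resulting columns all share the "cycle" dimension, which is the shape a line plot over time needs.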

I guess, what I am saying is, I always saw this more as an IODA problem, not an EVA problem, but if we want to do the preprocessing as part of EVA, I'm open to that approach, I just figured compiled code here would be faster.

Dooruk commented

Thank you for the comments. @EdwardSafford-NOAA, IODA files have a single time step at each cycle, so I'm not sure that would work, but I will keep 'MonDataSpace' in mind.

@CoryMartin-NOAA, I really like this idea, that would save significant time in terms of creating/writing files. I am not familiar with the inner workings of IODA so not sure how much effort is required to tackle it. I would be interested in helping if/when it comes to that.

@AndrewEichmann-NOAA kindly "volunteered" to help put together the OOPS application, @Dooruk.

I like the idea of having an oops application to perform the time series for observations. Thanks for volunteering effort on this @AndrewEichmann-NOAA and @guillaumevernieres.

To have more generic time series capability in eva I think we need to have it at a level above the read and transforms. I started working on this at one point but didn't have time to finish and then wanted to wait until after all of Akira's refactoring, which is now complete. The kind of YAML structure I had in mind was along the lines of:

timeseries:
  times: [time1, time2, ...]
  filenames: [file1, file2, ...]
  dataset_template:
      name: ...
      type: ...
      dataset_specific_things: ....

  # List here the variable you want to compute at each time and keep
  transform_and_keep:
  - transform: minmaxmean
    metrics: [mean, max]
    along_dimension: []
    variable: collection::group::variable

  # List any variables that you want to keep the entire thing each time. Otherwise everything is deleted
  # Somehow add a time dimension, though TBD exactly how.
  keep_variables: []

# More transforms if you like
transforms: 

# Graphics
...

I was thinking that if you had timeseries in the YAML, you wouldn't be able to put anything for dataset; instead, eva would loop over the files of the timeseries and create a dataset for each.

The behaviour would also trigger deleting everything except what is created in transform_and_keep or things you specifically want to keep in keep_variables. This would avoid memory use mounting as more files are read.
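The loop-and-discard behaviour described above could look roughly like the sketch below. It is not eva code: run_timeseries, the read_dataset callable, and the METRICS table are hypothetical names, and a dataset is modelled as a plain dictionary keyed by collection::group::variable.

```python
from statistics import fmean

# Hypothetical mapping from metric names in the YAML to reduction functions.
METRICS = {"mean": fmean, "max": max, "min": min}


def run_timeseries(times, filenames, read_dataset, variable, metrics):
    """Read one file per time, keep only the reduced metrics, drop the rest."""
    if len(times) != len(filenames):
        raise ValueError("times and filenames must have the same length")
    kept = {metric: [] for metric in metrics}
    for filename in filenames:
        dataset = read_dataset(filename)  # full dataset for a single time
        values = dataset[variable]
        for metric in metrics:
            kept[metric].append(METRICS[metric](values))
        # Release the full data before reading the next file, so memory use
        # stays at roughly one dataset plus the small accumulated series.
        del dataset, values
    return {"times": list(times), **kept}
```

Only the accumulated series survives the loop, so the later transforms and graphics would operate on arrays with one entry per time.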

Let me know what you think on this approach. I can work on it while EMC work on the OOPS level approach. The advantage of the Eva way would be that it could be applied to any of the data we can read in Eva. For example accumulating all the convergence rates over many cycles or to make Hovmöller type plots.

Dooruk commented

@danholdaway thanks for chiming in. In terms of specifying times in the YAML, are you thinking this timeseries task would be executed periodically and then create/append an output file, as opposed to every cycle (what I originally had in mind)? Or the task just creates graphics and does not write any files?