E3SM-Project/datasm

Warehouse PostProcess does not respect destination options


The warehouse state-machine "postprocess" workflow (currently dedicated to climo and timeseries generation) appears to ignore the "-p" (publication destination) option. When the source material (given by "-w") is the pub_root, it writes its products only to the dataset's path under pub_root. More seriously, unlike "warehouse publish", it makes no attempt to avoid overwriting existing files or directories at the destination, nor to create a "next higher" version directory for its output. It adds to (and may clobber) files in the existing destination directory. A related problem is that the behavior appears to differ between atmos timeseries, land timeseries, and climos: atmos timeseries datafiles are placed into a "v0.*" sequence of fractional-version directories (à la "warehouse") and left there even though v1 and v2 published directories already exist. (This last may be due to a "silent abort" in the state machine, which failed to advance to a final roll-up of the data.)
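To make the expected "next higher version directory" behavior concrete, here is a minimal sketch of how the output directory could be chosen without touching existing versions. The helper names (`next_version_dir`, `prepare_output_dir`) are hypothetical and not part of datasm, and the sketch assumes that published versions are whole-number directories such as `v1` and `v2`, while fractional `v0.*` directories are warehouse-style intermediates that should not count.

```python
import re
from pathlib import Path

# Assumption: published versions look like "v1", "v2", ...;
# fractional names such as "v0.1" are deliberately not matched.
_VERSION_RE = re.compile(r"^v(\d+)$")


def next_version_dir(dataset_dir: Path) -> Path:
    """Return dataset_dir/vN, where N is one above the highest whole-number version found."""
    highest = 0
    if dataset_dir.is_dir():
        for child in dataset_dir.iterdir():
            match = _VERSION_RE.match(child.name)
            if child.is_dir() and match:
                highest = max(highest, int(match.group(1)))
    return dataset_dir / f"v{highest + 1}"


def prepare_output_dir(dataset_dir: Path) -> Path:
    """Create the next version directory, failing rather than reusing an existing one."""
    target = next_version_dir(dataset_dir)
    # exist_ok=False raises FileExistsError instead of silently adding to a directory.
    target.mkdir(parents=True, exist_ok=False)
    return target
```

Failing loudly with `FileExistsError` is a design choice in this sketch: an aborted run is easier to diagnose than a destination directory that was silently added to.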

A related issue is the unfortunate attempt to be "flexible" about whether the source and destination of the files used and generated by PostProcess are warehouse or publication paths. The proposal is to settle on publication (pub_root) directories as the default location for datasets intended for eventual publication, and to follow such placement immediately with a mapfile-generation step, rather than generating the mapfile in the warehouse and requiring automated editing of its content upon publication. This should reduce "variant" forms of processing and lead to more predictable results. In any case, a generated mapfile should never need to be edited to "appear" as if its hashes were produced at a different location.
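As an illustration of why generating the mapfile at the final location removes the need for later edits, the sketch below hashes files where they actually reside and records their resolved paths. The function names are hypothetical, the `*.nc` filter is an assumption, and the pipe-separated line layout is only modeled on the general shape of an ESGF-style mapfile; it is not taken from datasm and should be checked against the real format.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datafiles are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mapfile_lines(dataset_id: str, version_dir: Path) -> list[str]:
    """Build one line per datafile, using the path the file has at publication time."""
    lines = []
    for data_file in sorted(version_dir.glob("*.nc")):
        resolved = data_file.resolve()
        stat = resolved.stat()
        lines.append(" | ".join([
            # Assumed layout: dataset id, '#', then the version without its leading 'v'.
            f"{dataset_id}#{version_dir.name.lstrip('v')}",
            str(resolved),
            str(stat.st_size),
            f"mod_time={stat.st_mtime}",
            f"checksum={sha256_of(resolved)}",
            "checksum_type=SHA256",
        ]))
    return lines
```

Because the recorded path and the checksum both come from the file in its published location, nothing in the output needs to be rewritten afterwards.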