NOAA-GFDL/MDTF-diagnostics

Add capability to check for pre-existing processed files before running the preprocessor

Opened this issue · 5 comments

What problem will this feature solve?
This will mitigate the overhead of reading and writing files during the preprocessing step every time the framework is run.
Describe the solution you'd like
Per #436

...[A] situation where you run the framework with a specific POD/dataset combination with the --keep-temp flag (and --disable-preprocessor flag if you don't want any of the variable name/metadata modifications) enabled to retain the local copies of the files. Next, you want to re-run the same configuration using the saved local files. With a new flag (e.g., --no-file-rw), the framework would have to search for any preexisting working directories with the desired CASENAME for saved files. If it finds the files, it sets the file path and variable name environment variables to point to the old wk_dir, and skips the preprocessing. If it doesn't find those files, it would run the standard configuration with the preprocessor.
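A rough sketch of the search-and-skip logic being proposed (the helper names, config keys, and environment variable here are illustrative only, not part of the framework):

```python
import glob
import os

def find_saved_case_files(wkdir_root, casename):
    """Look through previously created working directories whose names start
    with CASENAME and return the newest one that contains saved netCDF files."""
    candidates = sorted(glob.glob(os.path.join(wkdir_root, casename + "*")))
    for wk_dir in reversed(candidates):  # newest-looking directory first (lexicographic sort; a simplification)
        files = glob.glob(os.path.join(wk_dir, "**", "*.nc"), recursive=True)
        if files:
            return wk_dir, files
    return None, []

def maybe_skip_preprocessing(config):
    """With --no-file-rw set, reuse saved files if any are found; otherwise
    signal the caller to run the standard preprocessing path."""
    if not config.get("no_file_rw"):
        return False
    old_wk_dir, files = find_saved_case_files(config["WORKING_DIR"], config["CASENAME"])
    if old_wk_dir is None:
        return False  # nothing saved: fall back to the normal preprocessor
    os.environ["DATADIR"] = old_wk_dir  # placeholder for the real path/variable-name env vars
    return True
```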

Additional context
This feature implementation may occur in conjunction with the separation and redesign of the preprocessor. It will also provide an opportunity to incorporate intake_esm catalogs.

This sounds more like what I'm thinking of, but the description in Issue #436 has one component that sounds problematic to me (bolded below).

With a new flag (e.g., --no-file-rw), **the framework would have to search for any pre-existing working directories with the desired CASENAME for saved files**. If it finds the files, it sets the file path and variable name environment variables to point to the old wk_dir, and skips the preprocessing. If it doesn't find those files, it would run the standard configuration with the preprocessor.

Can it just look in the wk_dir it is using for this run? I fear that 1) looking in multiple places will take more time, and 2) it will find old files that aren't applicable.

@bitterbark The framework makes the WK_DIR first, queries the files in MODEL_DATA_ROOT and OBS_DATA_ROOT, preprocesses them by default, and writes copies to WK_DIR. Do you mean that the user would create the wk_dir ahead of time (e.g., /usr/mdtf/MDTF-diagnostics/wkdir/[CASENAME]), copy the preprocessed files to ~/wkdir/[CASENAME]/[pod name]/[model]/[output_frequency], and the framework would search for the files there?
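For reference, the default flow described above in heavily simplified form (a sketch only; the real framework's catalog query and preprocessing steps are far more involved):

```python
import glob
import os
import shutil

def run_default_flow(casename, model_data_root, obs_data_root, wk_dir):
    """Simplified outline: create WK_DIR, find the case's input files under
    MODEL_DATA_ROOT / OBS_DATA_ROOT, and write processed copies into WK_DIR."""
    os.makedirs(wk_dir, exist_ok=True)  # WK_DIR is created first
    inputs = []
    for root in (model_data_root, obs_data_root):
        inputs += glob.glob(os.path.join(root, casename, "**", "*.nc"), recursive=True)
    for path in inputs:
        # the real preprocessor renames variables and edits metadata here
        shutil.copy(path, wk_dir)
```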

I am not intending the user to do anything ahead of time. This is the workflow (user and pre-processor) I have in mind. Keep in mind this is for development, so we intend to run several times (N+1 in the following):
1a. The user runs mdtf as normal.
1b. The preprocessor processes files as normal.
Na. The user runs with the --no-file-rw flag as well as overwrite = true
Nb. The pre-processor doesn't make a new WK_DIR because it already exists. It checks that the needed files are in WK_DIR. Because it already ran (1b), it finds them and returns without doing anything.

Note that the user can have both flags on in step 1 as well, because when the pre-processor checks for the processed files and doesn't find them, it can then search for model output and make them, much as it currently checks for the WK_DIR and doesn't make a new one when overwrite = true.
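Something like the following guard would cover both the step-Nb case and the first run (function and flag names are only illustrative):

```python
import glob
import os

def preprocess_if_needed(wk_dir, run_preprocessor, no_file_rw=False, overwrite=False):
    """With --no-file-rw and overwrite=true, reuse processed files already in
    WK_DIR; if none are found (a first run), fall back to normal preprocessing."""
    if no_file_rw and overwrite:
        existing = glob.glob(os.path.join(wk_dir, "**", "*.nc"), recursive=True)
        if existing:
            return existing  # step Nb: files are there, return without doing anything
    # step 1b, or a --no-file-rw run where nothing has been saved yet:
    # query the model output and write the processed files as usual
    return run_preprocessor(wk_dir)
```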

Okay, 1a, 1b, and Na fit with the current paradigm. Nb is where I'm lost. Unless --overwrite is specified, a new WK_DIR will be created, where WK_DIR is [root]/wkdir/[CASENAME]_[version number]. The _[version number] is appended if [root]/wkdir/[CASENAME] already exists, and increments by 1 each time the framework is run if [root]/wkdir is not cleaned by the user. The only way for Nb to work is if the framework queries previously created [root]/wkdir directories with CASENAME (or perhaps just the directory with CASENAME_[current version number - 1]) that contain saved processed data files.
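If the framework did have to look back at earlier directories, the versioning scheme described above makes that a directory-name scan rather than a full file search (a sketch only; the suffix format is assumed from the description):

```python
import os
import re

def latest_existing_wk_dir(wkdir_root, casename):
    """Return the [root]/wkdir/[CASENAME] or [CASENAME]_[version] directory with
    the highest version number, or None if none exist (illustrative helper)."""
    if not os.path.isdir(wkdir_root):
        return None
    pattern = re.compile(re.escape(casename) + r"(?:_(\d+))?$")
    best, best_version = None, -1
    for name in os.listdir(wkdir_root):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(wkdir_root, name)):
            version = int(match.group(1)) if match.group(1) else 0
            if version > best_version:
                best, best_version = os.path.join(wkdir_root, name), version
    return best
```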

But yes, I think it should be required to stay in the same wk_dir in order to do this (as in overwrite=true). This is for development, running the same thing over and over again in order to tweak/debug so I think it is normal to have overwrite=true.

The only reason to use a new wk_dir is to start from a clean slate, so in that case the pre-processor should not use previously made files. In fact, allowing it to use old files could lead to lots of errors if the user thinks they have a fresh wk_dir but an old file that was found elsewhere on the disk is being used instead.

Hopefully this makes the implementation easier. There doesn't need to be a search for files: the pre-processor presumably knows which file it is going to write, so it can just look for exactly that file.

And there could be a warning if you set overwrite=false and --no-file-rw=true that the combination isn't going to do anything.
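Both points could be handled in one small check: test for the exact file the preprocessor intends to write, and warn when the flag combination can never reuse anything (names and defaults here are illustrative, not the framework's API):

```python
import logging
import os

_log = logging.getLogger(__name__)

def can_reuse_output(expected_path, no_file_rw=False, overwrite=False):
    """Return True if the preprocessor's exact intended output file already
    exists and may be reused under the proposed --no-file-rw behavior."""
    if no_file_rw and not overwrite:
        _log.warning("--no-file-rw has no effect with overwrite=false: a fresh "
                     "WK_DIR cannot already contain processed files.")
        return False
    return no_file_rw and os.path.isfile(expected_path)
```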