pangeo-forge/pangeo-forge-recipes

Support incremental appending

rabernat opened this issue · 9 comments

Currently, when a recipe is run, it always caches all of the inputs and writes all of the chunks. However, it would be nice to have an option where, if the target already exists, only the NEW chunks are written. This raises some design questions.

  • Currently, the target is never read until we start to execute the recipe (not until the prepare_target stage). However, for this to work, the iter_inputs() and iter_chunks() methods need to know which inputs and chunks to process. In order to build the pipeline for execution, this information needs to already be inside the recipe object. So this implies that we need to open the target in __post_init__. Could this cause problems?
  • How do we align the recipe with the target? For the standard NetCDFZarrSequential recipe, it may be as simple as comparing the length of the sequence dimension: if the target has 100 items but the recipe has 120, we assume the last 20 need to be appended (see the sketch below). But are there edge cases to worry about?
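To make both questions concrete, here is a minimal sketch of opening the target in __post_init__ and applying the sequence-length heuristic. It assumes a fixed number of items per input; the names (AppendableRecipe, pending_inputs, n_items_per_input) are hypothetical and not part of the current API.

from dataclasses import dataclass
from typing import List

import xarray as xr


@dataclass
class AppendableRecipe:
    input_urls: List[str]
    target_path: str
    sequence_dim: str = "time"
    n_items_per_input: int = 1  # assumes each input contributes a fixed number of items

    def __post_init__(self):
        # Open the target at recipe-construction time, so the pipeline
        # already knows which inputs remain to be processed.
        try:
            n_existing = xr.open_zarr(self.target_path).sizes[self.sequence_dim]
        except Exception:  # target does not exist yet: process everything
            n_existing = 0
        # The sequence-length heuristic: skip inputs whose items
        # are already in the target.
        self.pending_inputs = self.input_urls[n_existing // self.n_items_per_input :]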

This intersects a bit with the "versioning" question in #3.

If we agree on the answers to the questions above, I think we can move ahead with implementing incremental updates to the NetCDFZarrSequentialRecipe class.

Appending is significantly more complicated for the case discussed in #50: variable items per input. In this case, we don't know the size of the target dataset from the outset, so we can't use simple heuristics like the one I proposed above to figure out the append region.

Maybe recipes can implement their own methods for examining the target and determining which input chunks are needed? For that, it seems like the recipe would have to know more about the inputs than just a list of paths. For instance, it might have to understand time. What if input_urls were a dictionary rather than a list, and the keys held some semantic meaning that could be used to compare to the target?
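As a rough sketch of that idea, assuming the keys are timestamps and the target has a time coordinate (all names and paths here are illustrative only):

import pandas as pd
import xarray as xr

# The keys carry semantic meaning: the time each input covers.
input_urls = {
    pd.Timestamp("2021-01-01"): "s3://bucket/input-20210101.nc",
    pd.Timestamp("2021-01-02"): "s3://bucket/input-20210102.nc",
}

target = xr.open_zarr("s3://bucket/target.zarr")
already_written = set(pd.to_datetime(target["time"].values))
# Only inputs whose key is absent from the target need processing.
pending = {t: url for t, url in input_urls.items() if t not in already_written}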

Maybe instead of looking at what has been produced, we could look at what has been consumed? For instance, after running a recipe, we could store the list of input files that were processed, so that the next time the recipe is run with a new list, we can look at the already-processed input files and restore a "resuming" state. That could be done by having a "dry run" that executes the recipe without actually producing anything. We would still get the list of all processed input files, which might be needed when we "finalize" the target.
I don't know where we could store the list of processed input files; probably alongside the target, which seems the most natural. Tying a recipe to a target on the source side doesn't seem right.
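A minimal sketch of that bookkeeping, assuming a JSON manifest stored next to the target (the manifest location and layout are assumptions):

import json

import fsspec

all_input_urls = ["s3://bucket/a.nc", "s3://bucket/b.nc"]  # the new list for this run
manifest_url = "s3://bucket/target-processed-inputs.json"  # assumed location, alongside the target

fs, _, (path,) = fsspec.get_fs_token_paths(manifest_url)
processed = set()
if fs.exists(path):
    with fs.open(path) as f:
        processed = set(json.load(f))

# Restore the "resuming" state: only unseen inputs need to run.
new_inputs = [u for u in all_input_urls if u not in processed]

# ... run (or dry-run) the recipe on new_inputs, then record them:
with fs.open(path, "w") as f:
    json.dump(sorted(processed | set(new_inputs)), f)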

This is a good idea, David. Perhaps we could store the list of input files directly in the target dataset metadata itself (attrs). This would be useful for incremental appending but also for general provenance tracking.
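For a Zarr target, a sketch of that could look like the following; the attribute name pangeo-forge:inputs is made up here:

import xarray as xr

input_urls = ["s3://bucket/a.nc", "s3://bucket/b.nc"]

ds = xr.Dataset({"x": ("time", [1, 2, 3])})  # stand-in for the real dataset
ds.attrs["pangeo-forge:inputs"] = input_urls  # provenance travels with the target
ds.to_zarr("target.zarr", mode="w")

# On a later run, read it back to decide what still needs appending:
previous_inputs = xr.open_zarr("target.zarr").attrs.get("pangeo-forge:inputs", [])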

> Perhaps we could store the list of input files directly in the target dataset metadata itself (attrs).

I was thinking about that, but what if the target is not in the Zarr format? Do other formats all have metadata that we can use for this purpose? I don't know COG that well, for instance, but I'm not sure it does.

The input hashing stuff introduced by @cisaacstern in #349 should make this doable. The user story for this is being tracked in pangeo-forge/user-stories#5.

Charles, would you be game for diving into this and developing a prototype?

In order for this to work entirely within pangeo-forge-recipes (without external information from the database/orchestration layer), we'll need to leave some metadata (i.e., recipe and/or pattern hashes) in the target store. Based on reading the thread, it seems like it could be okay to put this in .zmetadata?

That sounds reasonable to me. I think in general we should be injecting extra metadata into the datasets we write. Stuff like

{
    "pangeo-forge:version": 0.6.2,
    "pangeo-forge:recipe-hash": "a1b2c3",
    "pangeo-forge:input-hash": "..."
}
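
For an existing Zarr store, a sketch of injecting those keys so they land in .zmetadata (the hash values are placeholders, as above):

import zarr

group = zarr.open_group("target.zarr", mode="a")
group.attrs.update({
    "pangeo-forge:version": "0.6.2",
    "pangeo-forge:recipe-hash": "a1b2c3",
    "pangeo-forge:input-hash": "...",
})
# Rewrite the consolidated metadata so the new attrs appear in .zmetadata.
zarr.consolidate_metadata("target.zarr")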

Addressing that as a standalone issue would be a good place to start.

Per conversation at today's coordination meeting, people felt it would be simpler to have a single tracking issue for appending, so closing this and directing further discussion to #447.