google/orbax

Make GCS-style checkpointing configurable instead deciding it on is_gcs_path function output

qGentry opened this issue · 4 comments

Hi, i'm using s3fs which has very similar restrictions as GCS (no atomic moves, for example).
To enable GCS-style checkpointing I'm patching orbax which seems very ugly

def patch_with_gcs_style_checkpointing_for_path_prefix(prefix):
    def is_gcs_path(path: str) -> bool:
        return os.fspath(path).startswith(prefix)

    ocp.utils.is_gcs_path = is_gcs_path
    
patch_with_gcs_style_checkpointing_for_path_prefix("/mnt/s3")

Instead, I would prefer to have a parameter in ocp.CheckpointManagerOptions that would enable this checkpointing style.
Its default value may be False and in ocp.CheckpointManager.__init__ there could be self._gcs_style_checkpointing = gcs_style_checkpointing or is_gcs_path(directory) which would be totally backward compatible.

Curious for more context on why you want this. The only thing this really controls is whether the COMMIT_SUCCESS.txt file is written or atomic rename is used to indicate atomic save success, so are you just wanting COMMIT_SUCCESS.txt file in non-GCS contexts?

I'll add that atomic rename on GCS is supposed to be supported as of recently, so I was actually planning to get rid of COMMIT_SUCCESS.txt behavior entirely and replace everything with atomic rename.

Having two different ways of indicating whether a checkpoint is complete just creates compatibility problems and adds more complexity for us and users.

so are you just wanting COMMIT_SUCCESS.txt file in non-GCS contexts

Yes, exactly. Writing straight to final directory and marking writing as "finished" with COMMIT_SUCCESS.txt file.

Having two different ways of indicating whether a checkpoint is complete just creates compatibility problems and adds more complexity for us and users.

I understand your concern regarding complexity and compability, but there is a few extremely popular storage backends (which orbax might want to support in the future, i hope so at least) that doesn't have atomic rename/atomic rename, like Amazon S3 or any S3-compatible storages. And GCS right now.

If you're really commited to end with only one of behaviors (tmp dir with atomic rename or COMMIT_SUCCESS.txt file), is there any problem to stick to only COMMIT_SUCCESS.txt behavior? This would automatically support both storage types but i might be missing some in-depth context on why that's is not universal.

Hi, apologies for the delayed response. I'm sympathetic to this request - I think we can allow this behavior to be configurable. However, it isn't so easy to provide, unfortunately, because it's not just the finalize behavior that would need changing, it's also the logic to create tmp directories, check for finalized vs. unfinalized directories, and one place in CompositeCheckpointHandler. (Some other places in CheckpointManager itself, but that's less of a concern.)

So I'm not that keen on plumbing this option to all of those places, especially since we could imagine another FR along the lines of "enable marking commit success by writing to an external database". Probably the way we would want to go is to provide some sort of Storage class with methods for ensure_atomic_save and is_finalized/is_tmp, which is more modular and overridable.

Concern with going with COMMIT_SUCCESS.txt globally is for legacy compatibility - it becomes difficult to distinguish a checkpoint written successfully before the change from a checkpoint written unsuccessfully after the change. Probably it's not impossible though, so we can consider that approach.

I don't think we'll have time to prioritize this right away, but it'll be on our TODO list.

Ok, thank you!