drivendataorg/pandas-path

Support cloudpathlib

ejm714 opened this issue · 3 comments

Make pandas_path play nicely with cloudpathlib so we can do things like:

bucket = S3Path("s3://drivendata-competition-sfp-public/")
df['s3_path'] = bucket / df.filename.path.with_suffix('.tif')

Thoughts on the API?

We could add a .cloud_path (or .cloud or .cpath or something else) accessor to explicitly do CloudPath things.

Or we can overload the .path accessor to handle CloudPaths as well. We'd do this by trying to instantiate everything as a CloudPath first and, if that fails because the protocol isn't recognized (i.e., the value doesn't start with s3://, az://, etc.), instantiating it as a Path object.

In order not to muck with the underlying data types, we convert to Path when the accessor is hit (this is where we'd need some changes):

[Path(x) for x in obj.values]
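For illustration, here is a rough sketch of how that conversion step might try a cloud path class first and fall back to a local path. It avoids importing cloudpathlib so it runs on its own: `StandInCloudPath`, `CLOUD_PREFIXES`, and `to_path_object` are hypothetical names, and the stand-in raising `ValueError` on an unknown prefix is an assumption about how the real class would signal failure.

```python
from pathlib import PurePosixPath

# Assumed set of cloud URI schemes, for illustration only.
CLOUD_PREFIXES = ("s3://", "az://", "gs://")


class StandInCloudPath:
    """Hypothetical stand-in for a cloud path class; rejects unknown prefixes."""

    def __init__(self, uri):
        uri = str(uri)
        if not uri.startswith(CLOUD_PREFIXES):
            raise ValueError(f"not a cloud URI: {uri!r}")
        self.uri = uri

    def __str__(self):
        return self.uri


def to_path_object(value, cloud_cls=StandInCloudPath, local_cls=PurePosixPath):
    """Try the cloud class first; fall back to a local path if it refuses."""
    try:
        return cloud_cls(value)
    except ValueError:
        return local_cls(value)


# Replaces the plain `[Path(x) for x in obj.values]` conversion.
values = ["s3://bucket/a.tif", "data/b.tif"]
paths = [to_path_object(v) for v in values]
```

With a real cloud path class, the `except` clause would need to catch whatever exception that class raises for an unrecognized prefix.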

We also check isinstance(..., Path) in places; those checks should use os.PathLike instead, which both objects will support.

And then we return everything as a str so you can continue to do normal pandas things with a string type. This should be ok, but worth thinking about:

return res if not isinstance(res, Path) else str(res)
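A minimal sketch of what the os.PathLike version of that return could look like. `StandInCloudPath` and `to_plain` are hypothetical names; the stand-in only assumes that a cloud path object defines `__fspath__`, which is what makes it count as `os.PathLike`.

```python
import os
from pathlib import PurePosixPath


class StandInCloudPath:
    """Hypothetical cloud path; defining __fspath__ makes it os.PathLike."""

    def __init__(self, uri):
        self.uri = uri

    def __fspath__(self):
        return self.uri

    def __str__(self):
        return self.uri


def to_plain(res):
    """Return path-like results as str; pass everything else through unchanged."""
    return str(res) if isinstance(res, os.PathLike) else res


local_result = to_plain(PurePosixPath("a/b.tif"))
cloud_result = to_plain(StandInCloudPath("s3://bucket/a.tif"))
bool_result = to_plain(True)  # e.g. the result of an exists()-style call
```

The sketch uses `str(res)` rather than `os.fspath(res)` to match the existing return line; whether the two give the same string for a cloud path object is worth checking before relying on either.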

jayqi commented

Dependency-wise, it might be easier to think about if it went the other way: cloudpathlib has an (optional?) implementation of pandas-path.

Ah, something like from cloudpathlib.pandas import cloud_path to register the accessor?
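As a rough sketch of the register-on-import idea: only `pandas.api.extensions.register_series_accessor` is real pandas API here. The function `register_demo_accessor`, the accessor name `cloud_path`, and the `name` property are made up for illustration, and the string splitting stands in for whatever a real cloud path object would do.

```python
import pandas as pd
from pandas.api.extensions import register_series_accessor


def register_demo_accessor(accessor_name="cloud_path"):
    """Register a toy Series accessor under the given (hypothetical) name."""

    @register_series_accessor(accessor_name)
    class _DemoCloudAccessor:
        def __init__(self, series):
            self._series = series

        @property
        def name(self):
            # Final component of each URI, returned as plain strings.
            return self._series.map(
                lambda uri: uri.rstrip("/").rsplit("/", 1)[-1]
            )

    return _DemoCloudAccessor


# In an importable module this call would sit at module level, so that
# importing the module is what registers the accessor.
register_demo_accessor()

uris = pd.Series(["s3://bucket/a/img.tif", "s3://bucket/b.txt"])
names = uris.cloud_path.name
```

Keeping the registration inside the optional module means pandas only becomes a dependency for users who import it.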