fsspec/universal_pathlib

s3fs: translation of buffering=-1 to block_size causes read failures

kevinjacobs-delfi opened this issue · 2 comments

Versions:

  • fsspec==2024.9.0
  • s3fs==2024.9.0
  • universal_pathlib==0.2.5

This code in core.py is causing problems:

        if "buffering" in fsspec_kwargs:
            fsspec_kwargs.setdefault("block_size", fsspec_kwargs.pop("buffering"))

Translation from buffering -> block_size results in read failures for files opened with buffering=-1 (the default for opening files) when passed to functions like json.load for files on S3.

S3FileSystem.open does not specify what happens if a negative number of passed and I'll file a bug there too.

I don't have a public bucket to post a sample file, but I'll try to find one so that we can reproduce this behavior.

However, the obvious fix is something like this:

        if "buffering" in fsspec_kwargs:
            buffering = fsspec_kwargs.pop("buffering")       
            if buffering > 0:
                fsspec_kwargs.setdefault("block_size", buffering)
ap-- commented

Hi @kevinjacobs-delfi,

Thanks for reporting. This is a UPath bug, not an s3fs bug imho.
Would you be willing to make a PR with a test and your suggested fix?

Also, may I ask why you're passing the default buffering=-1 down to your UPath.open(...) calls? I'm curious how you discovered the bug.

Cheers,
Andreas

@ap-- I'll work on a PR today.

The context here is that I have a "smart file opener" that mimics the "open" API:

@contextmanager
def smartopen(
    file: FileOrPathLike,
    mode: OpenMode = 'r',
    buffering: int = -1,                  # <-------------
    encoding: str | None = None,
    errors: str | None = None,
    newline: str | None = None,
    *,
    compression: CompressionTypes | Literal['infer'] | None = 'infer',
    compresslevel: int = 9,
    hyphen: HyphenFileTypes | None = 'std',
) -> Iterator[IO[str] | IO[bytes]]:
    """Context manager for opening and closing compressed or uncompressed text or binary files.

So buffering=-1 is the default.