NVIDIA/NeMo-Curator

[FEA] Update read_json to work with s3 paths.

ayushdg opened this issue · 0 comments

Is your feature request related to a problem? Please describe.

Currently, the logic in both get_all_files_under and read_json assumes the files are present locally and doesn't work cleanly with S3. Since dask, cudf, and pandas already support reading from S3 via fsspec or s3fs, Curator should update the relevant methods to accept an s3:// path and read directly from S3.

Describe the solution you'd like
With the existing Curator scripts and examples, DocumentDataset.read_json and get_all_files_under should work with s3:// paths.
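
As a rough illustration of what a filesystem-agnostic get_all_files_under could look like, here is a minimal sketch built on fsspec. The names `get_protocol` and `list_files` are hypothetical and not part of Curator; the `.jsonl` suffix filter is also an assumption. Reading from s3:// this way requires s3fs to be installed alongside fsspec.

```python
from urllib.parse import urlparse


def get_protocol(path: str) -> str:
    """Return the URL scheme of `path`, or "file" for plain local paths."""
    scheme = urlparse(path).scheme
    # A one-letter scheme is a Windows drive letter (e.g. "C:\\data"), not a protocol.
    return scheme if len(scheme) > 1 else "file"


def list_files(root: str, suffix: str = ".jsonl") -> list[str]:
    """Hypothetical stand-in for get_all_files_under that accepts local or s3:// roots."""
    # Third-party import kept local so get_protocol works without fsspec installed.
    from fsspec.core import url_to_fs

    # url_to_fs picks the filesystem implementation from the protocol prefix
    # and returns the path with that prefix stripped.
    fs, stripped_root = url_to_fs(root)
    protocol = get_protocol(root)

    # fs.find walks the tree recursively and returns file paths without the protocol.
    files = sorted(p for p in fs.find(stripped_root) if p.endswith(suffix))
    if protocol == "file":
        return files
    # Re-attach the protocol so downstream readers can resolve the remote paths.
    return [f"{protocol}://{p}" for p in files]
```

The returned list could then be handed to any reader that understands fsspec URLs, for example `dask.dataframe.read_json(files, lines=True)`, so the same code path would serve local directories and S3 prefixes.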

Describe alternatives you've considered
The alternative is for users to read the datasets in with a different library or the dask API directly and then create a DocumentDataset from the result.
