[FEA] Update read_json to work with s3 paths.
ayushdg opened this issue · 0 comments
Is your feature request related to a problem? Please describe.
Currently, both get_all_files_under and read_json contain logic that assumes the files are present locally, so neither works cleanly with S3. Since dask, cudf, and pandas already support reading from S3 via fsspec or s3fs, Curator should update some of its methods to accept an s3:// path and read directly from S3.
Describe the solution you'd like
Existing Curator scripts and examples that use DocumentDataset.read_json and get_all_files_under should work with s3:// paths.
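
As a rough illustration of the idea (not Curator's actual implementation), a file-listing helper could branch on the path scheme and hand remote roots to fsspec. The names `is_remote_path` and `list_files_under` are hypothetical, and the remote branch assumes fsspec and s3fs are installed:

```python
import os
from typing import List
from urllib.parse import urlsplit


def is_remote_path(path: str) -> bool:
    """Return True for URL-style paths such as s3://bucket/prefix."""
    scheme = urlsplit(path).scheme
    # A one-letter scheme is a Windows drive letter (C:\data), not a URL.
    return len(scheme) > 1 and scheme != "file"


def list_files_under(root: str) -> List[str]:
    """Hypothetical stand-in for get_all_files_under that accepts s3:// roots."""
    if is_remote_path(root):
        # Assumption: fsspec (plus s3fs for s3://) is available. Imported
        # lazily so the local branch needs no extra dependency.
        from fsspec.core import url_to_fs

        fs, fs_path = url_to_fs(root)
        scheme = urlsplit(root).scheme
        # fs.find walks the prefix recursively and returns paths without
        # the protocol, so it is added back here.
        return sorted(f"{scheme}://{p}" for p in fs.find(fs_path))

    # Local branch: plain recursive walk.
    files = []
    for dirpath, _, filenames in os.walk(root):
        files.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(files)
```

With a helper shaped like this, the same call would cover both `list_files_under("/data/jsonl")` and `list_files_under("s3://bucket/prefix")`, and the returned s3:// URLs could be passed on to a reader that already understands them.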
Describe alternatives you've considered
The alternative is for users to read the datasets in directly with a different library or the dask API and then create a DocumentDataset from the result.
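
A minimal sketch of that workaround, assuming dask (with s3fs) and NeMo Curator are installed and that DocumentDataset can be constructed from a dask DataFrame. The function name `read_jsonl_as_dataset` and the bucket path in the usage note are hypothetical:

```python
from typing import Any, Dict, Optional


def read_jsonl_as_dataset(path: str, storage_options: Optional[Dict[str, Any]] = None):
    """Read JSON Lines files with dask and wrap the result for Curator.

    `path` may be a local glob or an s3:// URL; dask resolves remote
    URLs through fsspec.
    """
    # Imported lazily so this sketch can be defined without the libraries.
    import dask.dataframe as dd
    from nemo_curator.datasets import DocumentDataset

    # lines=True tells dask each line is one JSON record (JSONL).
    ddf = dd.read_json(path, lines=True, storage_options=storage_options)
    # Assumption: DocumentDataset takes the dask DataFrame directly.
    return DocumentDataset(ddf)
```

Usage would look like `read_jsonl_as_dataset("s3://my-bucket/data/*.jsonl")`, with credentials picked up from the environment or passed through `storage_options`. This bypasses DocumentDataset.read_json entirely, so anything that method does beyond loading the records would have to be reproduced by hand.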