Given JSON input is too large when using "cudf" backend
Closed this issue · 3 comments
Describe the bug
I'm trying to load some data from jsonl files using the cudf backend, but it fails with `ValueError: Metadata inference failed in read_single_partition.`, whose underlying cause is `RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')`.
Steps/Code to reproduce bug
```python
import numpy as np

from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under

files = get_all_files_paths_under("/path/data")  # ['/path/data/file1.jsonl', ... , '/path/data/fileN.jsonl']
meta = {
    'field1': np.dtype('int64'),
    'fieldN': np.dtype('O')
}
dataset = DocumentDataset.read_json(input_files=files, backend="cudf", add_filename=True, input_meta=meta)
```
Expected behavior
I expect the data to load just as it does with the pandas backend, only faster, taking advantage of GPU parallelism.
Environment overview (please complete the following information)
- Environment location: enroot + pyxis on slurm cluster
- Method of NeMo-Curator install: NeMo Framework docker image, 24.05llama3.1, NeMo Curator 0.4.0
Additional logs
```
ValueError: Metadata inference failed in `read_single_partition`.

You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')

Traceback:
---------
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/core.py", line 7175, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
    df = read_f(file, **read_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/io/json.py", line 96, in read_json
    df = libjson.read_json(
  File "json.pyx", line 45, in cudf._lib.json.read_json
  File "json.pyx", line 137, in cudf._lib.json.read_json
```
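Since the failure happens per file inside `read_single_partition`, one quick sanity check (not from the thread itself) is to list which input files exceed the ~2.1 GB per-file limit mentioned in the discussion below. A minimal stdlib sketch, assuming the limit is INT32_MAX bytes:

```python
import os

# Assumption: "too large" means a single JSON input over ~2.1 GB (INT32_MAX bytes)
# in the affected cuDF builds.
CUDF_JSON_LIMIT = 2**31 - 1

def files_over_limit(data_dir, limit=CUDF_JSON_LIMIT):
    """Return (path, size) for every .jsonl file under data_dir larger than `limit` bytes."""
    too_big = []
    for root, _, names in os.walk(data_dir):
        for name in names:
            if name.endswith(".jsonl"):
                path = os.path.join(root, name)
                size = os.path.getsize(path)
                if size > limit:
                    too_big.append((path, size))
    return too_big
```

Any file this reports is a candidate for the sharding workaround suggested below.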
Paraphrasing some points after discussing with @ayushdg: cuDF had an issue where it could not read individual jsonl files larger than 2.1 GB. See: rapidsai/cudf#16138.
cuDF fixed this in rapidsai/cudf#16162, which should be part of the release next week. A couple of next steps:
- For the time being, we recommend splitting the large file into smaller chunks using NeMo Curator's make_data_shards functionality.
- We'll test whether the cuDF fix works as expected and suggest using the latest nightly containers once a new version of cuDF is out in a week. We'll update this issue then.
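For anyone who can't use make_data_shards directly, the same idea can be sketched with the stdlib: split one large jsonl file into shards under a byte cap without ever breaking a JSON line. This is a generic illustration, not the make_data_shards implementation:

```python
import os

def split_jsonl(path, out_dir, max_bytes=2 * 1024**3):
    """Split one .jsonl file into shards of at most `max_bytes` bytes,
    keeping every JSON line intact (a line longer than the cap gets its own shard)."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    shard_idx, written, out = 0, 0, None
    shards = []
    with open(path, "rb") as src:
        for line in src:
            # Start a new shard when the next line would overflow the cap.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                shard_path = os.path.join(out_dir, f"{base}-{shard_idx:05d}.jsonl")
                out = open(shard_path, "wb")
                shards.append(shard_path)
                shard_idx += 1
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return shards
```

The resulting shard paths can then be fed to DocumentDataset.read_json in place of the single oversized file.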
Thank you @ryantwolf! For the time being I'll try to use make_data_shards and wait for the dev build next week!
Should be working now. Let us know if you run into this issue again.