NVIDIA/NeMo-Curator

Given JSON input is too large when using "cudf" backend


Describe the bug

I'm trying to load some data from JSONL files using the cudf backend, but I get the errors "ValueError: Metadata inference failed in read_single_partition." and "RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')".

Steps/Code to reproduce bug

import numpy as np
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under

files = get_all_files_paths_under("/path/data")  # ['/path/data/file1.jsonl', ..., '/path/data/fileN.jsonl']

# Field names and dtypes of the JSONL records
meta = {
    'field1': np.dtype('int64'),
    'fieldN': np.dtype('O')
}

dataset = DocumentDataset.read_json(input_files=files, backend="cudf", add_filename=True, input_meta=meta)

Expected behavior

I expect the data to load just as it does with the pandas backend, but faster, taking advantage of GPU parallelism.
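
For reference, the equivalent CPU call does load the data; this is a minimal sketch reusing files and meta from the snippet above (dataset_cpu is just an illustrative name):

dataset_cpu = DocumentDataset.read_json(input_files=files, backend="pandas", add_filename=True, input_meta=meta)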

Environment overview (please complete the following information)

  • Environment location: enroot + pyxis on slurm cluster
  • Method of NeMo-Curator install: NeMo Framework Docker image (24.05.llama3.1), NeMo Curator 0.4.0

Additional logs

ValueError: Metadata inference failed in `read_single_partition`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
RuntimeError('CUDF failure at: /__w/cudf/cudf/cpp/src/io/json/nested_json_gpu.cu:86: Given JSON input is too large')

Traceback:
---------
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/utils.py", line 195, in raise_on_meta_error
    yield
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/core.py", line 7175, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/usr/local/lib/python3.10/dist-packages/nemo_curator/utils/distributed_utils.py", line 267, in read_single_partition
    df = read_f(file, **read_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/cudf/io/json.py", line 96, in read_json
    df = libjson.read_json(
  File "json.pyx", line 45, in cudf._lib.json.read_json
  File "json.pyx", line 137, in cudf._lib.json.read_json

Paraphrasing some points after discussing with @ayushdg: cuDF had an issue where it could not read individual JSONL files larger than 2.1 GB. See rapidsai/cudf#16138.
cuDF fixed this in rapidsai/cudf#16162, which should be part of the release next week. A couple of steps:

  1. For the time being, we recommend splitting the large file into smaller chunks using NeMo Curator's make_data_shards functionality (see the sketch after this list).
  2. We'll test whether the cuDF fix works as expected and suggest using the latest nightly containers once the new cuDF version is out next week. We'll update this issue then.
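
For step 1, here is a minimal sketch of the resharding workaround. It assumes the reshard_jsonl helper in nemo_curator.utils.file_utils (the function behind the make_data_shards CLI) and its output_file_size argument; double-check the exact signature against your installed NeMo Curator version.

from nemo_curator.utils.file_utils import reshard_jsonl

# Split each large .jsonl under /path/data into ~100 MB shards so every
# individual file stays well under cuDF's 2.1 GB JSON reader limit.
# (Parameter names here are assumptions; verify them in your install.)
reshard_jsonl(
    input_dir="/path/data",
    output_dir="/path/data_sharded",
    output_file_size="100M",
)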

Thank you @ryantwolf! For the time being I'll try to use make_data_shards and wait for the dev build next week!

Should be working now. Let us know if you run into this issue again.
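
If you want to confirm you've picked up the fix before retrying the cudf backend, a quick sanity check of the installed cuDF version (it needs to be a build that includes rapidsai/cudf#16162):

import cudf
print(cudf.__version__)  # should be a release or nightly containing rapidsai/cudf#16162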