NVIDIA/NeMo-Curator

[BUG] OOM errors while running Deduplication

vipulraheja opened this issue · 1 comment

Describe the bug
While running the demo scripts on Common Crawl (CC-MAIN-2020-50) through the deduplication steps, the gpu_compute_minhashes and minhash_buckets stages crash non-deterministically: the Dask workers fail to start. I have confirmed that the GPUs were idle before launching these steps. Specifically, I observe multiple processes being spawned on a single GPU, which eventually goes OOM, while the other GPUs stay idle.
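
For reference, a check along these lines (using pynvml from the nvidia-ml-py package, which is not part of NeMo-Curator) is how the idle state can be confirmed right before launching a stage:

# Sketch: report per-GPU memory use and compute processes before launching
# a dedup stage. Assumes nvidia-ml-py (pynvml) is installed.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used, "
          f"{len(procs)} compute process(es)")
pynvml.nvmlShutdown()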

Here is the error log for the gpu_compute_minhashes stage (minhash_buckets fails similarly):

(venv) vipul.raheja@vr-lm-1:~/NeMo-Curator$ gpu_compute_minhashes   --input-data-dirs /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_id   --output-minhash-dir /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash   --input-json-text-field text   --input-json-id-field adlr_id   --minhash-length 256   --char-ngram 5   --hash-bytes 4   --seed 42   --log-dir ./log/

2024-05-08 18:47:16,603 - distributed.worker - ERROR - std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:112: cudaErrorMemoryAllocation out of memory
Traceback (most recent call last):
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 64, in setup
    mr = rmm.mr.CudaAsyncMemoryResource(
  File "memory_resource.pyx", line 338, in rmm._lib.memory_resource.CudaAsyncMemoryResource.__cinit__
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:112: cudaErrorMemoryAllocation out of memory
2024-05-08 18:47:16,606 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 64, in setup
    mr = rmm.mr.CudaAsyncMemoryResource(
  File "memory_resource.pyx", line 338, in rmm._lib.memory_resource.CudaAsyncMemoryResource.__cinit__
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:112: cudaErrorMemoryAllocation out of memory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
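
The failing call is the RMM async memory resource setup during Dask-CUDA worker startup (dask_cuda/plugins.py above). For what it's worth, a standalone cluster start along these lines (dask-cuda only, no NeMo-Curator; rmm_async=True is an assumption that mirrors the code path in the traceback rather than the scripts' exact flags) shows whether the workers come up cleanly and which GPU each one is pinned to:

# Sketch: bring up a Dask-CUDA cluster with one worker per GPU and the RMM
# async allocator, then print which device each worker was pinned to.
import os
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

def visible_devices():
    # dask-cuda pins each worker to a GPU via CUDA_VISIBLE_DEVICES
    return os.environ.get("CUDA_VISIBLE_DEVICES")

if __name__ == "__main__":
    cluster = LocalCUDACluster(rmm_async=True)  # same RMM setup path as in the traceback
    client = Client(cluster)
    # Each worker should report a distinct leading device; several workers
    # sharing one device would match the OOM pattern described above.
    print(client.run(visible_devices))
    client.close()
    cluster.close()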

Steps/Code to reproduce bug
I am trying to follow the commands in the READMEs for all the stages. Here is the full sequence of commands I ran:

python3.10 nemo_curator/scripts/get_common_crawl_urls.py \
	--starting-snapshot="2020-50" \
	--ending-snapshot="2020-50" \
	--output-warc-url-file=./cc_urls/warc_urls_cc_2020_50.txt

download_and_extract \
	--input-url-file=./cc_urls/warc_urls_cc_2020_50.txt \
	--builder-config-file=./config/cc_warc_builder.yaml \
        --output-json-dir=/data/commoncrawl/CC-MAIN-2020-50/json

filter_documents \
	--filter-config-file=./config/fasttext_langid.yaml \
	--log-scores \
	--log-dir=./log/lang_id \
	--input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/jsonl

separate_by_metadata \
    --input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/jsonl \
    --input-metadata-field=language \
    --output-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs \
    --output-metadata-distribution=./data/lang_distro.json
    
text_cleaning \
    --input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH \
    --output-clean-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned

add_id \
    --input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned \
    --output-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_id
   
gpu_compute_minhashes \
  --input-data-dirs /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_id \
  --output-minhash-dir /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash \
  --input-json-text-field text \
  --input-json-id-field adlr_id \
  --minhash-length 256 \
  --char-ngram 5 \
  --hash-bytes 4 \
  --seed 42 \
  --log-dir ./
  
minhash_buckets \
	--input-data-dirs /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash \
	--output-bucket-dir /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash_bucket \
	--buckets-per-shuffle 2 \
	--input-json-id-field adlr_id \
	--log-dir ./log/

Expected behavior
In the gpu_compute_minhashes stage: the job completes without errors and Parquet files (minhashes) are created.
In the minhash_buckets stage: the job completes without errors and Parquet files (buckets) are created.
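
A sanity check along these lines (pyarrow used only for inspection; the path is the --output-minhash-dir from the commands above) would confirm the output once a stage completes:

# Sketch: confirm a stage produced non-empty Parquet output.
import glob
import pyarrow.parquet as pq

out_dir = "/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash"
files = glob.glob(f"{out_dir}/**/*.parquet", recursive=True)
print(f"{len(files)} Parquet file(s)")
for path in files[:3]:
    meta = pq.read_metadata(path)
    print(path, meta.num_rows, "rows,", meta.num_columns, "columns")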

Environment overview (please complete the following information)

  • Environment location: AWS (A100 instance -- p4de.24xlarge)
  • Method of NeMo-Curator install: From source: Cloned the repo and ran pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"

Environment details

  • OS version: Ubuntu 20.04.6 LTS
  • Dask version: dask: 2024.1.1, dask-cuda: 24.4.0, dask-mpi: 2022.4.0
  • Python version: 3.10.14

Thanks for raising this. The root cause seems to be that importing spaCy before the GPU cluster is created establishes a primary CUDA context on one of the GPUs, which then interferes with cluster creation.

For the time being we've rearranged the imports in #61 so that they only happen later on, and I've also opened #64 to track a longer-term solution.
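
To illustrate the pattern (this is a sketch, not the actual change in #61): any GPU-touching import, spaCy included, should be deferred until after the cluster exists so it cannot create a primary context on the client process.

# Sketch of the workaround described above, not the actual patch in #61:
# create the Dask-CUDA cluster before anything that might initialize CUDA
# on the client process, and defer spacy into the function that needs it.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

def build_client():
    # Workers create their own CUDA contexts here; at this point nothing on
    # the client process should have touched a GPU.
    cluster = LocalCUDACluster()
    return Client(cluster)

def tokenize(texts):
    # Deferred import: spacy (and anything it pulls in) only loads after the
    # cluster is up, so it cannot grab a GPU during module import.
    import spacy
    nlp = spacy.blank("en")
    return [[token.text for token in nlp(text)] for text in texts]

if __name__ == "__main__":
    client = build_client()
    print(tokenize(["NeMo Curator deduplication example."]))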