[BUG] OOM errors while running Deduplication
vipulraheja opened this issue · 1 comment
Describe the bug
While trying to run the demo scripts on Common Crawl (CC-MAIN-2020-50) through the deduplication steps, the gpu_compute_minhashes and minhash_buckets steps crash non-deterministically: the workers fail to start. I have confirmed that the GPUs were idle before launching these steps. Specifically, I observe multiple processes getting spawned on a single GPU, which eventually goes OOM while the other GPUs stay idle.
Here is the error log for the gpu_compute_minhashes stage (minhash_buckets fails similarly):
(venv) vipul.raheja@vr-lm-1:~/NeMo-Curator$ gpu_compute_minhashes --input-data-dirs /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_id --output-minhash-dir /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash --input-json-text-field text --input-json-id-field adlr_id --minhash-length 256 --char-ngram 5 --hash-bytes 4 --seed 42 --log-dir ./log/
2024-05-08 18:47:16,603 - distributed.worker - ERROR - std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:112: cudaErrorMemoryAllocation out of memory
Traceback (most recent call last):
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 64, in setup
mr = rmm.mr.CudaAsyncMemoryResource(
File "memory_resource.pyx", line 338, in rmm._lib.memory_resource.CudaAsyncMemoryResource.__cinit__
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:112: cudaErrorMemoryAllocation out of memory
2024-05-08 18:47:16,606 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/usr/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
raise plugins_exceptions[0]
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 64, in setup
mr = rmm.mr.CudaAsyncMemoryResource(
File "memory_resource.pyx", line 338, in rmm._lib.memory_resource.CudaAsyncMemoryResource.__cinit__
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /__w/rmm/rmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:112: cudaErrorMemoryAllocation out of memory
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
async with worker:
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
await self
File "/home/vipul.raheja/NeMo-Curator/venv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
Steps/Code to reproduce bug
I am trying to follow the commands in the READMEs for all the stages. Here is the full sequence of commands I ran:
python3.10 nemo_curator/scripts/get_common_crawl_urls.py \
--starting-snapshot="2020-50" \
--ending-snapshot="2020-50" \
--output-warc-url-file=./cc_urls/warc_urls_cc_2020_50.txt
download_and_extract \
--input-url-file=./cc_urls/warc_urls_cc_2020_50.txt \
--builder-config-file=./config/cc_warc_builder.yaml \
--output-json-dir=/data/commoncrawl/CC-MAIN-2020-50/json
filter_documents \
--filter-config-file=./config/fasttext_langid.yaml \
--log-scores \
--log-dir=./log/lang_id \
--input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/jsonl
separate_by_metadata \
--input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/jsonl \
--input-metadata-field=language \
--output-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs \
--output-metadata-distribution=./data/lang_distro.json
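For readers unfamiliar with this stage, here is a minimal CPU-only sketch of what separate_by_metadata does conceptually: route each record into a per-value shard and tally the distribution that ends up in the --output-metadata-distribution file. The record layout and field name are illustrative assumptions, not NeMo-Curator's implementation.

```python
from collections import defaultdict

def separate_by_metadata(records, field="language"):
    # Group records by the value of a metadata field and count the
    # resulting distribution (conceptual stand-in for the CLI stage).
    shards = defaultdict(list)
    for rec in records:
        shards[rec.get(field, "UNKNOWN")].append(rec)
    distribution = {value: len(recs) for value, recs in shards.items()}
    return shards, distribution
```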
text_cleaning \
--input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH \
--output-clean-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned
add_id \
--input-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned \
--output-data-dir=/data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_id
gpu_compute_minhashes \
--input-data-dirs /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_id \
--output-minhash-dir /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash \
--input-json-text-field text \
--input-json-id-field adlr_id \
--minhash-length 256 \
--char-ngram 5 \
--hash-bytes 4 \
--seed 42 \
--log-dir ./
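For context, the flags above map onto the standard MinHash construction: 256 hash functions (--minhash-length), 5-character shingles (--char-ngram), 4-byte hash values (--hash-bytes), and a fixed seed. A minimal pure-Python sketch of that idea follows; the specific hashing scheme here is an illustrative assumption, not what gpu_compute_minhashes actually uses on GPU.

```python
import hashlib

def char_ngrams(text, n=5):
    # --char-ngram 5: shingle the document into overlapping 5-grams
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(text, num_hashes=256, ngram=5, seed=42):
    # --minhash-length 256 / --hash-bytes 4 / --seed 42: one 4-byte
    # min-hash per seeded hash function
    grams = char_ngrams(text, ngram)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(
                    g.encode(), digest_size=4,
                    salt=(seed + i).to_bytes(8, "big"),
                ).digest(),
                "big",
            )
            for g in grams
        )
        for i in range(num_hashes)
    ]

def jaccard_estimate(sig_a, sig_b):
    # Fraction of matching signature slots estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```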
minhash_buckets \
--input-data-dirs /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash \
--output-bucket-dir /data/commoncrawl/CC-MAIN-2020-50/langs/ENGLISH_cleaned_minhash_bucket \
--buckets-per-shuffle 2 \
--input-json-id-field adlr_id \
--log-dir ./log/
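Likewise, a minimal sketch of the banding step that minhash_buckets performs: signatures are split into bands, and any documents identical across a whole band fall into the same candidate bucket. The band count and data layout here are assumptions for illustration, not NeMo-Curator's internals.

```python
from collections import defaultdict

def lsh_buckets(signatures, num_bands=32):
    # signatures: {doc_id: [int, ...]}, all signatures the same length
    rows = len(next(iter(signatures.values()))) // num_bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(num_bands):
            # Documents identical in every row of a band share a bucket key
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    # Only buckets with more than one document yield duplicate candidates
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}
```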
Expected behavior
In the gpu_compute_minhashes stage: the job completes without errors and Parquet files are created.
In the minhash_buckets stage: the job completes without errors and Parquet files (buckets) are created.
Environment overview (please complete the following information)
- Environment location: AWS (A100 instance -- p4de.24xlarge)
- Method of NeMo-Curator install: From source: Cloned the repo and ran
pip install --extra-index-url https://pypi.nvidia.com ".[cuda12x]"
Environment details
- OS version: Ubuntu 20.04.6 LTS
- Dask version:
  - dask: 2024.1.1
  - dask-cuda: 24.4.0
  - dask-mpi: 2022.4.0
- Python version: 3.10.14
Thanks for raising. The root cause seems to be that importing spacy before creating a GPU cluster creates a primary CUDA context on one of the GPUs, which interferes with cluster creation later on.
For the time being we've rearranged the imports in #61 so that they only happen later on, but I've also opened #64 to track a longer-term solution.
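The workaround in #61 amounts to the classic deferred-import pattern: keep heavy, device-touching imports out of module scope so nothing can grab a CUDA context before the Dask-CUDA cluster exists. A minimal sketch of the pattern, using a harmless stdlib module as a stand-in for spacy:

```python
import importlib
import sys

def lazy_import(name):
    # Load the module on first use instead of at module import time; in the
    # real fix this keeps `import spacy` (and any CUDA initialization it
    # triggers) from running during cluster startup.
    if name in sys.modules:
        return sys.modules[name]
    return importlib.import_module(name)

def worker_task():
    # Inside the task, the import happens only on the worker that needs it,
    # after GPUs have already been assigned. `colorsys` is a stand-in here.
    colorsys = lazy_import("colorsys")
    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)
```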