find_pii_and_deidentify example fails
randerzander opened this issue · 1 comment
randerzander commented
I'm trying to run the PII example here.
# for gpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py --device gpu
# for cpu
python /repos/NeMo-Curator/examples/find_pii_and_deidentify.py
On CPU, I get memory warnings and eventual worker deaths without producing output:
2024-05-28 14:41:18,511 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:19 INFO:Loaded recognizer: EmailRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: PhoneRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: SpacyRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: UsSsnRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: CreditCardRecognizer
2024-05-28 14:41:19 INFO:Loaded recognizer: IpRecognizer
2024-05-28 14:41:19 WARNING:model_to_presidio_entity_mapping is missing from configuration, using default
2024-05-28 14:41:19 WARNING:low_score_entity_names is missing from configuration, using default
2024-05-28 14:41:22,407 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 3.68 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:23,165 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker. Process memory: 4.27 GiB -- Worker memory limit: 5.25 GiB
2024-05-28 14:41:24,134 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:33953 (pid=14243) exceeded 95% memory budget. Restarting...
2024-05-28 14:41:24,471 - distributed.scheduler - ERROR - Task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) marked as failed because 4 workers died while trying to run it
2024-05-28 14:41:24,472 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:33953' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('frompandas-f7a591031e0ada9d2c8cba1c8468dd66', 0)} (stimulus_id='handle-worker-cleanup-1716907284.4715889')
Traceback (most recent call last):
File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
console_script()
File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 48, in console_script
modified_dataset.df.to_json("output_files/*.jsonl", lines=True, orient="records")
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask_expr/_collection.py", line 2380, in to_j
son
return to_json(self, filename, *args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/dataframe/io/json.py", line 96, in to_js
on
return list(dask_compute(*parts, **compute_kwargs))
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/dask/base.py", line 661, in compute
results = schedule(dsk, keys, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/distributed/client.py", line 2232, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task ('getitem-modify_document-assign-64f0e480e2b64dd94f34c05c2de0918e', 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:33953. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
2024-05-28 14:41:24,778 - distributed.nanny - WARNING - Restarting worker
2024-05-28 14:41:24,959 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
There's a longer trace, but it's just more worker restarts before the cluster shuts down.
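The 5.25 GiB per-worker limit in the log appears to come from the default local cluster setup, which splits system memory evenly across workers. As a minimal sketch (the worker count and memory limit below are assumed values, not part of the example), one could start the CPU cluster with a larger per-worker budget and attach a client before building the pipeline:

# Sketch: start a local Dask CPU cluster with an explicit per-worker memory
# budget larger than the ~5.25 GiB default seen in the log above.
# n_workers and memory_limit are illustrative assumptions.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=2,            # fewer workers -> more memory per worker
    threads_per_worker=1,
    memory_limit="16GiB",   # raise the per-worker budget (assumed value)
)
client = Client(cluster)
print(client.dashboard_link)  # watch worker memory while the example runs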
In GPU mode, it runs for a while before failing with a PyTorch error:
python examples/find_pii_and_deidentify.py --device gpu
Traceback (most recent call last):
File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 52, in <module>
console_script()
File "/repos/NeMo-Curator/examples/find_pii_and_deidentify.py", line 30, in console_script
_ = get_client(**parse_client_args(arguments))
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 150, in get_client
return start_dask_gpu_local_cluster(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 75, in start_dask_gpu_local_cluster
_set_torch_to_use_rmm()
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/utils/distributed_utils.py", line 175, in _set_torch_to_use_rmm
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/torch/cuda/memory.py", line 905, in change_current_allocator
torch._C._cuda_changeCurrentAllocator(allocator.allocator())
AttributeError: module 'torch._C' has no attribute '_cuda_changeCurrentAllocator'
ayushdg commented
For the GPU case, the error seems to indicate that torch could not change the CUDA allocator.
One reason this can happen is when only the CPU flavor of Torch is installed, without GPU support.
Is it possible to check whether the following works in your environment:
import torch
torch.cuda.is_available()
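If that returns False, a quick follow-up check (plain PyTorch, nothing NeMo-Curator-specific) can help confirm whether the installed wheel is CPU-only:

import torch

# A CPU-only build typically reports a "+cpu" version suffix,
# torch.version.cuda as None, and torch.cuda.is_available() as False.
print(torch.__version__)         # e.g. '2.x.y+cpu' for a CPU-only wheel
print(torch.version.cuda)        # None on CPU-only builds
print(torch.cuda.is_available())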