ChenghaoMou/text-dedup

FileNotFoundError when running the `Finding clusters` step.

Ox0400 opened this issue · 2 comments

Ox0400 commented

The script crashes when running the `Finding clusters` step.

I found that the issue is caused by `new_fingerprint=str(random.getrandbits(128))`.

Iterating MinHashes...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 392/392 [06:50<00:00,  1.05s/it]
Clustering...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:28<00:00,  1.14s/it]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1328, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3507, in _map_single
    if update_data:
FileNotFoundError: [Errno 2] No such file or directory: '/data/sample-txt/cache-185154854570279270480969661689785176288_00010_of_00048.arrow'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/work/minhash.py", line 192, in <module>
    with timer("Total"):
  File "/usr/local/lib/python3.10/site-packages/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/var/work/minhash.py", line 266, in <module>
    with timer("Filtering"):
  File "/usr/local/lib/python3.10/site-packages/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/var/work/minhash.py", line 270, in <module>
    ds = ds.map(
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 580, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 545, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3180, in map
    for rank, done, content in iflatmap_unordered(
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/data/sample-txt/cache-185154854570279270480969661689785176288_00010_of_00048.arrow'

Thanks for reporting!

TL;DR: the random fingerprint is added here on purpose, to skip caching for the sake of speed.

Not sure whether you changed any of the settings for this map call, but the new random fingerprint is set manually here for a reason: when processing a large-scale dataset, fingerprinting the union-find object (which pickles the object behind the scenes) takes a really long time, long enough that this map call would take much longer than simply redoing the calculation from the previous step.
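To make the trade-off concrete, here is a minimal sketch using only the standard library. It is not the `datasets` implementation: `UnionFind`, `content_fingerprint`, and `random_fingerprint` are hypothetical names, and `pickle` plus `hashlib` stand in for whatever serialization and hashing the library actually uses. It only illustrates that a content-based fingerprint has to serialize the whole structure, while a random one costs nothing but changes on every call.

```python
import hashlib
import pickle
import random


class UnionFind:
    """Toy stand-in for the union-find object used by the map function."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Walk up to the root, halving the path as we go.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        # Keep the smaller id as the cluster representative.
        self.parent[max(rx, ry)] = min(rx, ry)


def content_fingerprint(state):
    # Hypothetical content-based fingerprint: serialize the full state,
    # then hash the bytes. The cost grows with the size of the state.
    return hashlib.sha256(pickle.dumps(state)).hexdigest()


def random_fingerprint():
    # The approach from the issue: no serialization and constant cost,
    # but a different value on every call.
    return str(random.getrandbits(128))


uf = UnionFind()
for i in range(0, 10_000, 2):
    uf.union(i, i + 1)

# Same state, same fingerprint: this is what makes cache reuse possible.
stable = content_fingerprint(uf.parent) == content_fingerprint(uf.parent)

# Two random fingerprints practically never match, so nothing is reused.
fresh = random_fingerprint() != random_fingerprint()
```

The long number in the cache file name from the traceback has the same shape as the output of `random_fingerprint()`. The price of the shortcut is that a random value can never match an earlier run, so the result of this map call is never picked up from the cache.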

If there was no change to the map function and the error still persists, do let me know.

Let me know if the last message did not answer your question. I will close this issue for now.