FileNotFoundError when running the `Finding clusters` step.
Ox0400 opened this issue · 2 comments
The script crashes when running the `Finding clusters` step.
I found that the issue is caused by `new_fingerprint=str(random.getrandbits(128))`.
Iterating MinHashes...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 392/392 [06:50<00:00, 1.05s/it]
Clustering...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:28<00:00, 1.14s/it]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1328, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3507, in _map_single
    if update_data:
FileNotFoundError: [Errno 2] No such file or directory: '/data/sample-txt/cache-185154854570279270480969661689785176288_00010_of_00048.arrow'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/var/work/minhash.py", line 192, in <module>
    with timer("Total"):
  File "/usr/local/lib/python3.10/site-packages/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/var/work/minhash.py", line 266, in <module>
    with timer("Filtering"):
  File "/usr/local/lib/python3.10/site-packages/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/var/work/minhash.py", line 270, in <module>
    ds = ds.map(
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 580, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 545, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3180, in map
    for rank, done, content in iflatmap_unordered(
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1354, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/usr/local/lib/python3.10/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
FileNotFoundError: [Errno 2] No such file or directory: '/data/sample-txt/cache-185154854570279270480969661689785176288_00010_of_00048.arrow'
Thanks for reporting!
TL;DR: the random fingerprint is added here on purpose to bypass caching, which makes this step faster.
I am not sure whether you changed any settings for this `map` call, but the new, random fingerprint is set manually here for a reason. When processing a large-scale dataset, fingerprinting the union-find object (which pickles the object behind the scenes) takes a very long time, so the `map` call ends up taking much longer than simply redoing the calculation from the previous step.
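To make the trade-off concrete, here is a minimal standard-library sketch. It is not how `datasets` computes fingerprints internally: `content_fingerprint` is a simplified stand-in that pickles and hashes an object, `shard_cache_name` is a hypothetical helper that only mirrors the file name pattern visible in the traceback above, and the `parents` dict is made-up data standing in for a union-find structure.

```python
import hashlib
import pickle
import random


def content_fingerprint(obj) -> str:
    """Deterministic fingerprint: serialize the whole object, then hash the bytes.

    The cost grows with the size of the object, which is the slow path
    described above. Simplified stand-in, not the datasets implementation.
    """
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return hashlib.sha256(payload).hexdigest()[:32]


def random_fingerprint() -> str:
    """Cheap fingerprint using the same expression quoted in the issue."""
    return str(random.getrandbits(128))


def shard_cache_name(fingerprint: str, rank: int, num_proc: int) -> str:
    """Hypothetical helper mirroring the cache file name seen in the traceback."""
    return f"cache-{fingerprint}_{rank:05d}_of_{num_proc:05d}.arrow"


# Made-up stand-in for a union-find parent table.
parents = {i: i // 2 for i in range(100_000)}

# Identical content yields an identical fingerprint, so a cached result
# could be found again, but the whole object has to be serialized first.
stable_a = content_fingerprint(parents)
stable_b = content_fingerprint(dict(parents))

# A random fingerprint costs almost nothing, but it differs on every call,
# so the derived cache file name never matches one from an earlier run.
name_a = shard_cache_name(random_fingerprint(), rank=10, num_proc=48)
name_b = shard_cache_name(random_fingerprint(), rank=10, num_proc=48)
```

With a random fingerprint, every run writes to freshly named cache files, so an earlier run's cache is never reused; that is the intended effect of skipping the expensive serialization.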
If there was no change to the map function and the error still persists, do let me know.
Let me know if the last message did not answer your question. I will close this issue for now.