ChenghaoMou/text-dedup

suffix array "No such file or directory" exception

yusufcakmakk opened this issue · 2 comments

Hi,

I read the RefinedWeb dataset paper for the Falcon model and I want to follow the same approach. To do this, I preprocessed my data with some rules and then ran the MinHash near-deduplication from your repo. It reduced the dataset as follows:

[07/15/23 13:45:54] INFO     Loading                         : 496.09s                                        minhash.py:330
                    INFO     MinHashing                      : 800.25s                                        minhash.py:330
                    INFO     Clustering                      : 554.21s                                        minhash.py:330
                    INFO     Filtering                       : 387.45s                                        minhash.py:330
                    INFO     Saving                          : 50.34s                                         minhash.py:330
                    INFO     Total                           : 2288.34s                                       minhash.py:330
                    INFO     Before                          : 6316662                                        minhash.py:332
                    INFO     After                           : 4530959                                        minhash.py:333

As suggested in the RefinedWeb paper, the next step is exact deduplication with suffix arrays. For that, I want to use Suffix Array Substring Exact Deduplication with the following command:

python -m text_dedup.suffix_array \
    --path "output/minhash/oscar_tr_dedup" \
    --local \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_tr_dedup" \
    --column "filtered_content" \
    --google_repo_path "/data/dedup_for_llm/text-dedup/deduplicate-text-datasets"

Note: I modified the code here to add a --local option, like the one here.

After running the text_dedup.suffix_array command, the process throws the exception below:

[07/17/23 12:05:51] INFO     Loading dataset...                                                          suffix_array.py:305
[07/17/23 12:05:52] INFO     Loading dataset... Done                                                     suffix_array.py:319
                    INFO     Started to preprocessing...                                                 suffix_array.py:322
[07/17/23 12:11:31] INFO     Started to preprocessing... Done                                            suffix_array.py:332
                    INFO     Started to suffix array...                                                  suffix_array.py:335
Data size: 19152440938
Total jobs: 100
Jobs at once: 20
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 0 --end-byte 191624409
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 191524409 --end-byte 383148818
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 383048818 --end-byte 574673227
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 574573227 --end-byte 766197636
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 766097636 --end-byte 957722045
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 957622045 --end-byte 1149246454
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1149146454 --end-byte 1340770863
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1340670863 --end-byte 1532295272
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1532195272 --end-byte 1723819681
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1723719681 --end-byte 1915344090
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 1915244090 --end-byte 2106868499
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2106768499 --end-byte 2298392908
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2298292908 --end-byte 2489917317
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2489817317 --end-byte 2681441726
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2681341726 --end-byte 2872966135
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 2872866135 --end-byte 3064490544
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3064390544 --end-byte 3256014953
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3255914953 --end-byte 3447539362
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3447439362 --end-byte 3639063771
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3638963771 --end-byte 3830588180
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 3830488180 --end-byte 4022112589
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 4022012589 --end-byte 4213636998
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 4213536998 --end-byte 4405161407
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 4405061407 --end-byte 4596685816
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 4596585816 --end-byte 4788210225
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 4788110225 --end-byte 4979734634
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 4979634634 --end-byte 5171259043
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 5171159043 --end-byte 5362783452
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 5362683452 --end-byte 5554307861
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 5554207861 --end-byte 5745832270
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 5745732270 --end-byte 5937356679
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 5937256679 --end-byte 6128881088
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 6128781088 --end-byte 6320405497
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 6320305497 --end-byte 6511929906
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 6511829906 --end-byte 6703454315
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 6703354315 --end-byte 6894978724
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 6894878724 --end-byte 7086503133
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7086403133 --end-byte 7278027542
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7277927542 --end-byte 7469551951
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7469451951 --end-byte 7661076360
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7660976360 --end-byte 7852600769
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7852500769 --end-byte 8044125178
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 8044025178 --end-byte 8235649587
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 8235549587 --end-byte 8427173996
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 8427073996 --end-byte 8618698405
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 8618598405 --end-byte 8810222814
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 8810122814 --end-byte 9001747223
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 9001647223 --end-byte 9193271632
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 9193171632 --end-byte 9384796041
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 9384696041 --end-byte 9576320450
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 9576220450 --end-byte 9767844859
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 9767744859 --end-byte 9959369268
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 9959269268 --end-byte 10150893677
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 10150793677 --end-byte 10342418086
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 10342318086 --end-byte 10533942495
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 10533842495 --end-byte 10725466904
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 10725366904 --end-byte 10916991313
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 10916891313 --end-byte 11108515722
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 11108415722 --end-byte 11300040131
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 11299940131 --end-byte 11491564540
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 11491464540 --end-byte 11683088949
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 11682988949 --end-byte 11874613358
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 11874513358 --end-byte 12066137767
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 12066037767 --end-byte 12257662176
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 12257562176 --end-byte 12449186585
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 12449086585 --end-byte 12640710994
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 12640610994 --end-byte 12832235403
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 12832135403 --end-byte 13023759812
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 13023659812 --end-byte 13215284221
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 13215184221 --end-byte 13406808630
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 13406708630 --end-byte 13598333039
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 13598233039 --end-byte 13789857448
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 13789757448 --end-byte 13981381857
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 13981281857 --end-byte 14172906266
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14172806266 --end-byte 14364430675
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14364330675 --end-byte 14555955084
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14555855084 --end-byte 14747479493
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14747379493 --end-byte 14939003902
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14938903902 --end-byte 15130528311
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 15130428311 --end-byte 15322052720
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 15321952720 --end-byte 15513577129
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 15513477129 --end-byte 15705101538
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 15705001538 --end-byte 15896625947
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 15896525947 --end-byte 16088150356
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 16088050356 --end-byte 16279674765
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 16279574765 --end-byte 16471199174
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 16471099174 --end-byte 16662723583
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 16662623583 --end-byte 16854247992
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 16854147992 --end-byte 17045772401
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 17045672401 --end-byte 17237296810
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 17237196810 --end-byte 17428821219
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 17428721219 --end-byte 17620345628
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 17620245628 --end-byte 17811870037
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 17811770037 --end-byte 18003394446
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18003294446 --end-byte 18194918855
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18194818855 --end-byte 18386443264
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18386343264 --end-byte 18577967673
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18577867673 --end-byte 18769492082
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18769392082 --end-byte 18961016491
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 18960916491 --end-byte 19152440938
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Waiting for jobs to finish
/bin/sh: ./target/debug/dedup_dataset: No such file or directory
Checking all wrote correctly
>>>>: output/temp_text.txt.part.0-191624409
Traceback (most recent call last):
  File "/data/dedup_for_llm/text-dedup/deduplicate-text-datasets/scripts/make_suffix_array.py", line 71, in <module>
    size_data = os.path.getsize(x)
  File "/data/miniconda3/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-191624409'
Traceback (most recent call last):
  File "/data/miniconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/miniconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 378, in <module>
    ds.save_to_disk(args.output)
  File "/data/dedup_for_llm/text-dedup/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 340, in <module>
    logger.info("Started to suffix array... Done")
  File "/data/dedup_for_llm/text-dedup/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 336, in <module>
    __run_command(
  File "/data/dedup_for_llm/text-dedup/text_dedup/suffix_array.py", line 247, in __run_command
    raise RuntimeError(f"Command {cmd} failed with code {code}. CWD: {cwd}")
RuntimeError: Command python scripts/make_suffix_array.py output/temp_text.txt failed with code 1. CWD: /data/dedup_for_llm/text-dedup/deduplicate-text-datasets

Based on the error, it could not find one of the part files: "FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-191624409'". I checked several times, but no files were created other than "output/temp.txt".
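
For reference, the byte ranges in the log look consistent with splitting the data file into 100 equal chunks, each extended by a 100,000-byte overlap. The following is a minimal sketch that reproduces the expected part file names so they can be checked on disk. The chunking rule and the overlap value are inferred only from the numbers printed above, and the helper names are hypothetical.

```python
import os


def expected_parts(data_file, data_size, total_jobs=100, overlap=100_000):
    """Return the part-file names implied by the byte ranges in the log.

    Assumption: chunk i covers [i * step, (i + 1) * step + overlap),
    clipped to data_size, where step = data_size // total_jobs.
    """
    step = data_size // total_jobs
    names = []
    for i in range(total_jobs):
        start = i * step
        end = min((i + 1) * step + overlap, data_size)
        names.append(f"{data_file}.part.{start}-{end}")
    return names


def missing_parts(names):
    """Return the subset of names that do not exist on disk."""
    return [name for name in names if not os.path.exists(name)]


# Values taken from the log: "Data size: 19152440938", "Total jobs: 100".
parts = expected_parts("output/temp_text.txt", 19152440938)
print(parts[0])
print(len(missing_parts(parts)), "of", len(parts), "part files missing")
```

If every part file is reported missing, the make-part commands themselves never ran, which points at the binary rather than at the data.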

I couldn't figure out the solution. Could you help me?

I think I forgot to run cargo build in the Google repo. I will try it and post my result here.

Yes, I solved it with cargo build. :)
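
For anyone hitting the same error, here is a minimal sketch of a pre-flight check that could be run before starting the suffix array step. It assumes the binary lives at target/debug/dedup_dataset inside the path passed as --google_repo_path (the path shown in the log above) and that cargo is on the PATH. The function names are hypothetical and not part of text-dedup.

```python
import subprocess
from pathlib import Path


def dedup_binary(repo_path):
    """Path of the debug binary that the log shows being invoked."""
    return Path(repo_path) / "target" / "debug" / "dedup_dataset"


def ensure_dedup_binary(repo_path, build=True):
    """Return the binary path, running `cargo build` first if it is missing.

    Raises FileNotFoundError if the binary is absent and build is False,
    or if it is still absent after the build.
    """
    binary = dedup_binary(repo_path)
    if binary.is_file():
        return binary
    if not build:
        raise FileNotFoundError(f"{binary} not found; run `cargo build` in {repo_path}")
    # Build in the repository root, as was done manually in this issue.
    subprocess.run(["cargo", "build"], cwd=repo_path, check=True)
    if not binary.is_file():
        raise FileNotFoundError(f"{binary} still missing after `cargo build`")
    return binary


# Example call (commented out so that importing this sketch has no side effects):
# ensure_dedup_binary("/data/dedup_for_llm/text-dedup/deduplicate-text-datasets")
```

Failing early like this avoids waiting through the preprocessing step only to hit "No such file or directory" for every make-part job.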