ChenghaoMou/text-dedup

FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-7320579'

listentomi opened this issue · 4 comments

When I tried to run text deduplication on my own local dataset, I got the error below.

```
python -m text_dedup.suffix_array --path "csv" --data_files train_enron_emails_dataset.csv --cache_dir './cache' --output 'output' --split 'train' --column 'text' --google_repo_path "deduplicate-text-datasets" --no-use_auth_token --local
Dataset({
    features: ['file', 'text'],
    num_rows: 10348
})
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 0 --end-byte 7320579
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 7220579 --end-byte 14541158
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 14441158 --end-byte 21761737
/bin/sh: 1: ./target/debug/dedup_dataset: not found
./target/debug/dedup_dataset make-part --data-file output/temp_text.txt --start-byte 21661737 --end-byte 28882318
/bin/sh: 1: ./target/debug/dedup_dataset: not found
Waiting for jobs to finish
/bin/sh: 1: ./target/debug/dedup_dataset: not found
/bin/sh: 1: ./target/debug/dedup_dataset: not found
Checking all wrote correctly
Traceback (most recent call last):
  File "/root/text-dedup-main/deduplicate-text-datasets/scripts/make_suffix_array.py", line 66, in <module>
    size_data = os.path.getsize(x)
  File "/root/miniconda3/lib/python3.10/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: 'output/temp_text.txt.part.0-7320579'
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 299, in <module>
    with timer("Total"):
  File "/root/text-dedup-main/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 324, in <module>
    with timer("SuffixArray"):
  File "/root/text-dedup-main/text_dedup/utils/timer.py", line 19, in __exit__
    raise exc_val
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 325, in <module>
    __run_command(
  File "/root/text-dedup-main/text_dedup/suffix_array.py", line 244, in __run_command
    raise RuntimeError(f"Command {cmd} failed with code {code}. CWD: {cwd}")
RuntimeError: Command python scripts/make_suffix_array.py output/temp_text.txt failed with code 1. CWD: deduplicate-text-datasets
```

I can't figure it out alone. Can anyone help me solve this problem, and is there a tutorial for custom datasets like .txt and .csv files?

`/bin/sh: 1: ./target/debug/dedup_dataset: not found` means that the Google repository is not initialized properly. Please follow the instructions at https://github.com/google-research/deduplicate-text-datasets to build all the executables (i.e. run `cargo build` inside the cloned deduplicate-text-datasets directory).

You can clone the repo and modify the code to load your own data. You can find solutions here.
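For the custom-dataset question, here is a minimal sketch of loading local .csv and .txt files with Hugging Face `datasets`, which matches the `--path`/`--data_files` flags in the command above. The file names here are placeholders for your own data.

```python
from datasets import load_dataset

# CSV: each row becomes a record; point --column at the text column.
csv_ds = load_dataset("csv", data_files="train_enron_emails_dataset.csv", split="train")

# Plain text: each line becomes one record with a single "text" column.
txt_ds = load_dataset("text", data_files="my_corpus.txt", split="train")

print(csv_ds)
```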

Thanks a lot! It seems to run correctly after entering the command `cargo build`.
But I also see an error message like this:
`rm: cannot remove 'tmp/out.table.bin.*': No such file or directory`
Is that OK? There seem to be no other faults in the rest of the run.
Since you seem very familiar with the deduplicate-text-datasets repo, may I ask one more question: do you know how to use its feature for finding duplicates between two different documents? I am new to Rust, and I don't know what data file format the `cargo run across-similar --data-file-1 [dataset1] --data-file-2 [dataset2] --length-threshold [num_bytes] --cache-dir [where/to/save] --num-threads [N]` command supports. Can you teach me? Thank you very much again.

Glad it is working. You can ignore that error.

It is the same data format used by `self-similar`, i.e. whatever you have produced in the previous commands. The commands are much easier to understand if you read through the code starting from `if __name__ == "__main__":` in text_dedup/suffix_array.py in this repo.
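For a concrete picture, here is a minimal sketch of that format as I understand it: text_dedup concatenates the UTF-8 bytes of every document into one flat file (output/temp_text.txt in your log), and `across-similar` takes two such files. The helper below is hypothetical; check suffix_array.py for the exact assembly, e.g. whether any separator is inserted between documents.

```python
# Hypothetical helper: write a corpus as one flat file of concatenated
# UTF-8 bytes, the same shape as output/temp_text.txt in the log above.
# Assumption: plain concatenation with no per-document separator.
def write_raw_corpus(texts: list[str], path: str) -> None:
    with open(path, "wb") as f:
        for text in texts:
            f.write(text.encode("utf-8"))

write_raw_corpus(["first document ...", "second document ..."], "dataset1")
write_raw_corpus(["third document ..."], "dataset2")

# Then, from the deduplicate-text-datasets directory (example values):
#   cargo run across-similar --data-file-1 dataset1 --data-file-2 dataset2 \
#       --length-threshold 100 --cache-dir tmp/cache --num-threads 8
```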

I might add support for deduplication between datasets back in the near future.

I got it! I will try it by myself first. Thanks a lot!!!