ChenghaoMou/text-dedup

How to deduplicate a dataset already saved with save_to_disk?

StephennFernandes opened this issue · 2 comments

Hey @ChenghaoMou, thanks for building such an amazing tool.

Could you please link to the documentation for text-dedup?

I am looking for a way to deduplicate a dataset that was already saved with save_to_disk and that I am loading with load_from_disk. I see there is a --local flag to be used, but I could not really make it work.

There used to be documentation for older versions, but it is less useful now because it is primarily CLI-focused. You can use --help to see all the parameters each script supports.

The course of action I would recommend is to 1) clone this repo and 2) modify the code to your needs. In this case, you can simply change the load_dataset function call in the script you want to use to load_from_disk so that it loads your local data.
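For anyone landing here later, a minimal sketch of that swap might look like the following. It is not code from this repo: the helper name `load_input` and its `local` and `split` parameters are hypothetical, and only `load_dataset`, `load_from_disk`, and `DatasetDict` come from the Hugging Face `datasets` library. It also assumes that if you saved a DatasetDict rather than a single Dataset, you want one split out of it.

```python
from typing import Optional


def load_input(path: str, local: bool, split: Optional[str] = "train"):
    """Hypothetical helper: load from a save_to_disk directory or from the Hub."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import DatasetDict, load_dataset, load_from_disk

    if local:
        # `path` is the directory that was passed to save_to_disk earlier.
        ds = load_from_disk(path)
        # save_to_disk on a DatasetDict round-trips as a DatasetDict,
        # so pick the requested split to end up with a single Dataset.
        if isinstance(ds, DatasetDict):
            ds = ds[split]
        return ds

    # Otherwise behave like a plain load_dataset call.
    return load_dataset(path, split=split)
```

In the script you want to use, you would replace the existing load_dataset call with something like `ds = load_input("path/to/saved_dataset", local=True)` and leave the rest of the deduplication code untouched.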

I am closing this due to inactivity. Feel free to reopen if you have additional questions.