How to deduplicate a dataset already saved with save_to_disk?
StephennFernandes opened this issue · 2 comments
Hey @ChenghaoMou, thanks for building such an amazing tool.
Could you please link to your documentation for text-dedup?
I am looking for a way to deduplicate a dataset that was already saved with save_to_disk and that I am loading with load_from_disk. I see there is a --local flag for this, but I could not get it to work.
There used to be documentation for older versions, but it is less useful now because the project is primarily CLI-focused. You can use --help to see all the parameters each script supports.
The best course of action I would recommend is to 1) clone this repo and 2) modify the code to your needs. In this case, you can simply change the load_dataset function call in the script you want to use to load_from_disk to load your local data.
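As a rough illustration of that swap, here is a minimal sketch. It is not code from text-dedup. The `local` parameter, the `"text"` column name, and all helper names are hypothetical. The exact-match hashing step is only a stand-in for whichever deduplication method the chosen script implements.

```python
import hashlib


def load_input(path, local=False, split="train"):
    """Load a dataset from disk when `local` is true, else via load_dataset."""
    # Imported lazily so the hashing helper below works without `datasets`.
    from datasets import DatasetDict, load_dataset, load_from_disk

    if local:
        ds = load_from_disk(path)
        # A saved DatasetDict loads back as a DatasetDict; pick one split.
        return ds[split] if isinstance(ds, DatasetDict) else ds
    return load_dataset(path, split=split)


def exact_dedup_indices(texts):
    """Return the indices of the first occurrence of each distinct text."""
    seen = set()
    keep = []
    for i, text in enumerate(texts):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(i)
    return keep


def dedup_saved_dataset(in_path, out_path, column="text"):
    """Hypothetical end-to-end helper: load, drop exact duplicates, save."""
    ds = load_input(in_path, local=True)
    keep = exact_dedup_indices(ds[column])
    ds.select(keep).save_to_disk(out_path)
```

Calling `dedup_saved_dataset("path/to/saved", "path/to/deduped")` with your own paths would write a copy with exact duplicates removed. For near-duplicate detection, the same load_from_disk substitution would go into the relevant script in the repo instead.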
I am closing this due to inactivity. Feel free to reopen if you have additional questions.