huggingface/datasets

`push_to_hub` overwrite argument

ceferisbarov opened this issue · 9 comments

Feature request

Add an overwrite argument to the push_to_hub method.

Motivation

I want to overwrite a repo without deleting it on Hugging Face. Is this possible? I couldn't find anything in the documentation or tutorials.

Your contribution

I can create a PR.

Hi! Do you mean deleting all the files, or erasing the repository's git history before push_to_hub?

Hi! I meant the latter.

I don't think there is a huggingface_hub utility to erase the git history. cc @Wauplin maybe?

What is the goal exactly of deleting all the git history without deleting the repo?

You can use super_squash_commit to squash all the commits into a single one, hence deleting the git history. This is not exactly what you asked for, since it only squashes the commits for a specific revision (example: "all commits on main"). This means that if other branches exist, they are left unchanged. Also, if some PRs are already open on the repo, they will become unmergeable since the commits will have diverged.

So the solution is:

from huggingface_hub import HfApi
repo_id = "username/dataset_name"
ds.push_to_hub(repo_id)
HfApi().super_squash_commit(repo_id)

This way you erase previous git history to end up with only 1 commit containing your dataset.
Still, I'd be curious why it's important in your case. Is it to save storage space, or to disallow loading old versions of the data?

Thanks, everyone! I am building a new dataset and playing around with column names, splits, etc. Sometimes I push to the Hub to share it with teammates, and I don't want those variations to be part of the repo history. Deleting the repo from the website takes a little time, and it also loses the repo settings I have configured, since I always set it to public with manually approved access requests.

BTW, I had to write HfApi().super_squash_history(repo_id, repo_type="dataset"), but otherwise it works.
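
Putting the thread together, the push-then-squash workflow could be wrapped roughly as follows. This is only a sketch: the helper name push_and_squash and its api and branch parameters are hypothetical, introduced for illustration. Only push_to_hub and HfApi.super_squash_history (with repo_type="dataset") come from the libraries discussed above.

```python
def push_and_squash(ds, repo_id, api=None, branch=None):
    """Push `ds` to the Hub, then squash the branch history into one commit.

    `ds` is anything exposing `push_to_hub(repo_id)`, e.g. a `datasets.Dataset`.
    `api` is an optional `huggingface_hub.HfApi`-like client (hypothetical
    parameter, handy for testing). `branch=None` targets the default branch.
    """
    if api is None:
        # Imported lazily so the helper can also be exercised with a stand-in client.
        from huggingface_hub import HfApi

        api = HfApi()

    # Upload the current version of the dataset.
    ds.push_to_hub(repo_id)

    # Squash every commit on the target branch into a single one.
    # repo_type="dataset" is needed; without it the Hub looks for a model repo.
    api.super_squash_history(repo_id=repo_id, repo_type="dataset", branch=branch)
```

Usage would be push_and_squash(ds, "username/dataset_name"). As noted earlier in the thread, only the targeted branch is squashed, other branches are left as they are, and open PRs may become unmergeable.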