Please replace $SCRATCH
in all scripts with the actual path to the_stack_tfds
on your machine
Dependencies: git-lfs (to git clone large files in hf datasets), tfds-nightly, zstandard, fastparquet
Run the download_scripts\get_the_stack_dedup.sh
. the-stack-dedup repo has a Terms of Use so you cannot clone it directly. Please agree with the repo's Terms of Use first on huggingface and enter your huggingface username and access token in git clone command like this.
git clone https://YOUR_HF_USERNAME:YOUR_HF_ACCESS_TOKEN@huggingface.co/datasets/bigcode/the-stack-dedup
Run build_scripts/generate_the_stack_dedup.sh
to generate TFDS in the the_stack_data
directory.
Among the script:
--manual_dir: The source directory for storing raw data.
--data_dir: The target directory for storing the generated TFDS.
Install gsutil and sign in your Google Account, run the_stack_data/upload.sh
to upload TFDS to your google storage bucket.
ref