/the_stack_tfds

Scripts to generate TFDS format of the-stack-dedup dataset.

Primary LanguagePython

Please replace $SCRATCH in all scripts with the actual path to the_stack_tfds on your machine
Dependencies: git-lfs (to git clone large files in hf datasets), tfds-nightly, zstandard, fastparquet

1. Download the-stack-dedup

Run the download_scripts\get_the_stack_dedup.sh. the-stack-dedup repo has a Terms of Use so you cannot clone it directly. Please agree with the repo's Terms of Use first on huggingface and enter your huggingface username and access token in git clone command like this.
git clone https://YOUR_HF_USERNAME:YOUR_HF_ACCESS_TOKEN@huggingface.co/datasets/bigcode/the-stack-dedup

2. Generate TFDS of the-stack-dedup

Run build_scripts/generate_the_stack_dedup.sh to generate TFDS in the the_stack_data directory.
Among the script:

--manual_dir: The source directory for storing raw data.
--data_dir: The target directory for storing the generated TFDS.

3. Upload the TFDS to Google Cloud

Install gsutil and sign in your Google Account, run the_stack_data/upload.sh to upload TFDS to your google storage bucket.
ref