How to download the tokenized books3 dataset?
DarthMurse opened this issue · 1 comment
I am very intrigued by the idea presented in TTT and want to reproduce the training process myself. However, I encountered some difficulties downloading the dataset. I had never used the Google Cloud CLI before. After installing the Google Cloud SDK and finishing gcloud init, I tried the command provided by the author, but it gave this error:
% gcloud storage cp -r "gs://llama-2-books3/*" llama-2-books3/
Completed files 0 | 0B
ERROR: (gcloud.storage.cp) HTTPError 400: Bucket is a requester pays bucket but no user project provided.
Then I searched online and added an extra parameter to work around that error, but it then gave:
% gcloud storage cp -r "gs://llama-2-books3/*" llama-2-books3/ --billing-project=murse1017
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/test.npy to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/test.npy
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/tokenizer.pkl to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/tokenizer.pkl
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/train.npy to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/train.npy
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/validation.npy to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/validation.npy
ERROR: HTTPError 403 | 0B/60.7GiB
Completed files 0/4 | 0B/60.7GiB
The transfer then stays stuck there forever. How can I solve this problem?
(I've searched for solutions online but still can't figure it out. The Google Cloud tooling really confuses me...)
The command provided works on our end. I would suggest re-logging in to your Google Cloud account with gcloud auth login, and then setting the project with gcloud config set project my-gcp-project.
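Putting the suggestions together, a sketch of the full sequence (my-gcp-project is a placeholder for your own project ID; the bucket name is the one from the thread). Note that because the bucket is requester-pays, the download is billed to the project you pass, so your account needs the serviceusage.services.use permission on that project; a 403 here typically means the credentials or project permissions are wrong rather than the command itself:

```shell
# Re-authenticate; this opens a browser window for the Google account login.
gcloud auth login

# Point gcloud at a project you own and can bill to.
# "my-gcp-project" is a placeholder project ID.
gcloud config set project my-gcp-project

# Retry the copy, explicitly billing the transfer to that project.
gcloud storage cp -r "gs://llama-2-books3/*" llama-2-books3/ \
  --billing-project=my-gcp-project
```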