How to download the tokenized books3 dataset?
DarthMurse opened this issue · 1 comment
I am very intrigued by the idea presented in TTT and want to reproduce the training process myself. However, I encountered some difficulties downloading the dataset. I had never used the Google Cloud CLI before. After installing the Google Cloud SDK and finishing gcloud init, I tried the command provided by the author, but it gave this error:
% gcloud storage cp -r "gs://llama-2-books3/*" llama-2-books3/
Completed files 0 | 0B
ERROR: (gcloud.storage.cp) HTTPError 400: Bucket is a requester pays bucket but no user project provided.
Then I searched online and added an extra parameter to work around that error, but it then gave:
% gcloud storage cp -r "gs://llama-2-books3/*" llama-2-books3/ --billing-project=murse1017
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/test.npy to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/test.npy
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/tokenizer.pkl to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/tokenizer.pkl
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/train.npy to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/train.npy
Copying gs://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/validation.npy to file://llama-2-books3/tokenizer_name-meta-llama/Llama-2-7b-hf-val_ratio-0.0005-val_split_seed-2357-add_eos-True-detokenize-False/validation.npy
ERROR: HTTPError 403 | 0B/60.7GiB
Completed files 0/4 | 0B/60.7GiB
The transfer then stays stuck there forever. How can I solve this problem?
(I've searched for solutions online but still can't figure it out. The Google Cloud tooling really confuses me...)
The command provided works on our end. I would suggest re-logging in to your Google Cloud account with gcloud auth login, and then setting the project with gcloud config set project my-gcp-project.
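Putting the suggestions together, a sketch of the full sequence (my-gcp-project is a placeholder for your own project ID; the bucket name is the one from the thread). Note that because the bucket is requester-pays, the download is billed to the project you pass, so your account needs the serviceusage.services.use permission on that project; a 403 here typically means the credentials or project permissions are wrong rather than the command itself:

```shell
# Re-authenticate; this opens a browser window for the Google account login.
gcloud auth login

# Point gcloud at a project you own and can bill to.
# "my-gcp-project" is a placeholder project ID.
gcloud config set project my-gcp-project

# Retry the copy, explicitly billing the transfer to that project.
gcloud storage cp -r "gs://llama-2-books3/*" llama-2-books3/ \
  --billing-project=my-gcp-project
```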