tatsu-lab/stanford_alpaca

Errors when trying to fine-tune the model

SleepEarlyLiveLong opened this issue · 2 comments

I ran into the following problem when fine-tuning the model by following the instructions in README.md.

Here is the detailed error:

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# sh order.sh
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
Downloading shards: 0%| | 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/stanford_alpaca/train.py", line 222, in <module>
train()
File "/data/stanford_alpaca/train.py", line 186, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
return model_class.from_pretrained(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2523, in from_pretrained
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 934, in get_checkpoint_shard_files
cached_filename = cached_file(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 452, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like decapoda-research/llama-7b-hf is not the path to a directory containing a file named pytorch_model-00001-of-00033.bin.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Downloading shards: 0%| | 0/33 [00:00<?, ?it/sWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25946 closing signal SIGTERM1.5M/405M [00:03<00:38, 9.72MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25948 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25949 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 25947) of binary: /data/miniconda3/envs/alpaca/bin/python
Traceback (most recent call last):
File "/data/miniconda3/envs/alpaca/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-06-05_21:03:54
host : iZwz95ccn6prjs8ioz8bbdZ
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 25947)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Here is the command I ran (the contents of order.sh):

torchrun --nproc_per_node=4 --master_port=7788 train.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

Here are some details of my machine:


(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# conda list

# packages in environment at /data/miniconda3/envs/alpaca:
#
# Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.4.0 pypi_0 pypi
accelerate 0.19.0 pypi_0 pypi
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
ca-certificates 2023.01.10 h06a4308_0
certifi 2023.5.7 pypi_0 pypi
charset-normalizer 3.1.0 pypi_0 pypi
click 8.1.3 pypi_0 pypi
cmake 3.26.3 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
fairscale 0.4.13 pypi_0 pypi
filelock 3.12.0 pypi_0 pypi
fire 0.5.0 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.5.0 pypi_0 pypi
gitdb 4.0.10 pypi_0 pypi
gitpython 3.1.31 pypi_0 pypi
huggingface-hub 0.15.1 pypi_0 pypi
idna 3.4 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
joblib 1.2.0 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
lit 16.0.5 pypi_0 pypi
llama 0.0.0 dev_0
markupsafe 2.1.2 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.1 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
numpy 1.24.3 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
openai 0.27.7 pypi_0 pypi
openssl 1.1.1t h7f8727e_0
packaging 23.1 pypi_0 pypi
pathtools 0.1.2 pypi_0 pypi
pip 23.0.1 py39h06a4308_0
protobuf 4.23.2 pypi_0 pypi
psutil 5.9.5 pypi_0 pypi
python 3.9.16 h7a1cb2a_2
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2023.5.5 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
rouge-score 0.1.2 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
sentry-sdk 1.24.0 pypi_0 pypi
setproctitle 1.3.2 pypi_0 pypi
setuptools 67.8.0 py39h06a4308_0
six 1.16.0 pypi_0 pypi
smmap 5.0.0 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
sympy 1.12 pypi_0 pypi
termcolor 2.3.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.3 pypi_0 pypi
torch 2.0.1 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
transformers 4.29.2 pypi_0 pypi
triton 2.0.0 pypi_0 pypi
typing-extensions 4.6.2 pypi_0 pypi
tzdata 2023c h04d1e81_0
urllib3 1.26.16 pypi_0 pypi
wandb 0.15.3 pypi_0 pypi
wheel 0.38.4 py39h06a4308_0
xz 5.4.2 h5eee18b_0
yarl 1.9.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_0

What is causing this error, and how can I fix it? Thanks a lot!

I ran into the same error.

I ran into the same error.

I solved the problem by upgrading Python from 3.9 to 3.10.
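
For anyone else landing here: the traceback in the original post is a download failure. The workers started by torchrun appear to fetch the 33 checkpoint shards at the same time, and rank 1 died with a connection error. One possible workaround, separate from the Python upgrade, is to download the checkpoint once in a single process, retrying on connection errors, and then train from the local copy. The sketch below is untested against this exact setup. `download_with_retries` and `prefetch_model` are hypothetical helper names, the retry count and delays are arbitrary, and it assumes connection failures surface as `OSError` (as they do in the traceback). `snapshot_download` is the `huggingface_hub` function that downloads a whole repo into the local cache and returns the path to it.

```python
import time


def download_with_retries(fetch, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure.

    Assumption: connection problems are raised as OSError (or a subclass),
    which matches the OSError shown in the traceback in this issue.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as err:
            last_error = err
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


def prefetch_model(repo_id="decapoda-research/llama-7b-hf"):
    """Download all files of the repo once and return the local snapshot path."""
    # Imported lazily so the retry helper can be used without huggingface_hub.
    from huggingface_hub import snapshot_download

    return download_with_retries(lambda: snapshot_download(repo_id=repo_id))
```

Calling `prefetch_model()` once in a plain Python session (not under torchrun) should print nothing and return a local directory. Passing that directory as `--model_name_or_path` in order.sh means the four training processes read the shards from disk instead of each opening its own download. The offline-mode page linked in the error message also describes environment variables for forcing the library to use only cached files.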