Errors when trying to fine-tune the model

Question

SleepEarlyLiveLong opened this issue a year ago · 2 comments

I ran into the following problem while fine-tuning the model by following README.md.

Here is the detailed error:

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# sh order.sh
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/training_args.py:1462: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
warnings.warn(
Downloading shards: 0%| | 0/33 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 417, in cached_file
resolved_file = hf_hub_download(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
return fn(*args, **kwargs)
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/stanford_alpaca/train.py", line 222, in <module>
train()
File "/data/stanford_alpaca/train.py", line 186, in train
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
return model_class.from_pretrained(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2523, in from_pretrained
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 934, in get_checkpoint_shard_files
cached_filename = cached_file(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/utils/hub.py", line 452, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like decapoda-research/llama-7b-hf is not the path to a directory containing a file named pytorch_model-00001-of-00033.bin.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
Downloading shards: 0%| | 0/33 [00:00<?, ?it/sWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25946 closing signal SIGTERM1.5M/405M [00:03<00:38, 9.72MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25948 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25949 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 25947) of binary: /data/miniconda3/envs/alpaca/bin/python
Traceback (most recent call last):
File "/data/miniconda3/envs/alpaca/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-06-05_21:03:54
host : iZwz95ccn6prjs8ioz8bbdZ
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 25947)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Here is the command I ran:

torchrun --nproc_per_node=4 --master_port=7788 train.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True

Here are some details of my machine:

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

(alpaca) root@iZwz95ccn6prjs8ioz8bbdZ:/data/stanford_alpaca# conda list

# packages in environment at /data/miniconda3/envs/alpaca:

What is causing this error, and how can I fix it? Thanks a lot!

Answer 1 · 2023-06-07T12:46:43.000Z

I ran into the same error.

Answer 2 · 2023-06-09T03:40:54.000Z

I ran into the same error.

I solved the problem by upgrading Python from 3.9 to 3.10.
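
For anyone else debugging this: the OSError in the traceback says that pytorch_model-00001-of-00033.bin could neither be downloaded nor found in the cache or in a local directory. If you keep a local copy of the checkpoint and pass its path as --model_name_or_path, a quick check that every shard is actually on disk may help rule that out. The sketch below is only an illustration, not part of the stanford_alpaca repo: the helper name missing_shards is hypothetical, and it assumes the directory contains the usual pytorch_model.bin.index.json shard index with a "weight_map" entry.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard file names listed in the checkpoint index but absent on disk.

    Hypothetical helper: assumes model_dir holds a sharded PyTorch checkpoint
    with a pytorch_model.bin.index.json index file.
    """
    model_dir = Path(model_dir)
    index_path = model_dir / "pytorch_model.bin.index.json"
    if not index_path.is_file():
        raise FileNotFoundError(f"no shard index found at {index_path}")

    # "weight_map" maps each parameter name to the shard file that stores it.
    with index_path.open() as f:
        weight_map = json.load(f)["weight_map"]

    # Many parameters share a shard, so de-duplicate before checking the disk.
    shard_names = sorted(set(weight_map.values()))
    return [name for name in shard_names if not (model_dir / name).is_file()]
```

Calling missing_shards on your local model directory returns an empty list when all shards are present; any names it returns are the files that still need to be downloaded before relaunching torchrun.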