LinkSoul-AI/Chinese-Llama-2-7b

Stuck: training hangs

zyxcambridge opened this issue · 0 comments

The run hangs with no progress, so I interrupt it with Ctrl-C; torchrun then sends SIGTERM to the workers:

KeyboardInterrupt
^CWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1383 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1384 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1387 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388 closing signal SIGTERM
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 877, in _invoke_run
    time.sleep(monitor_interval)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1313 got signal: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 743, in run
    self._shutdown(e.sigval)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 289, in _shutdown
    self._pcontext.close(death_sig)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 331, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 713, in _close
    handler.proc.wait(time_to_wait)
  File "/usr/lib/python3.10/subprocess.py", line 1207, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1935, in _wait
    time.sleep(delay)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1313 got signal: 2
^C
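Before killing a hung run, it may be worth checking what the workers are actually blocked on (NCCL init, the dataset/tokenizer cache build, or a model download). A minimal sketch, assuming py-spy can be installed inside the container; the PID is taken from the SIGTERM lines above:

```bash
# Dump the Python stack of one hung worker to see where it is blocked.
pip install py-spy
py-spy dump --pid 1383

# Check whether the workers ever allocated anything on the GPUs.
nvidia-smi
```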
Re-running the same launch command:

root@05df7bede17a:/mnt/update/LLM/Chinese-Llama-2-7b# torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 \
    --master_port=25003 \
    train.py \
    --data_path ${DATASET} \
    --data_cache_path ${DATA_CACHE_PATH} \
    --bf16 True \
    --output_dir ${output_dir} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy 'no' \
    --save_strategy 'steps' \
    --save_steps 1200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --fsdp 'full_shard auto_wrap' \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Most ranks then crash during TrainingArguments initialization. Their output is interleaved because several processes print the same traceback at once; one representative copy:

/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1516: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
  warnings.warn(
Traceback (most recent call last):
  File "/mnt/update/LLM/Chinese-Llama-2-7b/train.py", line 284, in <module>
    train()
  File "/mnt/update/LLM/Chinese-Llama-2-7b/train.py", line 250, in train
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.10/dist-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 114, in __init__
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1739, in _setup_devices
    self.distributed_state = PartialState(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 198, in __init__
    torch.cuda.set_device(self.device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 367, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Several other ranks print the identical RuntimeError: CUDA error: invalid device ordinal.
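`invalid device ordinal` usually means torchrun spawned more local ranks than there are GPUs visible inside the container, and the command above asks for `--nproc_per_node=8`. A quick check, assuming the mismatch comes from restricted GPU visibility (e.g. `CUDA_VISIBLE_DEVICES` or the `docker run --gpus` setting):

```bash
# How many GPUs does this container actually see?
nvidia-smi -L
python -c "import torch; print(torch.cuda.device_count())"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# If only N GPUs are visible, relaunch with a matching process count, e.g. N=4:
#   torchrun --nnodes=1 --node_rank=0 --nproc_per_node=4 --master_port=25003 train.py ...
```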

One rank gets past argument parsing and instead fails while fetching the default model from the Hub:

/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1516: FutureWarning: using --fsdp_transformer_layer_cls_to_wrap is deprecated. Use fsdp_config instead
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1291, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Connection error, and we cannot find the requested files in the disk cache. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/update/LLM/Chinese-Llama-2-7b/train.py", line 284, in <module>
    train()
  File "/mnt/update/LLM/Chinese-Llama-2-7b/train.py", line 252, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 461, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 983, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 617, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 452, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like facebook/opt-125m is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
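Since the launch command never passes `--model_name_or_path`, train.py appears to fall back to its default, `facebook/opt-125m`, and then fails because huggingface.co is unreachable from this host. A possible workaround, assuming train.py exposes the usual `--model_name_or_path` argument and the weights can be copied in by hand (paths are illustrative):

```bash
# Option A: fetch the model on a machine with network access, copy it into the
# container, and point train.py at the local directory instead of a Hub repo id.
git lfs install
git clone https://huggingface.co/facebook/opt-125m /mnt/update/models/opt-125m
#   ...then add to the torchrun command:  --model_name_or_path /mnt/update/models/opt-125m

# Option B: if the files are already in the local HF cache, force offline mode so a
# flaky connection cannot abort the run (see the offline-mode link in the error above).
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```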
Downloading (…)lve/main/config.json: 100%|██████████| 651/651 [00:00<00:00, 2.79MB/s]
Downloading pytorch_model.bin:   0%|          | 0.00/251M [00:00<?, ?B/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1487 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1488) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0a0+4136153', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1489)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 1490)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 4 (local_rank: 4)
exitcode : 1 (pid: 1491)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 5 (local_rank: 5)
exitcode : 1 (pid: 1492)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 6 (local_rank: 6)
exitcode : 1 (pid: 1493)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 1494)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-07-26_08:31:56
host : 05df7bede17a
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1488)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
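The summary only records `exitcode : 1` for each rank; the per-rank tracebacks are lost unless the output is captured. The linked elastic-errors page describes wrapping the entry point with the `record` decorator; alternatively, recent torchrun versions can redirect each rank's stdout/stderr to files. A sketch of the latter (flag names assumed from current torchrun, same training arguments as above):

```bash
# Capture per-rank logs so the real traceback of the first failing rank is preserved.
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 --master_port=25003 \
    --log_dir ./torchrun_logs --redirects 3 --tee 3 \
    train.py ...   # same arguments as the launch command above
# Per-rank stdout/stderr files then appear under ./torchrun_logs/.
```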