Multi-node multi-GPU training: both machines make no further progress after this step — [INFO|trainer.py:641] 2024-07-18 14:06:18,182 >> Using auto half precision backend
cc8476 opened this issue · 2 comments
cc8476 commented
Pre-submission checklist
- Make sure you are using the latest code from the repository (git pull).
- I have read the project documentation and FAQ, and searched the existing issues without finding a similar problem or solution.
- Third-party plugin issues (e.g. llama.cpp, text-generation-webui): please look for a solution in the corresponding project first.
Issue type
Model training and fine-tuning
Base model
Llama-3-Chinese-8B (base model)
Operating system
Linux
Detailed problem description
Master node:
--nnodes 2 --nproc_per_node 1 --master_addr "10.164.1.4" --master_port 14545 --node_rank 0
Worker node:
--nnodes 2 --nproc_per_node 1 --master_addr "10.164.1.4" --master_port 14545 --node_rank 1
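For reference, a minimal sketch of what the complete launch commands implied by these flags would look like, assuming a hypothetical training script train.py (the actual script name and its training arguments are not shown in this issue):

```bash
# Master node (10.164.1.4), rank 0; train.py is a placeholder name
torchrun --nnodes 2 --nproc_per_node 1 \
  --master_addr "10.164.1.4" --master_port 14545 \
  --node_rank 0 train.py

# Worker node, rank 1 -- identical except for --node_rank
torchrun --nnodes 2 --nproc_per_node 1 \
  --master_addr "10.164.1.4" --master_port 14545 \
  --node_rank 1 train.py
```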
Dependencies (required for code-related issues)
Model: llama3_8b (the official base model)
Hardware:
H800 × 8
Software:
torch.__version__
2.3.1
torch.version.cuda
12.1
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
In addition, NCCL and nvidia-fabricmanager are both installed and running normally.
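Since both nodes stall right after the "Using auto half precision backend" line, i.e. typically just before the first cross-node NCCL handshake, a common diagnostic is to turn on NCCL's own logging before relaunching. A minimal sketch; NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_SOCKET_IFNAME and TORCH_DISTRIBUTED_DEBUG are standard environment variables, and the interface name eth0 is only a placeholder assumption:

```bash
# Set on BOTH nodes before relaunching torchrun
export NCCL_DEBUG=INFO                 # print NCCL init/transport details
export NCCL_DEBUG_SUBSYS=INIT,NET      # limit output to init + network
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra torch.distributed checks

# Pin NCCL to the NIC that actually routes between the two machines;
# "eth0" is a placeholder -- check `ip addr` for the real interface name
export NCCL_SOCKET_IFNAME=eth0
```

If the relaunched run still hangs and the NCCL INFO output stops before transport setup completes, that usually points to firewalling, routing, or NIC selection between the two nodes rather than the training code itself.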
Run log or screenshots
[INFO|modeling_utils.py:4288] 2024-07-18 14:06:17,929 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /mnt/hpfs/models/hub/LLM-Research/Meta-Llama-3-8B/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:915] 2024-07-18 14:06:17,932 >> loading configuration file /mnt/hpfs/models/hub/LLM-Research/Meta-Llama-3-8B/generation_config.json
[INFO|configuration_utils.py:962] 2024-07-18 14:06:17,932 >> Generate config GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": 128001,
"max_length": 4096,
"temperature": 0.6,
"top_p": 0.9
}
07/18/2024 14:06:18 - INFO - __main__ - Model vocab size: 128256
07/18/2024 14:06:18 - INFO - __main__ - Tokenizer vocab size: 128256
07/18/2024 14:06:18 - INFO - __main__ - Init new peft model
07/18/2024 14:06:18 - INFO - __main__ - target_modules: ['q_proj', 'v_proj']
07/18/2024 14:06:18 - INFO - __main__ - lora_rank: 16
07/18/2024 14:06:18 - INFO - __main__ - modules_to_save: ['None']
trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848
[INFO|trainer.py:641] 2024-07-18 14:06:18,182 >> Using auto half precision backend
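As a side note, the reported trainable-parameter count is self-consistent with the logged LoRA settings: rank r = 16 adapters on q_proj (4096→4096) and v_proj (4096→1024) across the 32 layers of Llama-3-8B contribute r·(d_in + d_out) parameters per projection. The dimensions here are assumed from the public Llama-3-8B config (hidden size 4096, 8 KV heads × head dim 128 → v_proj output 1024); they are not shown in the log:

$$32 \times \left[\,16 \times (4096 + 4096) + 16 \times (4096 + 1024)\,\right] = 32 \times 212992 = 6815744$$

So the PEFT setup itself looks sane, and the stall is more likely in distributed initialization than in model or adapter construction.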
github-actions commented
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
github-actions commented
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.