OptimalScale/LMFlow

Fine-Tuning Crashes for no reason when Eight GPU cards are used.

OscarC9912 opened this issue · 4 comments

Dear Developers at LMFlow:

I have been using LMFlow for a long time and the experience is great !

But recently, after cloning the latest LMFlow and using it to fine-tune my model, I encountered an unexpected issue.

Specifically, when I use all 8 of my NVIDIA A100 cards, the fine-tuning program crashes without reporting any error. However, when I use only 6 or 7 cards, everything works.

Below is the output of the program:

[2024-05-08 14:29:20,246] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,246] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,398] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,399] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,458] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,458] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,499] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,500] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,537] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,537] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,538] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-05-08 14:29:20,593] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,593] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,634] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,634] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,650] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,650] [INFO] [comm.py:616:init_distributed] cdb=None
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 5, device: cuda:5, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 6, device: cuda:6, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 4, device: cuda:4, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:22 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:22 - WARNING - lmflow.pipeline.finetuner - Process rank: 7, device: cuda:7, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:22 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
[2024-05-08 14:32:25,933] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906770
[2024-05-08 14:32:25,972] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906771
[2024-05-08 14:32:31,375] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906772
[2024-05-08 14:32:35,169] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906773
[2024-05-08 14:32:38,199] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906774
[2024-05-08 14:32:41,567] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906775
[2024-05-08 14:32:45,223] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906776
[2024-05-08 14:32:48,678] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906777

[2024-05-08 14:32:53,189] [ERROR] [launch.py:321:sigkill_handler] ['miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=7', '--model_name_or_path', 'meta-llama/Meta-Llama-3-8B', '--dataset_path', '/data', '--output_dir', '/model', '--overwrite_output_dir', '--num_train_epochs', '1', '--learning_rate', '1e-5', '--block_size', '512', '--per_device_train_batch_size', '32', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'chinese-llama3', '--validation_split_percentage', '20', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--use_flash_attention', 'True', '--dataloader_num_workers', '8'] exits with return code = -9
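
A note on reading the last log line: when a child process is started through Python's subprocess module, a negative return code -N means the child was terminated by signal N, so -9 corresponds to SIGKILL. SIGKILL cannot be caught, which would be consistent with the absence of a Python traceback. The helper below is a minimal sketch for decoding such codes; `describe_return_code` is a hypothetical name, not part of DeepSpeed or LMFlow, and the code alone does not say which process sent the signal.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable message."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with error status {code}"
    # Negative codes mean "terminated by signal -code" (POSIX only).
    try:
        name = signal.Signals(-code).name
    except ValueError:
        # Signal number unknown on this platform.
        name = f"signal {-code}"
    return f"terminated by {name}"


if __name__ == "__main__":
    # The launcher above reported "return code = -9".
    print(describe_return_code(-9))
```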

I am fairly sure I am specifying the GPU cards correctly through the DeepSpeed arguments:
deepspeed_args="--master_port=11012 --include localhost:0,1,2,3,4,5,6,7"

As I never encountered this problem with the older version, and after several rounds of trial and error, I have come here to ask for help.
I am not sure whether the problem is on the LMFlow side or on my side.

Thanks for your help ~

Thanks for your interest and recognition in LMFlow! Some of our collaborators have met a similar issue. We were using CUDA 12.0 with a PyTorch build for CUDA 12.1, and similar problems occurred. It was resolved by using a PyTorch build corresponding to an older CUDA (like 11.8).

We suspect this problem is caused by a mismatch between the latest PyTorch version and the CUDA version. You may try adjusting the PyTorch version to see whether the problem occurs again. Hope this information can be helpful 😄
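
To compare the two versions mentioned above, something like the following may help. It is only a sketch: `torch_cuda_report` is a hypothetical helper name, and it reports the CUDA version PyTorch was built against, which then has to be compared by hand with the CUDA version installed on the machine (for example, the one shown by `nvidia-smi`).

```python
import importlib


def torch_cuda_report() -> dict:
    """Collect the PyTorch version and the CUDA version it was built for."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA toolkit version of the PyTorch build (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "visible_gpus": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

Run it inside the same conda environment that launches the fine-tuning script, so that the report reflects the PyTorch build actually in use.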

Thanks for your reply !
I will try that !

Another issue is with run_all_benchmark.sh: when I run the script, it fails with the following error:

Traceback (most recent call last):
Selected Tasks: ['hellaswag', 'winogrande', 'arc_challenge', 'boolq', 'openbookqa', 'arc_easy', 'piqa']
  File "/ssddata/zchenhj/LMFlow/utils/lm_evaluator.py", line 108, in <module>
    main()
  File "/ssddata/zchenhj/LMFlow/utils/lm_evaluator.py", line 79, in main
    results = evaluator.simple_evaluate(
  File "/ssddata/zchenhj/miniconda3/envs/lmflow/lib/python3.9/site-packages/lm_eval/utils.py", line 161, in _wrapper
    return fn(*args, **kwargs)
  File "/ssddata/zchenhj/miniconda3/envs/lmflow/lib/python3.9/site-packages/lm_eval/evaluator.py", line 64, in simple_evaluate
    lm = lm_eval.models.get_model(model).create_from_arg_string(
  File "/ssddata/zchenhj/miniconda3/envs/lmflow/lib/python3.9/site-packages/lm_eval/models/__init__.py", line 16, in get_model
    return MODEL_REGISTRY[model_name]
KeyError: 'hf-causal-experimental'
[2024-05-08 16:53:38,340] [INFO] [launch.py:347:main] Process 4024922 exits successfully.

I looked into the code, and I suspect that some parts of it have not been fully implemented yet, which leads to this error?

I run the code with: bash run_all_benchmark.sh --model_name_or_path model_name
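
The KeyError in the traceback comes from a lookup in `MODEL_REGISTRY` inside `lm_eval.models`, so one way to narrow this down is to list the model names the installed lm_eval actually registers. The snippet below is a rough sketch: `registered_model_names` is a hypothetical helper, and it assumes the registry is exposed as `lm_eval.models.MODEL_REGISTRY` as in the traceback; other lm_eval versions may organize this differently, in which case it returns None.

```python
def registered_model_names():
    """Return the sorted model names known to the installed lm_eval, or None."""
    try:
        import lm_eval.models as lm_models
    except ImportError:
        # lm_eval is not installed in this environment.
        return None
    # Assumption: the registry is a dict keyed by model name, as the
    # traceback above suggests. Fall back to None if it is not there.
    registry = getattr(lm_models, "MODEL_REGISTRY", None)
    if registry is None:
        return None
    return sorted(registry)


if __name__ == "__main__":
    names = registered_model_names()
    print(names)
    if names is not None:
        print("hf-causal-experimental registered:",
              "hf-causal-experimental" in names)
```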

Thanks again for your help !

@2003pro I am wondering if you can take a look at this?

I suggest switching the lm-eval package back to version 0.4.0, like this:

git clone -b v0.0.4 https://github.com/EleutherAI/lm-evaluation-harness.git 
cd lm-evaluation-harness
pip install -e .

Also, if there are any further issues, you may check whether your transformers version is compatible. My environment uses version 4.33.3.
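
To see which versions are currently installed before changing anything, a small check along these lines may be useful. It is a sketch using the standard library's importlib.metadata; `installed_versions` is a hypothetical helper, and the distribution names "lm_eval" and "transformers" are assumptions about how the packages are registered in your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("lm_eval", "transformers")) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Not installed (or registered under a different name).
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```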