Can't train CodeLlama
xiaohangguo opened this issue · 2 comments
I tried CodeLlama and found it doesn't work, which is frustrating.
https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf
[2023-08-30 19:52:52,217] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:53,064] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-30 19:52:53,064] [INFO] [runner.py:555:main] cmd = /root/miniconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /root/autodl-fs/codellama_CodeLlama-13b-Python-hf --dataset_path data/TED_data --output_dir output_models/finetune_LLama-2-13B --overwrite_output_dir --num_train_epochs 3 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-08-30 19:52:54,885] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:55,734] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-30 19:52:55,734] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-30 19:52:55,734] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-30 19:52:55,734] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-30 19:52:55,734] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-30 19:52:58,126] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:58,225] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:58,244] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:58,258] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:53:00,545] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,545] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-30 19:53:00,545] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-30 19:53:00,573] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,573] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-30 19:53:00,576] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,576] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-30 19:53:00,613] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,613] [INFO] [comm.py:616:init_distributed] cdb=None
08/30/2023 19:53:01 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:01 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:02 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:02 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:02 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-5d6197ff7c318ede/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Traceback (most recent call last):
File "/root/LMFlow-main/examples/finetune.py", line 61, in <module>
main()
File "/root/LMFlow-main/examples/finetune.py", line 54, in main
model = AutoModel.get_model(model_args)
File "/root/LMFlow-main/src/lmflow/models/auto_model.py", line 16, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/root/LMFlow-main/src/lmflow/models/hf_decoder_model.py", line 149, in __init__
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
raise ValueError(
ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.
08/30/2023 19:53:03 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-5d6197ff7c318ede/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[2023-08-30 19:53:03,828] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2062
Traceback (most recent call last):
File "/root/LMFlow-main/examples/finetune.py", line 61, in <module>
main()
File "/root/LMFlow-main/examples/finetune.py", line 54, in main
model = AutoModel.get_model(model_args)
File "/root/LMFlow-main/src/lmflow/models/auto_model.py", line 16, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/root/LMFlow-main/src/lmflow/models/hf_decoder_model.py", line 149, in __init__
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
raise ValueError(
ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.
[2023-08-30 19:53:03,925] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2063
[2023-08-30 19:53:03,925] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2064
[2023-08-30 19:53:04,019] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2065
[2023-08-30 19:53:04,360] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=3', '--model_name_or_path', '/root/autodl-fs/codellama_CodeLlama-13b-Python-hf', '--dataset_path', 'data/TED_data', '--output_dir', 'output_models/finetune_LLama-2-13B', '--overwrite_output_dir', '--num_train_epochs', '3', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
Thanks for your interest in LMFlow! It seems the problem is caused by the transformers version. Currently the stable version of LMFlow is v0.0.5, which uses an older version of transformers. To enable all features of LMFlow, you may try the unstable main branch first, which has several minor bugs but works fine with CodeLlama (I successfully ran a chatbot with codellama/CodeLlama-7b-Instruct-hf on my server). Hope that solves your issue 😄
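As a quick sanity check (a sketch, assuming the root cause really is an outdated transformers package and that CodeLlamaTokenizer ships with newer releases, roughly 4.33.0 and later), you can try loading the tokenizer directly in the lmflow environment, outside of DeepSpeed. The model path below is the one from the log; swap in your own if it differs.

import transformers
from transformers import AutoTokenizer

# If this prints an old version, upgrading (e.g. `pip install -U transformers`)
# or switching to the LMFlow main branch should pull in CodeLlama support.
print(transformers.__version__)

# Loading the tokenizer directly reproduces the original failure cheaply,
# so a successful load confirms the environment is fixed.
tokenizer = AutoTokenizer.from_pretrained(
    "/root/autodl-fs/codellama_CodeLlama-13b-Python-hf"
)
print(type(tokenizer).__name__)  # expected: a CodeLlama tokenizer class

If the import and load succeed but the fine-tuning run still fails, the problem likely lies elsewhere in the LMFlow setup rather than in the tokenizer.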
OK, I will update and try again.