Can't train CodeLlama
xiaohangguo opened this issue · 2 comments
I tried CodeLlama and found it doesn't work, which is frustrating.
https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf
[2023-08-30 19:52:52,217] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:53,064] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-30 19:52:53,064] [INFO] [runner.py:555:main] cmd = /root/miniconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /root/autodl-fs/codellama_CodeLlama-13b-Python-hf --dataset_path data/TED_data --output_dir output_models/finetune_LLama-2-13B --overwrite_output_dir --num_train_epochs 3 --learning_rate 2e-5 --block_size 512 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-08-30 19:52:54,885] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:55,734] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-08-30 19:52:55,734] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-08-30 19:52:55,734] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-08-30 19:52:55,734] [INFO] [launch.py:163:main] dist_world_size=4
[2023-08-30 19:52:55,734] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-08-30 19:52:58,126] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:58,225] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:58,244] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:52:58,258] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-30 19:53:00,545] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,545] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-30 19:53:00,545] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-30 19:53:00,573] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,573] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-30 19:53:00,576] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,576] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-30 19:53:00,613] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-30 19:53:00,613] [INFO] [comm.py:616:init_distributed] cdb=None
08/30/2023 19:53:01 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:01 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:02 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:02 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
08/30/2023 19:53:02 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-5d6197ff7c318ede/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Traceback (most recent call last):
File "/root/LMFlow-main/examples/finetune.py", line 61, in <module>
main()
File "/root/LMFlow-main/examples/finetune.py", line 54, in main
model = AutoModel.get_model(model_args)
File "/root/LMFlow-main/src/lmflow/models/auto_model.py", line 16, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/root/LMFlow-main/src/lmflow/models/hf_decoder_model.py", line 149, in __init__
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
raise ValueError(
ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.
08/30/2023 19:53:03 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-5d6197ff7c318ede/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
[2023-08-30 19:53:03,828] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2062
Traceback (most recent call last):
File "/root/LMFlow-main/examples/finetune.py", line 61, in <module>
main()
File "/root/LMFlow-main/examples/finetune.py", line 54, in main
model = AutoModel.get_model(model_args)
File "/root/LMFlow-main/src/lmflow/models/auto_model.py", line 16, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/root/LMFlow-main/src/lmflow/models/hf_decoder_model.py", line 149, in __init__
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
File "/root/miniconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 724, in from_pretrained
raise ValueError(
ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.
[2023-08-30 19:53:03,925] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2063
[2023-08-30 19:53:03,925] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2064
[2023-08-30 19:53:04,019] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2065
[2023-08-30 19:53:04,360] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=3', '--model_name_or_path', '/root/autodl-fs/codellama_CodeLlama-13b-Python-hf', '--dataset_path', 'data/TED_data', '--output_dir', 'output_models/finetune_LLama-2-13B', '--overwrite_output_dir', '--num_train_epochs', '3', '--learning_rate', '2e-5', '--block_size', '512', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
Thanks for your interest in LMFlow! It seems the problem is caused by the transformers version. Currently the stable version of LMFlow is v0.0.5, which uses an older version of transformers. To enable all features of LMFlow, you may try the unstable main branch first, which has several minor bugs but works fine with CodeLlama (I successfully ran a chatbot with codellama/CodeLlama-7b-Instruct-hf on my server). Hope that solves your issue 😄
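As a quick sanity check (a sketch, assuming the root cause really is an outdated transformers package and that CodeLlamaTokenizer ships with newer releases, roughly 4.33.0 and later), you can try loading the tokenizer directly in the lmflow environment, outside of DeepSpeed. The model path below is the one from the log; swap in your own if it differs.

import transformers
from transformers import AutoTokenizer

# If this prints an old version, upgrading (e.g. `pip install -U transformers`)
# or switching to the LMFlow main branch should pull in CodeLlama support.
print(transformers.__version__)

# Loading the tokenizer directly reproduces the original failure cheaply,
# so a successful load confirms the environment is fixed.
tokenizer = AutoTokenizer.from_pretrained(
    "/root/autodl-fs/codellama_CodeLlama-13b-Python-hf"
)
print(type(tokenizer).__name__)  # expected: a CodeLlama tokenizer class

If the import and load succeed but the fine-tuning run still fails, the problem likely lies elsewhere in the LMFlow setup rather than in the tokenizer.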
OK, I will update and try again.