RAFT problem when trying to run "./scripts/run_raft_align.sh"
Mikivishy opened this issue · 4 comments
I can run finetuning correctly, but I don't know why I get an error like this when I try to run "./scripts/run_raft_align.sh". I cannot find a solution.
(lmflow) [sunhaoyu@pkuhd4 LMFlow]$ ./scripts/run_raft_align.sh
[2023-09-07 18:56:18,777] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:33,637] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-07 18:56:33,637] [INFO] [runner.py:555:main] cmd = /data1/sunhaoyu/miniconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNV19 --master_addr=127.0.0.1 --master_port=11110 --enable_each_rank_log=None examples/raft_align.py --model_name_or_path gpt2 --num_raft_iteration 20 --learning_rate 2e-5 --lr_scheduler_type constant --bf16 False --deepspeed configs/ds_config_zero2.json --dataset_path /data1/sunhaoyu/LMFlow/data/hh_rlhf/rlhf/rlhf_prompt --output_reward_path /data1/sunhaoyu/LMFlow/tmp/raft_aligner/reward.txt --output_dir /data1/sunhaoyu/LMFlow/output_models/raft_align --overwrite_output_dir --run_name raft_align --num_train_epochs 4 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --validation_split_percentage 0 --logging_steps 1 --do_train --ddp_timeout 72000 --save_steps 7777 --dataloader_num_workers 1 --preprocessing_num_workers 12 --inference_batch_size_per_device 1 --collection_strategy local --raft_batch_size 1024 --output_min_length 96 --top_reward_percentage 0.125
[2023-09-07 18:56:35,453] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:36,500] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5]}
[2023-09-07 18:56:36,501] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=6, node_rank=0
[2023-09-07 18:56:36,501] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5]})
[2023-09-07 18:56:36,501] [INFO] [launch.py:163:main] dist_world_size=6
[2023-09-07 18:56:36,501] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
[2023-09-07 18:56:39,098] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:39,194] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:39,194] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:39,220] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:39,232] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:39,256] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-07 18:56:42,930] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-07 18:56:42,931] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-07 18:56:42,972] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-07 18:56:42,972] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-07 18:56:42,972] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-07 18:56:43,015] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-07 18:56:43,015] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-07 18:56:43,053] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-07 18:56:43,053] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-07 18:56:43,069] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-07 18:56:43,069] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-07 18:56:43,084] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-07 18:56:43,084] [INFO] [comm.py:616:init_distributed] cdb=None
09/07/2023 18:56:49 - WARNING - datasets.builder - Found cached dataset json (/home/sunhaoyu/.cache/huggingface/datasets/json/default-e0b5db21ea4785ff/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
09/07/2023 18:56:49 - WARNING - datasets.builder - Found cached dataset json (/home/sunhaoyu/.cache/huggingface/datasets/json/default-e0b5db21ea4785ff/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
09/07/2023 18:56:49 - WARNING - datasets.builder - Found cached dataset json (/home/sunhaoyu/.cache/huggingface/datasets/json/default-e0b5db21ea4785ff/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
09/07/2023 18:56:49 - WARNING - datasets.builder - Found cached dataset json (/home/sunhaoyu/.cache/huggingface/datasets/json/default-e0b5db21ea4785ff/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
09/07/2023 18:56:49 - WARNING - datasets.builder - Found cached dataset json (/home/sunhaoyu/.cache/huggingface/datasets/json/default-e0b5db21ea4785ff/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
09/07/2023 18:56:49 - WARNING - datasets.builder - Found cached dataset json (/home/sunhaoyu/.cache/huggingface/datasets/json/default-e0b5db21ea4785ff/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Downloading (…)lve/main/config.json: 100%|███████████████████████████████████████████████| 665/665 [00:00<00:00, 82.1kB/s]
Downloading (…)olve/main/vocab.json: 100%|███████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.69MB/s]
Downloading (…)olve/main/merges.txt: 100%|█████████████████████████████████████████████| 456k/456k [00:00<00:00, 12.4MB/s]
Downloading (…)/main/tokenizer.json: 100%|███████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 1.85MB/s]
Downloading model.safetensors: 100%|███████████████████████████████████████████████████| 548M/548M [00:56<00:00, 9.71MB/s]
Downloading (…)neration_config.json: 100%|███████████████████████████████████████████████| 124/124 [00:00<00:00, 19.6kB/s]
Map: 0%| | 88/112052 [00:00<02:09, 861.36 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1268 > 1024). Running this sequence through the model will result in indexing errors
Map: 1%|▌ | 1000/112052 [00:00<01:38, 1132.48 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1268 > 1024). Running this sequence through the model will result in indexing errors
Map: 1%|▎ | 587/112052 [00:00<01:32, 1204.24 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1268 > 1024). Running this sequence through the model will result in indexing errors
Map: 0%| | 154/112052 [00:00<02:19, 803.56 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1268 > 1024). Running this sequence through the model will result in indexing errors
Map: 0%| | 177/112052 [00:00<02:02, 915.70 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1268 > 1024). Running this sequence through the model will result in indexing errors
Map: 1%|▋ | 1204/112052 [00:01<01:30, 1226.09 examples/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1268 > 1024). Running this sequence through the model will result in indexing errors
/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py:568: FutureWarning: set_caching_enabled is deprecated and will be removed in the next major version of datasets. Use datasets.enable_caching() or datasets.disable_caching() instead. This function will be removed in a future version of datasets.
set_caching_enabled(False)
170 8
/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py:568: FutureWarning: set_caching_enabled is deprecated and will be removed in the next major version of datasets. Use datasets.enable_caching() or datasets.disable_caching() instead. This function will be removed in a future version of datasets.
set_caching_enabled(False)
170 8
/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py:568: FutureWarning: set_caching_enabled is deprecated and will be removed in the next major version of datasets. Use datasets.enable_caching() or datasets.disable_caching() instead. This function will be removed in a future version of datasets.
set_caching_enabled(False)
170 8
RaftAlignerArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
collection_strategy=local,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=1,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=72000,
debug=[],
deepspeed=configs/ds_config_zero2.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
inference_batch_size_per_device=1,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=2e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/data1/sunhaoyu/LMFlow/output_models/raft_align/runs/Sep07_18-56-42_pkuhd4.localdomain,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=1.0,
logging_strategy=steps,
lr_scheduler_type=constant,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_raft_iteration=20,
num_train_epochs=4.0,
optim=adamw_torch,
optim_args=None,
output_dir=/data1/sunhaoyu/LMFlow/output_models/raft_align,
output_max_length=128,
output_min_length=96,
output_reward_path=/data1/sunhaoyu/LMFlow/tmp/raft_aligner/reward.txt,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
raft_batch_size=1024,
ray_scope=last,
remove_unused_columns=True,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=raft_align,
save_on_each_node=False,
save_safetensors=False,
save_steps=7777,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
top_reward_percentage=0.125,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
Filter: 80%|████████████████████████████████████████████████▏ | 90000/112052 [00:05<00:01, 15384.62 examples/s]Traceback (most recent call last):
File "/data1/sunhaoyu/LMFlow/examples/raft_align.py", line 155, in
main()
File "/data1/sunhaoyu/LMFlow/examples/raft_align.py", line 147, in main
aligned_model = aligner.align(
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py", line 616, in align
raft_trainer.train(resume_from_checkpoint=False, is_first_time=True)
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 1600, in train
return inner_training_loop1(
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 2021, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
TypeError: deepspeed_init() got an unexpected keyword argument 'resume_from_checkpoint'
Filter: 87%|████████████████████████████████████████████████████▍ | 98000/112052 [00:06<00:00, 15555.90 examples/s]Traceback (most recent call last):
File "/data1/sunhaoyu/LMFlow/examples/raft_align.py", line 155, in
main()
File "/data1/sunhaoyu/LMFlow/examples/raft_align.py", line 147, in main
aligned_model = aligner.align(
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py", line 616, in align
raft_trainer.train(resume_from_checkpoint=False, is_first_time=True)
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 1600, in train
return inner_training_loop1(
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 2021, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
TypeError: deepspeed_init() got an unexpected keyword argument 'resume_from_checkpoint'
/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py:568: FutureWarning: set_caching_enabled is deprecated and will be removed in the next major version of datasets. Use datasets.enable_caching() or datasets.disable_caching() instead. This function will be removed in a future version of datasets.
set_caching_enabled(False)
170 8
Filter: 82%|█████████████████████████████████████████████████▎ | 92000/112052 [00:05<00:01, 14649.02 examples/s]Traceback (most recent call last):
File "/data1/sunhaoyu/LMFlow/examples/raft_align.py", line 155, in
main()
File "/data1/sunhaoyu/LMFlow/examples/raft_align.py", line 147, in main
aligned_model = aligner.align(
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py", line 616, in align
raft_trainer.train(resume_from_checkpoint=False, is_first_time=True)
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 1600, in train
return inner_training_loop1(
File "/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 2021, in _inner_training_loop
deepspeed_engine, optimizer, lr_scheduler = deepspeed_init(
TypeError: deepspeed_init() got an unexpected keyword argument 'resume_from_checkpoint'
Filter: 98%|█████████████████████████████████████████████████████████▉ | 110000/112052 [00:07<00:00, 14557.99 examples/s][2023-09-07 19:00:08,009] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 120805
/data1/sunhaoyu/LMFlow/src/lmflow/pipeline/raft_aligner.py:568: FutureWarning: set_caching_enabled is deprecated and will be removed in the next major version of datasets. Use datasets.enable_caching() or datasets.disable_caching() instead. This function will be removed in a future version of datasets.
set_caching_enabled(False)
170 8
[2023-09-07 19:00:08,327] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 120806
Filter: 89%|████████████████████████████████████████████████████▋ | 100000/112052 [00:06<00:01, 11927.99 examples/s][2023-09-07 19:00:08,484] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 120807
Filter: 91%|█████████████████████████████████████████████████████▋ | 102000/112052 [00:06<00:00, 12164.04 examples/s][2023-09-07 19:00:08,802] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 120808
[2023-09-07 19:00:08,802] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 120809
[2023-09-07 19:00:09,161] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 120810
[2023-09-07 19:00:09,719] [ERROR] [launch.py:321:sigkill_handler] ['/data1/sunhaoyu/miniconda3/envs/lmflow/bin/python', '-u', 'examples/raft_align.py', '--local_rank=5', '--model_name_or_path', 'gpt2', '--num_raft_iteration', '20', '--learning_rate', '2e-5', '--lr_scheduler_type', 'constant', '--bf16', 'False', '--deepspeed', 'configs/ds_config_zero2.json', '--dataset_path', '/data1/sunhaoyu/LMFlow/data/hh_rlhf/rlhf/rlhf_prompt', '--output_reward_path', '/data1/sunhaoyu/LMFlow/tmp/raft_aligner/reward.txt', '--output_dir', '/data1/sunhaoyu/LMFlow/output_models/raft_align', '--overwrite_output_dir', '--run_name', 'raft_align', '--num_train_epochs', '4', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--validation_split_percentage', '0', '--logging_steps', '1', '--do_train', '--ddp_timeout', '72000', '--save_steps', '7777', '--dataloader_num_workers', '1', '--preprocessing_num_workers', '12', '--inference_batch_size_per_device', '1', '--collection_strategy', 'local', '--raft_batch_size', '1024', '--output_min_length', '96', '--top_reward_percentage', '0.125'] exits with return code = 1
Could you provide more context?
I just created a new environment and installed LMFlow as described in the README, then ran the RAFT script successfully, except that I needed to modify the dataset path.
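For example, it would help to include the exact package versions in your environment (transformers, deepspeed, datasets, torch). A quick way to print them, assuming a standard pip/conda install, is something like:

```python
# Print the versions of the packages most relevant to the RAFT pipeline.
import importlib.metadata as metadata

for pkg in ["transformers", "deepspeed", "datasets", "torch", "accelerate"]:
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```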
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed please feel free to reopen this issue. Thanks
Hello, have you solved this problem? The same issue happens on my machine.
I think it might be because we have updated many packages while developing LMFlow...
You may try out https://github.com/WeiXiongUST/LMFlow_RAFT_Dev, where we have split off a relatively stable branch for RAFT.
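If you prefer to stay on the main branch, a rough way to confirm the suspected mismatch is to check whether your installed transformers still exposes the older `deepspeed_init` signature that `raft_trainer.py` calls; newer transformers releases route DeepSpeed setup through accelerate and dropped the `resume_from_checkpoint` argument, which would produce exactly this TypeError. This is only a sketch, and the import path differs across transformers versions:

```python
# Check whether the installed transformers still accepts
# resume_from_checkpoint in deepspeed_init (older releases did;
# releases that moved DeepSpeed handling into accelerate do not).
import inspect

try:
    # Newer layout
    from transformers.integrations.deepspeed import deepspeed_init
except ImportError:
    # Older layout
    from transformers.deepspeed import deepspeed_init

sig = inspect.signature(deepspeed_init)
print("deepspeed_init", sig)
if "resume_from_checkpoint" in sig.parameters:
    print("This transformers version still matches the call in raft_trainer.py.")
else:
    print("This transformers version dropped resume_from_checkpoint; "
          "downgrade transformers to the version pinned by LMFlow "
          "or use the LMFlow_RAFT_Dev branch.")
```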