BlinkDL/RWKV-LM

Fine-tuning: IndexError: list index out of range

aolerv opened this issue · 1 comment

(rwkv5_py310) root@autodl-container-f97d11abac-813971fc:~/autodl-tmp/RWKV-LM-main/RWKV-v5# ./demo-training-run.sh
INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmpb972vb43
INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmpb972vb43/_remote_module_non_scriptable.py
INFO:pytorch_lightning.utilities.rank_zero:########## work in progress ##########
/root/miniconda3/envs/rwkv5_py310/lib/python3.10/site-packages/pydantic/_internal/_config.py:321: UserWarning: Valid config keys have changed in V2:
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'validate_all' has been renamed to 'validate_default'
  warnings.warn(message, UserWarning)
/root/miniconda3/envs/rwkv5_py310/lib/python3.10/site-packages/pydantic/_internal/fields.py:149: UserWarning: Field "model_persistence_threshold" has conflict with protected namespace "model".

You may be able to resolve this warning by setting model_config['protected_namespaces'] = ().
  warnings.warn(
/root/miniconda3/envs/rwkv5_py310/lib/python3.10/site-packages/pydantic/_internal/_config.py:321: UserWarning: Valid config keys have changed in V2:
* 'validate_all' has been renamed to 'validate_default'
  warnings.warn(message, UserWarning)
Files in model/0.1-1: ['.ipynb_checkpoints']
Traceback (most recent call last):
  File "/root/autodl-tmp/RWKV-LM-main/RWKV-v5/train.py", line 165, in <module>
    max_p = list_p[-1]
IndexError: list index out of range
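
For context, the IndexError comes from the checkpoint-resume logic: with --my_pile_stage 3, train.py lists the files in --proj_dir, collects the existing rwkv-*.pth checkpoints, and picks the newest one. A rough sketch of that logic (paraphrased, not the exact source) shows why it fails when the folder only contains '.ipynb_checkpoints':

import os

proj_dir = "model/0.1-1"  # $BASE_NAME from the script below

# Collect checkpoint indices from files named rwkv-init.pth / rwkv-<N>.pth.
list_p = []
for name in os.listdir(proj_dir):
    if name.startswith("rwkv") and name.endswith(".pth"):
        stem = name.split("-")[1].split(".")[0]
        list_p.append(-1 if stem == "init" else int(stem))
list_p.sort()

# With no rwkv-*.pth file present, list_p is empty, so this indexing
# raises "IndexError: list index out of range" (train.py line 165 above).
max_p = list_p[-1]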

#!/bin/bash

BASE_NAME="model/0.1-1"
N_LAYER="32"
N_EMBD="2560"
M_BSZ="16" # takes 16G VRAM (reduce this to save VRAM)
LR_INIT="1e-5"
LR_FINAL="1e-5"
GRAD_CP=0 # set to 1 to save VRAM (will be slower)
EPOCH_SAVE=10

# magic_prime = the largest 3n+2 prime smaller than datalen/ctxlen-1 (= 1498226207/512-1 = 2926222.06 in this case)
# use https://www.dcode.fr/prime-numbers-search
# (a small Python helper to compute this is sketched after the script)

python train.py --load_model "/root/autodl-tmp/RWKV-LM-main/RWKV-v5/rwkv-5-World-3B-v2-20231113-ctx4096.pth" --wandb "RWKV-5-Test" --proj_dir $BASE_NAME \
 --ctx_len 4096 --my_pile_stage 3 --epoch_count 999999 --epoch_begin 0 \
 --data_file "text" --my_exit_tokens 20021619 --magic_prime 4877 \
 --num_nodes 1 --micro_bsz $M_BSZ --n_layer $N_LAYER --n_embd $N_EMBD --pre_ffn 0 --head_qk 0 \
 --lr_init $LR_INIT --lr_final $LR_FINAL --warmup_steps 10 --beta1 0.9 --beta2 0.99 --adam_eps 1e-8 --my_pile_edecay 0 --data_type "binidx" --vocab_size 65536 \
 --weight_decay 0.001 --epoch_save $EPOCH_SAVE --head_size_a 64 \
 --accelerator gpu --devices 1 --precision bf16 --strategy deepspeed_stage_2 --grad_cp $GRAD_CP --enable_progress_bar True --ds_bucket_mb 200
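
If you prefer not to use the external prime-search site, the magic_prime comment above can be computed with a small self-contained Python helper (a sketch using plain trial division; the 1498226207/512 numbers are just the example from the comment, not necessarily this run's data):

def is_prime(n: int) -> bool:
    """Trial-division primality test; fast enough for numbers of this size."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def magic_prime(datalen: int, ctxlen: int) -> int:
    """Largest prime p with p % 3 == 2 and p below datalen/ctxlen - 1."""
    # Largest integer candidate below the threshold (assuming the ratio is not an exact integer).
    limit = datalen // ctxlen - 1
    p = limit if limit % 2 else limit - 1  # step down over odd candidates
    while p > 2:
        if p % 3 == 2 and is_prime(p):
            return p
        p -= 2
    return 2

print(magic_prime(1498226207, 512))  # example numbers from the comment above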

The environment was installed following this:
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install pytorch-lightning==1.9.5 deepspeed==0.7.0 wandb ninja

Rename rwkv-5-World-3B-v2-20231113-ctx4096.pth to rwkv-init.pth and put it in $BASE_NAME.
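
For example, one way to put the starting checkpoint in place from Python before launching the script (paths taken from this issue; adjust to your own layout):

import os
import shutil

base_name = "model/0.1-1"  # must match BASE_NAME in demo-training-run.sh
src = "/root/autodl-tmp/RWKV-LM-main/RWKV-v5/rwkv-5-World-3B-v2-20231113-ctx4096.pth"

os.makedirs(base_name, exist_ok=True)
# Copy the base model to $BASE_NAME/rwkv-init.pth so train.py finds a checkpoint to start from.
shutil.copy(src, os.path.join(base_name, "rwkv-init.pth"))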