How to Properly Resume Training w/ Respect to Schedulers?
Not sure if it's an error, but I'm having an issue where resuming model training with the --resume option seems to load everything up successfully, yet the training steps are reset back to 0 instead of picking up from the last epoch it was trained on.
Ex: If I stop training after 2/10 epochs, i.e. at the second saved state, and then resume training, it loads the state folder successfully but starts back at 0/10 epochs, which I assume would result in 12 epochs being trained instead of 10(?)
I would assume the fix in this example would be to decrease the max epochs so that it's 8 instead of 10, resulting in 10 total epochs being trained instead of 12, but it seems like this would completely mess up the schedulers and timesteps. Is this intentional, or is there some other option I need to include, or an existing option I need to modify, alongside --resume?
Also, for reference, I'm finetuning Flux.
I ran into the same problem when doing full-model training for SDXL: resuming training worked when training a LoRA, but the full-model run was reset.
The displayed epoch will be reset to 0, but the scheduler will restore its state correctly. You can check this with TensorBoard, etc.
The reason the epochs are not restored is that accelerate's state save does not record the number of epochs. The LoRA training script has been extended to record the epoch count on its own, but the other scripts do not support this.
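Roughly, the idea looks like the sketch below. This is not the actual code from this repository, just an illustration of recording the epoch in a small JSON sidecar next to accelerate's checkpoint, named after the train_state.json that shows up in the logs later in this thread:
# Rough sketch (not the repository's actual code) of saving the epoch/step
# counters alongside accelerate's own state, since accelerate does not
# record them itself.
import json
import os

def save_training_state(accelerator, output_dir, epoch, global_step):
    accelerator.save_state(output_dir)  # model / optimizer / lr scheduler / sampler / RNG states
    with open(os.path.join(output_dir, "train_state.json"), "w") as f:
        json.dump({"current_epoch": epoch, "current_step": global_step}, f)

def load_training_state(accelerator, state_dir):
    accelerator.load_state(state_dir)  # restores everything accelerate saved
    path = os.path.join(state_dir, "train_state.json")
    if not os.path.exists(path):  # state folders saved without the sidecar
        return 0, 0
    with open(path) as f:
        state = json.load(f)
    return state["current_epoch"], state["current_step"]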
In this case what is the optimal method for resuming training then?
Should --max_train_epochs remain the same, with training simply ended early at the point where it would otherwise have finished had it not been interrupted, or should --max_train_epochs be reduced by the number of epochs that have already been trained? Or something else entirely?
or should --max_train_epochs be reduced by the number of epochs that have already been trained?
This way you can train for the number of epochs you originally specified.
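To put that concretely with the 2-of-10-epochs example from the first post (everything in angle brackets is a placeholder; the rest of the original command stays unchanged):
accelerate launch <training script and its original arguments> --resume <state dir> --max_train_epochs 8
That gives 2 epochs already trained plus 8 after resuming, i.e. the 10 epochs originally intended.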
I've been training a LoRA for Flux using this command:
accelerate launch --mixed_precision bf16 --num_processes 1 --num_cpu_threads_per_process 1 flux_train_network.py --logging_dir logs --log_prefix=test --pretrained_model_name_or_path models/unet/flux1-dev.sft --clip_l models/clip/clip_l.safetensors --t5xxl models/clip/t5xxl_fp16.safetensors --ae models/vae/ae.sft --cache_latents --cache_latents_to_disk --sample_prompts prompt.txt --sample_every_n_epochs 10 --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 32 --network_alpha 32 --learning_rate 1e-4 --network_train_unet_only --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 400 --save_every_n_epochs 10 --save_state --dataset_config dataset.toml --output_dir training-output/ --output_name test-flux --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 --loss_type l2 --optimizer_type adafactor --optimizer_args relative_step=False scale_parameter=False warmup_init=False --lr_scheduler constant_with_warmup --max_grad_norm 0.0
I am running the latest commit ce14447 on the sd3 branch (latest as of yesterday).
Resuming training with --resume <state dir> shows the epoch as 11 instead of the epoch from the state file incremented by one.
For example, if I resume from epoch 120, it should show epoch 121 when resuming training, but it shows it as 11, and the number of steps is also back to the total number of steps left at epoch 11 (28,860 instead of 20,720).
This only happens when resuming more than once; the first time, it works properly with the correct epoch (e.g. 111) and step reduction (e.g. 29600 -> 21460).
Here are the relevant log lines:
INFO load train state from training-output/test-flux-000110-state/train_state.json: {'current_epoch': 120, 'current_step': 740}
INFO All model weights loaded successfully
INFO All optimizer states loaded successfully
INFO All scheduler states loaded successfully
INFO All dataloader sampler states loaded successfully
INFO All random states loaded successfully
INFO Loading in 0 custom states
INFO epoch is incremented. current_epoch: 0, epoch: 11
INFO epoch is incremented. current_epoch: 0, epoch: 11
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 74
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 74
num epochs / epoch数: 400
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 29600
steps: 0%| | 15/28860 [01:52<60:01:50, 7.49s/it, avr_loss=0.373]^C
Here it is working properly if we haven't resumed before:
INFO load train state from training-output/test-flux-000110-state/train_state.json: {'current_epoch': 110, 'current_step': 8140}
INFO epoch is incremented. current_epoch: 0, epoch: 111
steps: 0%|▍ | 74/21460 [06:14<30:03:48, 5.06s/it, avr_loss=0.326]
Maybe this happens because it tries to resume from the number of steps alone instead of epoch, and the steps get reset when we have resumed training once already.
Is this expected behaviour or a bug?
This seems to require additional investigation... Please wait a moment.
Hey @EQuioMaX. I just ran into the same problem. Do you have any idea what the correct way to resume more than once is?
@SadaleNet I am not very familiar with what happens internally, but I manually edited train_state.json in the latest state folder with the correct number of steps, and it seems to be continuing training from the correct epoch. I have yet to see whether there are any differences in behaviour compared to an uninterrupted run, though; I'll test the finished LoRA later.
-{"current_epoch": 130, "current_step": 1100}
+{"current_epoch": 130, "current_step": 7150}
@EQuioMaX That's what I was considering doing as well. But just like you, I'm not sure whether it'd work correctly. And I can't quite read ML code, so I haven't bothered reading the source code for now.
Just in case someone's wondering, I empirically found that LoRA training apparently isn't deterministic: running the same LoRA training command twice generates two different safetensors files. Therefore, it isn't possible to test whether @EQuioMaX's method works by checking if the output files are identical.
Let's see if the author of this repo has anything to say about that.