kohya-ss/sd-scripts

Cannot resume finetuning a flux checkpoint - the resume option cannot find the input path, although the path is correct ###kohya_ss GUI release v24.2.0###

Closed this issue · 12 comments

Please fix - I cannot solve this issue running under Windows 11 with v24.2.0.

(venv) D:\flux_train\kohya_ss\venv\Scripts>accelerate launch --dynamo_backend no --gpu_ids 0 --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 D:/flux_train/kohya_ss/sd-scripts/flux_train.py --config_file D:/Bilder/Project_AI/Train/model/config_dreambooth-20241011-001320.toml --resume D:/Bilder/Project_AI/Train/model/LostPlace-Inside-Flux-000120.safetensors
D:\flux_train\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-10-12 23:57:52.785206: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-10-12 23:57:53.546950: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
D:\flux_train\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
D:\flux_train\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-10-12 23:57:55 INFO Loading settings from D:/Bilder/Project_AI/Train/model/config_dreambooth-20241011-001320.toml... train_util.py:4328
INFO D:/Bilder/Project_AI/Train/model/config_dreambooth-20241011-001320 train_util.py:4347
2024-10-12 23:57:55 INFO Using DreamBooth method. flux_train.py:103
INFO prepare images. train_util.py:1872
INFO get image size from name of cache files train_util.py:1810
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:00<00:00, 899.24it/s]
INFO set image size from cache files: 179/179 train_util.py:1817
INFO found directory D:\Bilder\Project_AI\Train\datasets\LostPlace-Inside_2024\1_Lostplace contains 179 image files train_util.py:1819
INFO 179 train images with repeating. train_util.py:1913
INFO 0 reg images. train_util.py:1916
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1921
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True

                           [Subset 0 of Dataset 0]
                             image_dir: "D:\Bilder\Project_AI\Train\datasets\LostPlace-Inside_2024\1_Lostplace"
                             image_count: 179
                             num_repeats: 1
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             caption_separator: ,
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             alpha_mask: False,
                             is_reg: False
                             class_tokens: Lostplace
                             caption_extension: .txt


                INFO     [Dataset 0]                                                                                                                                                       config_util.py:576
                INFO     loading image sizes.                                                                                                                                               train_util.py:909

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:00<?, ?it/s]
INFO make buckets train_util.py:915
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / train_util.py:932
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:961
INFO bucket 0: resolution (832, 1152), count: 1 train_util.py:966
INFO bucket 1: resolution (832, 1216), count: 9 train_util.py:966
INFO bucket 2: resolution (1024, 896), count: 2 train_util.py:966
INFO bucket 3: resolution (1088, 832), count: 1 train_util.py:966
INFO bucket 4: resolution (1088, 896), count: 1 train_util.py:966
INFO bucket 5: resolution (1152, 832), count: 3 train_util.py:966
INFO bucket 6: resolution (1152, 896), count: 1 train_util.py:966
INFO bucket 7: resolution (1216, 704), count: 1 train_util.py:966
INFO bucket 8: resolution (1216, 768), count: 5 train_util.py:966
INFO bucket 9: resolution (1216, 832), count: 146 train_util.py:966
INFO bucket 10: resolution (1280, 704), count: 1 train_util.py:966
INFO bucket 11: resolution (1280, 768), count: 3 train_util.py:966
INFO bucket 12: resolution (1344, 704), count: 2 train_util.py:966
INFO bucket 13: resolution (1408, 704), count: 1 train_util.py:966
INFO bucket 14: resolution (1472, 704), count: 2 train_util.py:966
INFO mean ar error (without repeats): 0.035069617602451975 train_util.py:971
INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:48
INFO prepare accelerator flux_train.py:173
accelerator device: cuda
INFO Building AutoEncoder flux_utils.py:100
INFO Loading state dict from D:/Forge/webui/models/VAE/ae.safetensors flux_utils.py:105
INFO Loaded AE: flux_utils.py:108
INFO [Dataset 0] train_util.py:2396
INFO caching latents with caching strategy. train_util.py:1017
INFO checking cache validity... train_util.py:1044
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:00<00:00, 8137.13it/s]
INFO no latents to cache train_util.py:1087
D:\flux_train\kohya_ss\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
2024-10-12 23:57:56 INFO Building CLIP flux_utils.py:113
INFO Loading state dict from D:/Forge/webui/models/text_encoder/clip_l.safetensors flux_utils.py:206
INFO Loaded CLIP: flux_utils.py:209
INFO Loading state dict from D:/Forge/webui/models/text_encoder/t5xxl_fp16.safetensors flux_utils.py:254
INFO Loaded T5xxl: flux_utils.py:257
2024-10-12 23:58:01 INFO [Dataset 0] train_util.py:2417
INFO caching Text Encoder outputs with caching strategy. train_util.py:1179
INFO checking cache validity... train_util.py:1185
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 179/179 [00:00<00:00, 4324.55it/s]
INFO no Text Encoder outputs to cache train_util.py:1207
INFO cache Text Encoder outputs for sample prompt: D:/Bilder/Project_AI/Train/model\sample/prompt.txt flux_train.py:236
INFO cache Text Encoder outputs for prompt: lostplace-inside of an abandoned and destroyed cinema, italian style, fire spots, overgrown from the outside, morning Sun flux_train.py:246
shines in through the broken roof, amazing and complex rokokko architecture, broken chairs a piled, very messy old cinema, dark colours and atmosphere
2024-10-12 23:58:02 INFO cache Text Encoder outputs for prompt: flux_train.py:246
INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:48
INFO Building Flux model dev from BFL checkpoint flux_utils.py:74
INFO Loading state dict from D:/Forge/webui/models/Stable-diffusion/flux1-dev.safetensors flux_utils.py:81
INFO Loaded Flux: flux_utils.py:93
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO enable block swap: blocks_to_swap=10 flux_train.py:291
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
INFO use Adafactor optimizer | {'scale_parameter': False, 'relative_step': False, 'warmup_init': False, 'weight_decay': 0.01} train_util.py:4641
WARNING because max_grad_norm is set, clip_grad_norm is enabled. consider set to 0 / train_util.py:4669
max_grad_normが設定されているためclip_grad_normが有効になります。0に設定して無効にしたほうがいいかもしれません
WARNING constant_with_warmup will be good / スケジューラはconstant_with_warmupが良いかもしれません train_util.py:4673
enable full bf16 training.
INFO resume training from local state: D:/Bilder/Project_AI/Train/model/LostPlace-Inside-Flux-000120.safetensors train_util.py:4362
Traceback (most recent call last):
File "D:\flux_train\kohya_ss\sd-scripts\flux_train.py", line 994, in <module>
train(args)
File "D:\flux_train\kohya_ss\sd-scripts\flux_train.py", line 461, in train
train_util.resume_from_local_or_hf_if_specified(accelerator, args)
File "D:\flux_train\kohya_ss\sd-scripts\library\train_util.py", line 4363, in resume_from_local_or_hf_if_specified
accelerator.load_state(args.resume)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 3072, in load_state
raise ValueError(f"Tried to find {input_dir} but folder does not exist")
ValueError: Tried to find D:/Bilder/Project_AI/Train/model/LostPlace-Inside-Flux-000120.safetensors but folder does not exist
Traceback (most recent call last):
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\flux_train\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\flux_train\kohya_ss\venv\Scripts\python.exe', 'D:/flux_train/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'D:/Bilder/Project_AI/Train/model/config_dreambooth-20241011-001320.toml', '--resume', 'D:/Bilder/Project_AI/Train/model/LostPlace-Inside-Flux-000120.safetensors']' returned non-zero exit status 1.

(venv) D:\flux_train\kohya_ss\venv\Scripts>

When resuming, specify the state directory for the --resume option, not the safetensors file.
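As a minimal sketch of the distinction the maintainer is pointing out: `accelerator.load_state()` expects a directory, so the `--resume` value can be checked up front instead of failing deep inside Accelerate. The helper below is illustrative only and not part of sd-scripts.

```python
import os


def check_resume_path(resume_path: str) -> str:
    """Validate a --resume argument before it reaches accelerator.load_state().

    load_state() expects the state *directory* written by --save_state
    (e.g. "...-000120-state"), not a .safetensors checkpoint file.
    """
    if resume_path.endswith(".safetensors"):
        raise ValueError(
            f"{resume_path} is a model checkpoint; pass the *-state directory instead"
        )
    if not os.path.isdir(resume_path):
        # Mirrors Accelerate's own "Tried to find ... but folder does not exist" error
        raise ValueError(f"Tried to find {resume_path} but folder does not exist")
    return resume_path
```

With this check, the original command line would fail immediately with a clear message, since it passed `LostPlace-Inside-Flux-000120.safetensors` rather than a state directory.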

Thanks @kohya-ss - but this leads to the following result:

D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\checkpointing.py:212: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(input_model_file, map_location=map_location)
Traceback (most recent call last):
File "D:\flux_train\kohya_ss\sd-scripts\flux_train.py", line 994, in <module>
train(args)
File "D:\flux_train\kohya_ss\sd-scripts\flux_train.py", line 461, in train
train_util.resume_from_local_or_hf_if_specified(accelerator, args)
File "D:\flux_train\kohya_ss\sd-scripts\library\train_util.py", line 4363, in resume_from_local_or_hf_if_specified
accelerator.load_state(args.resume)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 3145, in load_state
override_attributes = load_accelerator_state(
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\checkpointing.py", line 212, in load_accelerator_state
state_dict = torch.load(input_model_file, map_location=map_location)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 1319, in load
with _open_file_like(f, "rb") as opened_file:
File "D:\flux_train\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 659, in _open_file_like
return _open_file(name_or_buffer, mode)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\torch\serialization.py", line 640, in __init__
super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'D:\Bilder\Project_AI\Train\model\pytorch_model.bin'
Traceback (most recent call last):
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\flux_train\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "D:\flux_train\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\flux_train\kohya_ss\venv\Scripts\python.exe', 'D:/flux_train/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'D:/Bilder/Project_AI/Train/model/config_dreambooth-20241011-001320.toml', '--resume', 'D:/Bilder/Project_AI/Train/model']' returned non-zero exit status 1.

If you specify the --save_state option, a directory named *-state should be saved. Please specify that directory for the --resume option.

I retrained from scratch with "Save training state" checked, but it does not seem to save any training state to any *-state directory, hmm...


Hmm, the directory (or directories) with a name ending in -state should be created in the model save directory...

Hmm, after every trained epoch? Or after every "Save every N epochs" interval?

It is saved at the same time as the safetensors file.

Let me wait - there are still 6 epochs left to train; after that I will see the desired result...

Great, it works - the -state directory was created. I think this issue is not a real issue and can be closed. Thank you @kohya-ss for the perfect support.

Best Bolli

No luck - today I resumed a training from the last saved state, but I think training started again from 0?


This is a limitation of Accelerate: the step number is not recorded in the state. Training nevertheless continues correctly. However, this has been fixed in the latest Accelerate, so I would like to apply the update soon.

Thank you - so the training will resume from the saved state correctly? The epoch counter starting at 0 is simply to be ignored, and I only need to train (total epochs) minus (epochs completed up to the saved state)?