TheLastBen/PPS

SDXL training at paperspace is erroring out.

My SDXL training on Paperspace fails with the exception below. I was able to train the model once, but every run since then errors out. I have changed the VM, but it made no difference.
https://blog.paperspace.com/training-a-lora-model-for-stable-diffusion-xl-with-paperspace/

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/diffusers/models/modeling_utils.py", line 109, in load_state_dict
return safetensors.torch.load_file(checkpoint_file, device="cpu")
File "/usr/local/lib/python3.9/dist-packages/safetensors/torch.py", line 100, in load_file
result[k] = f.get_tensor(k)
RuntimeError: shape '[10240, 1280]' is invalid for input of size 8767542

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/diffusers/models/modeling_utils.py", line 113, in load_state_dict
if f.read().startswith("version"):
File "/usr/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 223007: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/notebooks/diffusers/examples/dreambooth/train_dreambooth_sdxl_lora.py", line 941, in <module>
main()
File "/notebooks/diffusers/examples/dreambooth/train_dreambooth_sdxl_lora.py", line 687, in main
unet = UNet2DConditionModel.from_pretrained(
File "/usr/local/lib/python3.9/dist-packages/diffusers/models/modeling_utils.py", line 599, in from_pretrained
state_dict = load_state_dict(model_file, variant=variant)
File "/usr/local/lib/python3.9/dist-packages/diffusers/models/modeling_utils.py", line 125, in load_state_dict
raise OSError(
OSError: Unable to load weights from checkpoint file for '/notebooks/stable-diffusion-XL/unet/diffusion_pytorch_model.safetensors' at '/notebooks/stable-diffusion-XL/unet/diffusion_pytorch_model.safetensors'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 43, in main
args.func(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 837, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 354, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.9', '/notebooks/diffusers/examples/dreambooth/train_dreambooth_sdxl_lora.py', '--saves=[30,60]', '--dim=64', '--ofstnselvl=0', '--image_captions_filename', '--Session_dir=/notebooks/Fast-Dreambooth/Sessions/Myras-Session', '--pretrained_model_name_or_path=/notebooks/stable-diffusion-XL', '--instance_data_dir=/notebooks/Fast-Dreambooth/Sessions/Myras-Session/instance_images', '--output_dir=/notebooks/models/Myras-Session', '--captions_dir=/notebooks/Fast-Dreambooth/Sessions/Myras-Session/captions', '--seed=361935', '--resolution=1024', '--mixed_precision=fp16', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=1e-6', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--num_train_epochs=120']' returned non-zero exit status 1.

In a new cell, run !rm -r /notebooks/stable-diffusion-XL to delete the downloaded model, then re-run the model download cell and start training again.
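The first error in the traceback (shape [10240, 1280] needs 13,107,200 elements, but only 8,767,542 were available) is consistent with a truncated or corrupted diffusion_pytorch_model.safetensors, which would explain why deleting and re-downloading helps. As a rough way to confirm this before retraining, the sketch below compares the file's actual size with the size implied by its safetensors header (an 8-byte little-endian header length, a JSON header, then the tensor data). The function names are made up for illustration and are not part of diffusers or this notebook, and a matching size does not prove the contents are intact.

```python
import json
import os
import struct


def safetensors_size_check(path):
    """Return (expected_bytes, actual_bytes) for a .safetensors file.

    Hypothetical helper: reads only the header, never the tensor data.
    """
    actual = os.path.getsize(path)
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            raise ValueError("file too short to contain a safetensors header")
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", prefix)
        if 8 + header_len > actual:
            raise ValueError("header length exceeds file size")
        header = json.loads(f.read(header_len))
    # Each tensor entry stores [begin, end) byte offsets into the data section;
    # the optional "__metadata__" entry has no offsets and is skipped.
    data_end = max(
        (entry["data_offsets"][1]
         for name, entry in header.items()
         if name != "__metadata__"),
        default=0,
    )
    return 8 + header_len + data_end, actual


def is_complete(path):
    """True when the file size matches what the header says it should be."""
    expected, actual = safetensors_size_check(path)
    return expected == actual


# Example call in a notebook cell (path taken from the traceback above):
# is_complete("/notebooks/stable-diffusion-XL/unet/diffusion_pytorch_model.safetensors")
```

If is_complete returns False or the helper raises, the download was probably interrupted, and removing the folder and downloading the model again is the simplest fix.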