What the hell is wrong with the repo? Getting all weird images, no matter the step count.
ZeroCool22 opened this issue · 2 comments
Describe the bug
I installed everything following this guide: https://pastebin.com/uE1WcSxD
The only steps I did differently were:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
sudo apt-get install cuda=11.8.0-1
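As a quick sanity check for that setup (not part of the original guide, just a generic verification inside the activated conda env), you can confirm that the installed PyTorch build reports CUDA 11.8 and actually sees the GPU:
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi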
Everything seems to work fine while training, but when I create images with AUTOMATIC1111's web UI (I tried InvokeAI too, same issue) all the images look weird.
In the image below, you can see the training data used and the resulting images generated with AUTOMATIC1111's web UI:
And this is not a matter of step count; I tried different step counts (1000-4500) and always get the same weird images.
I used this script to do the conversion to ckpt: https://pastebin.com/ct6mTzAA
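For context, a conversion like the one in that pastebin typically goes through the stock conversion script shipped in the diffusers repo; a rough sketch, run from the repo root with placeholder paths (the exact script in the pastebin may differ), looks like:
python scripts/convert_diffusers_to_original_stable_diffusion.py \
  --model_path /path/to/dreambooth/output \
  --checkpoint_path /path/to/output/model.ckpt \
  --half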
But as I said before, it's not a conversion script problem, because I used the model in Diffusers format in InvokeAI and got the same weird results.
If someone could tell me what is wrong with the repo, that would be great.
Reproduction
Training process console:
(diffusers) zerocool@DESKTOP-MMG43AJ:~/github/diffusers/examples/dreambooth$ ./my_training.sh
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
warnings.warn(
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /home/zerocool/anaconda3/envs/diffusers did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /usr/lib/wsl/lib: did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('runwayml/stable-diffusion-v1-5')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 118
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/configuration_utils.py:203: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Caching latents: 100%|██████████████████████████████████████████████████████████████████| 35/35 [00:05<00:00, 6.10it/s]
04/19/2023 19:46:10 - INFO - __main__ - ***** Running training *****
04/19/2023 19:46:10 - INFO - __main__ - Num examples = 35
04/19/2023 19:46:10 - INFO - __main__ - Num batches each epoch = 35
04/19/2023 19:46:10 - INFO - __main__ - Num Epochs = 58
04/19/2023 19:46:10 - INFO - __main__ - Instantaneous batch size per device = 1
04/19/2023 19:46:10 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
04/19/2023 19:46:10 - INFO - __main__ - Gradient Accumulation steps = 1
04/19/2023 19:46:10 - INFO - __main__ - Total optimization steps = 2000
Steps: 1%|▏ | 19/2000 [00:33<45:12, 1.37s/it, loss=0.167, lr=1e-6]
Logs
No response
System Info
- diffusers version: 0.15.0.dev0
- Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.9.16
- PyTorch version (GPU?): 2.0.0+cu118 (True)
- Huggingface_hub version: 0.13.4
- Transformers version: 4.28.1
- Accelerate version: 0.18.0
- xFormers version: 0.0.18
- Using GPU in script?: 1080 TI
- Using distributed or parallel set-up in script?:
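For reference, this environment block is the kind of report the diffusers CLI produces, so it can be regenerated inside the training env with:
diffusers-cli env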
Same here, but only when using 8-bit Adam. Without it, it works perfectly fine. It seems like 8-bit Adam causes the model to overfit very quickly. I've been trying to find a workaround for days but haven't gotten anywhere, so now I'm just playing around with steps and learning rate until something works.
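For anyone hitting the same thing, the workaround is simply leaving --use_8bit_adam out of the launch command. A rough sketch of such a run, with the instance data path, output path, and instance prompt as placeholders and the remaining values taken from the log above:
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="/path/to/instance/images" \
  --output_dir="/path/to/output" \
  --instance_prompt="a photo of sks person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2000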
Check your bitsandbytes version; there is a good chance it is above 0.35.0. If so, downgrade to 0.35.0 and train again. There have been known issues with bitsandbytes versions above 0.35.0 since late 2022 when using AdamW8bit and other 8-bit optimizers.
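In practice that downgrade is just a version pin inside the training env, after which you can rerun the same training command:
pip show bitsandbytes
pip install bitsandbytes==0.35.0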