peterwilli/sd-leap-booster

Stable diffusion 1.4 model doesn't work

Opened this issue · 24 comments

The following error occurs when using a model derived from Stable Diffusion 1.4.

output_feat torch.Size([1, 64, 4, 4])
Conv length: 1024
Traceback (most recent call last):
  File "/venv/bin/leap_textual_inversion", line 780, in <module>
    main()
  File "/venv/bin/leap_textual_inversion", line 540, in main
    token_embeds[placeholder_token_id] = boosted_embed
RuntimeError: The expanded size of the tensor (768) must match the existing size (1024) at non-singleton dimension 0.  Target sizes: [768].  Tensor sizes: [1024]

This is my first time using this repo, and I get the same error when pointing it at runwayml/stable-diffusion-v1-5 with --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5.

My command:
python leap_textual_inversion --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 --placeholder_token="<woman>" --train_data_dir="C:\mytrainingpics" --learning_rate=0.005 --save_steps 25 --learnable_property "object" --repeats 100 --resolution 512 --train_batch_size 16 --max_train_steps 1000 --gradient_accumulation_steps 1 --learning_rate 0.005 --lr_scheduler "constant" --lr_warmup_steps 5 --enable_xformers_memory_efficient_attention

Then I tried pointing it to the sd-v1-5.ckpt in my Automatic1111 installation with --pretrained_model_name_or_path="C:\Stuff\AI\Stable Diffusion\models\Stable-diffusion\sd-v1-5.ckpt" and it gives this error:

01/10/2023 12:50:22 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu
Mixed precision type: no

Traceback (most recent call last):
  File "C:\Stuff\AI\sd-leap-booster-main\bin\leap_textual_inversion", line 780, in <module>
    main()
  File "C:\Stuff\AI\sd-leap-booster-main\bin\leap_textual_inversion", line 512, in main
    tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")
  File "C:\Users\Zyin\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 1699, in from_pretrained
    raise ValueError(
ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
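
For reference: CLIPTokenizer.from_pretrained expects a Hub model id or a diffusers-format directory, not a single .ckpt file, so an A1111-style checkpoint would first need to be converted (diffusers ships a convert_original_stable_diffusion_to_diffusers.py script for that). A minimal sketch of the forms that should load, with the local path being just an example:

from transformers import CLIPTokenizer

# Works: Hub identifier plus the "tokenizer" subfolder of the diffusers layout.
tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"
)

# Also works: a local directory in diffusers format (example path), e.g. one
# produced by diffusers' convert_original_stable_diffusion_to_diffusers.py.
# tokenizer = CLIPTokenizer.from_pretrained(
#     r"C:\models\sd-v1-5-diffusers", subfolder="tokenizer"
# )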

I got a bit further on SD 1.5 with this command, which downloads everything again. I couldn't get the paths to work when pointing at an already-downloaded .ckpt or CLIP model. I created a "data" folder for images and used relative paths:

python leap_textual_inversion --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 --placeholder_token="<luke>" --train_data_dir=data --learning_rate=0.001

The error was then:

RuntimeError: The expanded size of the tensor (768) must match the existing size (1024) at non-singleton dimension 0. Target sizes: [768]. Tensor sizes: [1024]

SD 1.5 uses 768-dimensional token embeddings, so this error makes sense. I tried adding --tokenizer_name=openai/clip-vit-base-patch16, --tokenizer_name=openai/clip-vit-base-patch32, and --tokenizer_name=openai/clip-vit-large-patch14, but the error was the same.
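
The tokenizer isn't the culprit here: the 768-vs-1024 mismatch comes from the text encoder's embedding width (768 for the SD 1.x CLIP text encoder, 1024 for SD 2.x), which is why swapping --tokenizer_name changes nothing. A quick way to check, assuming a diffusers-format model:

from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder"
)
# Width of the token embeddings: 768 for SD 1.4/1.5, 1024 for SD 2.x,
# which matches the 1024-sized tensor the traceback shows being assigned.
print(text_encoder.get_input_embeddings().weight.shape[1])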


Related info: I'm on Windows, and SD 2.1 training appears to work. In a command prompt window, it runs with:

python leap_textual_inversion --pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1-base --placeholder_token="<luke>" --train_data_dir=data --learning_rate=0.001 --max_train_steps 1

It downloads files, starts training, and produces log files in /text-inversion-model. After training it downloads ~7GB more files, then finally does "saving embedding", and the embedding appears in the output folder.

I'm not entirely sure what to do with learned_embeds.bin though; it's ~200 MB, much larger than a normal textual inversion embedding. Hope this helps someone anyway!
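
If you want to see what's actually inside that ~200 MB file, a quick peek with torch should show whether it's a single embedding or a whole bundle of tensors (assuming it's a torch-saved dict like normal textual-inversion output):

import torch

state = torch.load("learned_embeds.bin", map_location="cpu")
for key, value in state.items():
    # Print each entry's name, shape and dtype so the file's contents are visible.
    if hasattr(value, "shape"):
        print(key, tuple(value.shape), value.dtype)
    else:
        print(key, type(value))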

Thanks for sharing your results; I've run into the same pattern of errors. I also tried running SD 2.1 training on Windows and got it working, although it appears to be using the CPU? Training was exceptionally slow.

Hey everyone, I've gotten a lot of questions about SD 1.5 and, ironically, that's where I started. The reason I didn't share that model is that it was old and sub-par compared to the current one.

Today I will release the training code as well as the dataset. You can then either train it yourself, or, if I get excited enough to train my own, use that one instead!

Hey @ddPn08 @Zyin055 @Luke2642 and @Grokstreet, I just finished cleaning up the training code and adding docs for it! It's currently in a separate branch (https://github.com/peterwilli/sd-leap-booster/blob/training/training/README.md), because the models it produces are incompatible with the currently released weights.

I tagged you all so you can play with it already; it'll likely be in the main branch tomorrow once I've retrained my model. Consider it alpha state, but it's certainly slick and solid!

If any of you do manage to produce a good 1.5/1.4 model, please let me know so we can add it to this repo!

@peterwilli if I'm reading this correctly the trainer isn't set up for checkpoint resuming, correct?

Would that be possible? I'd love to work on a 1.5 model, but with my GPU resources, I doubt I could run the process end to end (unless it's much faster than I'm imagining).

Hey lovely @AI-Casanova, we have just gotten major support from an unexpected corner, so we have some major firepower now. If you want, I can run the training for you in less than a few hours. I can't guarantee what the results will be, but it'll be fast to find out.

People from Waifu Diffusion helped me expand the database; it is now much bigger. I think I can trick Stable Diffusion into using that database with its own embeds, and train on that instead. We also have LEAP 2.0 now, which uses a new way to process images. While experimental, I got some really good results, sometimes with just 50 steps.

@peterwilli that would be phenomenal!

I know you are looking into LoRA in the future, and I had a somewhat related idea: injecting the raw output of the LEAP booster directly into a text encoder for further processing with DreamBooth (TE training enabled).
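
A rough sketch of what I mean, mirroring the token_embeds[placeholder_token_id] = boosted_embed assignment from the traceback above; boosted_embed is a stand-in for the LEAP booster's output and has to match the text encoder's embedding width (768 for SD 1.x):

import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Register the placeholder token and grow the embedding table to fit it.
placeholder_token = "<luke>"
tokenizer.add_tokens([placeholder_token])
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)

# Stand-in for the booster's output; must be 768-dimensional for SD 1.x.
boosted_embed = torch.randn(768)
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = boosted_embed
# ...then hand tokenizer/text_encoder to a DreamBooth run with TE training enabled.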

@AI-Casanova I have a really interesting idea with LoRA actually; I think we can already begin testing with half the work, as a recent update has enabled us to. Going to live stream testing it out in a few minutes; if you like, you are free to join. I'm not sure yet what I'll be working on precisely, it depends on what gets done faster...

I can show you our new toys!

@peterwilli Thanks, such great work!

Do you think there could be a tool a bit like blip caption generation, that instead does some sort of evolutionary search for prompt generation using existing tokens?

It'd be crude, ineffective and slow, but I'm imagining it'd start with a blip generated caption and by evolutionary search, randomly change and jiggle the tokens to improve the match?

Or, perhaps use some gradient descent like textual inversion on one token at a time, and substitute a new token that more closely matches the calculated one?

This way it'd keep the natural human language link even though the prompt would end up looking really weird and unreadable! It'd basically be an image > text > image autoencoder so it is rather ambitious.

I also have only a poor conceptual understanding of how seeds work in textual inversion training, so what I'm suggesting could be completely unworkable. Does training effectively happen at CFG = 30 so the prompt dominates all seeds?

Hey @Luke2642 !

Do you think there could be a tool a bit like blip caption generation, that instead does some sort of evolutionary search for prompt generation using existing tokens?

Funny you should mention it, this was my first approach! The answer is: yes, I think you can, and I tried. I had never heard of BLIP, but my first try was with CLIP interrogation.

The main roadblock you will hit is that gradient descent is very good at differentiable problems (kind of like when people say "you're warm, warmer, hot!" the closer you get to your objective).

Guessing tokens is a non-differentiable problem, because it's unclear how right or wrong you are (or will be) by changing a single token to something else.

So there are 2 options:

  1. Turn guessing tokens into a differentiable problem (I guess that's what textual inversion is)
  2. Use an optimizer that can handle non-differentiable problems (evolutionary algorithms)

Eventually, I moved away from this idea, because I realized that textual inversion already solved it for me by turning it into a differentiable problem (or at least, something very close to the original idea).
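
For completeness, option 2 can be as simple as a random hill-climb over token ids; score_prompt below is a placeholder black-box scorer (for example a CLIP image-text similarity), and no gradients are involved, only better/worse comparisons:

import random

def hill_climb(vocab_size, prompt_len, score_prompt, steps=1000):
    # Start from a random prompt and mutate one token at a time.
    tokens = [random.randrange(vocab_size) for _ in range(prompt_len)]
    best = score_prompt(tokens)
    for _ in range(steps):
        candidate = list(tokens)
        candidate[random.randrange(prompt_len)] = random.randrange(vocab_size)
        score = score_prompt(candidate)
        if score > best:  # keep the mutation only if it scores higher
            tokens, best = candidate, score
    return tokens, best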

Or, perhaps use some gradient descent like textual inversion on one token at a time, and substitute a new token that more closely matches the calculated one?

Could you clarify here? You mean you want to train a single concept on more than 1 word?

I also have only a poor conceptual understanding of how seeds work in textual inversion training, so what I'm suggesting could be completely unworkable. Does training effectively happen at CFG = 30 so the prompt dominates all seeds?

Don't put yourself down like that! The only stupid questions are no questions.

@peterwilli Thanks for explaining, I really appreciate it!

I was thinking of CLIP interrogation; I don't know why I said BLIP! So yes, the short version of my question is "can CLIP interrogation generated captions be improved by a new form of training?"

I do understand that tokens are discrete and that jumping between them isn't differentiable; I was thinking only of an evolutionary search too.

However, in the 768-dimensional embedding space, each token must have "nearest neighbours" by some measure?

So starting with just one image and one ~75-token prompt, the training could identify nearest neighbours for each token, substitute them in one at a time, and evaluate whether the generated image is closer to or further from the target.

If the tokens have a "six degrees of separation" kind of property, this substitution process would work unexpectedly quickly!
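
A minimal sketch of that neighbour lookup, assuming the SD 1.x text encoder and using cosine similarity over its token embedding matrix ("woman</w>" is just an example token in CLIP's vocabulary):

import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# [vocab_size, 768] token embedding matrix of the text encoder.
embeds = text_encoder.get_input_embeddings().weight.detach()
token_id = tokenizer.convert_tokens_to_ids("woman</w>")
sims = torch.nn.functional.cosine_similarity(embeds[token_id], embeds, dim=-1)
top = sims.topk(6).indices.tolist()
# First hit is the token itself, the rest are its nearest neighbours.
print(tokenizer.convert_ids_to_tokens(top))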

I realise textual inversion is just "better" in every sense, and I've had good results with it, but a trained TI is not intelligible; it's a black box. The link to the natural human-language prompt is completely gone, and all you can do is add or reduce attention with modifiers like ( ) and [ ].

@Luke2642 Said another way, you're wondering if there's a way to take a textual inversion of a certain vector length, and translate it to discrete tokens of length+x. Am I understanding you correctly?

@AI-Casanova I wasn't, but the result would be pretty much the same! So you're suggesting training a TI (continuous, gradient descent) first, then having a new process convert it into a long string of normal-language words, i.e. discrete tokens, for "normal" prompting (even if they turn out to be 75 random words and word-parts of unintelligible gobbledegook!)

@Luke2642 rereading this when I'm not half asleep.

You're interested in creating a BLIP/CLIP substitute that's not agnostic to the internally embedded weights of the text encoder, and can create viable prompts instead of simple descriptions.

On the topic of nearest neighbors, have you tried https://github.com/tkalayci71/embedding-inspector ?

I find it quite useful for finding 'synonymous' tokens.

@AI-Casanova Or, put more simply, a better method of turning any image or concept into a prompt that can be easily shared with just words!

Yes, the embedding inspector is fun to play with. It's amazing how many different concepts can be squeezed into just one token.

Anyway, I doubt decomposing an embedding into a linear combination of existing tokens is a new idea. Let's wait and see if the real experts have any more ideas!

@Luke2642 oh awesome, thanks for the tag! Now to see if I can switch the implementation to 1.5

@Luke2642 Oh wow, that is super cool! It's funny because that was my first idea with LEAP 😂 but I never got it to be better than CLIP interrogation.

Also, sorry everyone for lack of replies on this issue. I have been busy on LEAP+Lora, a lot has happened, and it's hard to look back on the current version when I'm already on the new one...

[image]

For fun, I tried my own selfie with it.

[image]

It is still impressive, but there's clearly room for both methods! ❤️

If the tokens have a "six degrees of separation" kind of property, this substitution process would work unexpectedly quickly!

You're right! I haven't tried that yet. In my analysis, there were a few indices in the embedding space that were often active; I think one of them is for faces. So it's definitely separable in some way.

@peterwilli Oh absolutely, it's crude, but a great proof of concept. The graph in the paper shows it works better with more tokens, up to 64. Do you have the skills to script up a hyperparameter search in the Colab? I'll have to read the paper more carefully. Playing with the JSON, it seems tolerant of a much lower LR and weight decay, and of bigger batches. I added PIL image loading to use in Colab, and it works with multiple images.

The order of prompt words normally affects generation too; I haven't explored why yet. Do you understand the diffusion process enough to know?

So, PEZ could:

  • jiggle the order
  • start with a CLIP interrogation, and add wildcards or additional tokens
  • be extended with negative prompts
  • use annealing of some form as well as decay, and add extra weight to tokens that have survived > X steps

Just scratching the surface!

Things I plan to try, as soon as I get the chance:

  • Setting the embeddings equal to a token only every n steps instead of every one.
  • Returning the top n cosine similarities.

If I see a pattern in the top cosine similarities, I'd want to freeze the ones in common, and train the rest, possibly to convergence? I'd need help with that though.
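
A sketch of the first two points combined, assuming a soft embedding trained with the usual TI loss (loss_fn, vocab_embeds, and the optimizer are placeholders): the vector trains freely and is only snapped to its nearest vocabulary token every project_every steps, printing the top-k cosine similarities along the way.

import torch
import torch.nn.functional as F

def train(soft_embed, vocab_embeds, loss_fn, optimizer, steps, project_every=50, top_k=5):
    for step in range(1, steps + 1):
        optimizer.zero_grad()
        loss_fn(soft_embed).backward()
        optimizer.step()

        if step % project_every == 0:
            with torch.no_grad():
                # Cosine similarity of the soft embedding against every vocab token.
                sims = F.cosine_similarity(soft_embed, vocab_embeds, dim=-1)
                top = sims.topk(top_k)
                print(f"step {step}: top-{top_k} token ids {top.indices.tolist()}")
                # Snap to the single nearest real token, then keep training.
                soft_embed.copy_(vocab_embeds[top.indices[0]])
    return soft_embed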

@AI-Casanova I usually code hyperparameter searches with Optuna; control is very fine-grained, but not the choice of algorithms (it uses a mix at will, from what I understood).

An example of this is here, and could be used in colab: https://github.com/peterwilli/sd-leap-booster/blob/lora-test-7/training/train_lora.py#L165
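
For reference, a minimal Optuna sketch of that kind of search; the search space and run_training are placeholders standing in for whatever the LEAP training entry point actually exposes:

import optuna

def objective(trial):
    # Hypothetical search space; adjust to the real training knobs.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16])
    # run_training is a placeholder that should return a validation loss.
    return run_training(lr=lr, weight_decay=weight_decay, batch_size=batch_size)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)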

However, for an example like that, I'd suggest sticking with PyTorch but swapping the optimizer for something designed for non-differentiable problems. Back when I was a naive tween, I used this: https://github.com/atgambardella/pytorch-es

I know it's old AF, but hey, you live a day, you learn a day! Just kidding, there are probably more modern and popular repos these days.

My point is, it generally gives us more control. I'd definitely be interested in trying this out with you guys, especially since it's basically my original idea. I gave up on it for no particular reason; I guess I found it too much effort to dig into old stuff again, or maybe I wanted to learn something new! If you're OK with it, I can make a repo and a starter notebook, perhaps dig up some old stuff from before this repo, and add you all as co-owners.