Questions about implementation
bonlime opened this issue · 4 comments
Hi, first of all thanks for a useful library. I've been looking into your implementation of prompt weighting and have some questions about it (I'm only interested in the `get_embeddings_for_weighted_prompt_fragments` function, without blending etc.).
- If you have a separate function for handling weights < 1, why are these weights also used in the first call to `build_weighted_embedding_tensor`?
- The logic for handling negative cases makes much more sense to me; why not adapt the same for positive weights?

I've tried changing your implementation by adopting a similar strategy for weights > 1, and it seems to give much more consistent results.
There is another implementation suggestion. Currently you're calculating `embedding_without_this` by removing the weighted piece, which leads to a significant change in the whole final embedding. I've observed that if you instead mask the tokens by passing an `attention_mask` to the `text_encoder`, the overall embedding changes less, giving a more precise "direction" of change.
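Roughly what I mean (the model id and the position of `ball` are only illustrative; this is a sketch rather than a drop-in patch for your code):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

enc = tokenizer("a cat playing with a ball", padding="max_length",
                max_length=tokenizer.model_max_length, return_tensors="pt")

with torch.no_grad():
    # full embedding: every token visible
    full = text_encoder(enc.input_ids, attention_mask=enc.attention_mask).last_hidden_state

    # "without this": hide the weighted piece ("ball", position 6 here) instead of
    # deleting it, so the sequence length and all other token positions stay unchanged
    masked = enc.attention_mask.clone()
    masked[0, 6] = 0
    without = text_encoder(enc.input_ids, attention_mask=masked).last_hidden_state
```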
> currently you're calculating `embedding_without_this` by removing the weighted piece
Thanks for the suggestion - yes, I've since become aware of this and it's on the roadmap to change at some point. I did not have much luck using `attention_mask` (cf. huggingface/diffusers#1890), but I was going to try substituting `<|pad|>` tokens for the omitted tokens instead. But do you have a working example you could share? Perhaps a pull request I could merge in?
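For reference, what I have in mind for the pad substitution is roughly this (untested; the model id and token position are illustrative):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

enc = tokenizer("a cat playing with a ball", padding="max_length",
                max_length=tokenizer.model_max_length, return_tensors="pt")

# overwrite the omitted tokens ("ball", position 6 here) with the pad token,
# keeping the sequence length and every other token's position intact
ids = enc.input_ids.clone()
ids[0, 6] = tokenizer.pad_token_id  # for SD's CLIP tokenizer this is <|endoftext|>
with torch.no_grad():
    without_ball = text_encoder(ids, attention_mask=enc.attention_mask).last_hidden_state
```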
However, I'm not sure I understand the first two questions:

> the logic for handling negative cases makes much more sense to me; why not adapt the same for positive weights?

Which "negative cases" do you mean?
> if you have a separate function for handling weights < 1 ...

For 1. there isn't a separate function - what's happening here is a blend. For example, `a cat playing with a (ball)0.8` is (roughly) equivalent to `("a cat playing with a ball", "a cat playing with a").blend(1, 0.8)`, and the `ball` in the first part of the blend has its weight multiplied by 0.8. The weighting `0.8` is applied to build `base_embedding`, and then an additional embedding without `ball` is constructed and blended with that.
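In simplified form, a blend is just a normalized weighted combination of the per-prompt embedding tensors. A rough sketch (not compel's exact code, which also handles normalization options and other details):

```python
import torch

def blend(embeddings: list[torch.Tensor], weights: list[float]) -> torch.Tensor:
    # simplified: a normalized weighted sum of same-shaped embedding tensors
    total = sum(weights)
    return torch.stack([e * (w / total) for e, w in zip(embeddings, weights)]).sum(dim=0)

# ("a cat playing with a ball", "a cat playing with a").blend(1, 0.8) then becomes, roughly:
# blended = blend([base_embedding, embedding_without_ball], [1.0, 0.8])
```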
OK, I understand what you meant with the mask now. That makes a lot of sense - I'll try and get it in for the next release.
Since compel v1.0.0, downweighting masks rather than removes tokens by default - thanks for the suggestion.
@damian0815 Glad to see you adopted the suggestion so quickly!
> but I was going to try substituting `<|pad|>` tokens for the omitted tokens instead
This is an interesting idea. Since I wrote to you, I've found that while masking works much better than removing part of the prompt, one property is not preserved: setting a weight of 0 gives an image that differs from the one you get by simply removing that part of the prompt. So maybe your approach with `<|pad|>` may be better - have you experimented with it?
Also, I was thinking about hacky things like calculating the embedding for the empty prompt `""`, taking a token-wise average of it, and using the result as a substitute for the masked tokens. But this is just a thought - I haven't tried it yet.
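Something along these lines (untested; the encoder id and the masked position are illustrative):

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def encode(prompt: str) -> torch.Tensor:
    enc = tokenizer(prompt, padding="max_length",
                    max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(enc.input_ids,
                            attention_mask=enc.attention_mask).last_hidden_state

# token-wise average of the empty prompt's embedding -> a single "neutral" vector
neutral = encode("").mean(dim=1)  # shape (1, hidden_size)

# substitute that vector at the positions of the zero-weighted tokens
emb = encode("a cat playing with a ball").clone()
emb[0, 6] = neutral[0]  # "ball" sits at position 6 in this tokenization
```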