kuprel/min-dalle

Tokens: Relationship between tokens wrt image output

dza6549 opened this issue · 2 comments

Hi kuprel

Thank you for making your work public.

Can you please describe how the tokenizer works? I looked at the logs and my guess is that the text prompt is being parsed one word at a time. I can see that some words are split into two or three tokens. Is the array of tokens then converted into a single array, and is that array fed to an encoder?

Are you able to shed some light on the relationships between the tokens in the 'bag of tokens' please? For example, if I use 'Salvador Dali' in my prompt, does the tokenizer break that phrase into single words so that 'Dali' is seen as more important than 'Salvador'? Or does the model see the word pair 'Salvador' + 'Dali' and weight the pair as a whole when influencing the output image?

Thank you again

Cheers

If you put a print(subwords) statement here, it becomes clearer what the tokenizer is doing. Essentially, it greedily merges neighboring subwords as long as the merge pair appears in merges.txt. A text prompt is tokenized into a sequence of tokens (not a bag of tokens). That sequence is fed into a fine-tuned BART transformer, which generates a sequence of image tokens. The image tokens are then detokenized by a VQGAN into an image.
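In case it helps, here is a minimal sketch of that greedy pair-merging idea. It is not the repo's actual TextTokenizer: the merge_subwords function and the example merge ranks are invented for illustration, whereas the real tokenizer loads its merge table from merges.txt and handles vocabulary lookup and other details this sketch skips.

```python
# Minimal sketch of greedy BPE-style merging (illustrative only).
def merge_subwords(word: str, merge_ranks: dict) -> list:
    """Split a word into characters, then repeatedly merge the adjacent
    pair with the best (lowest) rank until no known pair remains."""
    subwords = list(word)
    while len(subwords) > 1:
        # Score every adjacent pair; pairs not in the merge table get an infinite rank.
        pairs = [
            (merge_ranks.get((subwords[i], subwords[i + 1]), float("inf")), i)
            for i in range(len(subwords) - 1)
        ]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # nothing left to merge
        subwords = subwords[:i] + [subwords[i] + subwords[i + 1]] + subwords[i + 2:]
        print(subwords)  # mirrors the suggested print(subwords) debug statement

    return subwords

# Hypothetical merge ranks; the real ones would come from merges.txt.
ranks = {("d", "a"): 0, ("l", "i"): 1, ("da", "li"): 2}
print(merge_subwords("dali", ranks))  # -> ['dali']
```

Running this prints the intermediate subword lists (['da', 'l', 'i'], then ['da', 'li'], then ['dali']), which is the same kind of trace you would see from the real tokenizer when a word is built up from smaller pieces.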

Thank you kuprel for taking the time to explain this to me. I'm using the Replicate version, so I don't think I can check merges.txt, but I think I get the idea: the order of the tokens matters, and we get a different 'type' of image if we re-order the same finite set of tokens. I will do some experiments to confirm this conjecture for myself. Thanks kuprel, this is a lot of fun to play with! Cheers