Streaming intermediate images?
sabetAI opened this issue · 15 comments
Is it possible to publish an update of the model that supports streaming intermediate images during reverse diffusion, i.e. with an iterator? It would greatly improve the UX if users could see their image forming while they wait for the process to finish.
This isn't a diffusion model so that wouldn't work
Diffusion models iteratively update the image over multiple steps. These intermediate iterates can be streamed out (e.g. see the GLIDE demo). 'Reverse diffusion' is simply the image generation step ('diffusion' is the noising process used during training), which is what your model is doing during inference. Can you update the code to output intermediate images?
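What I mean by streaming iterates, as a minimal sketch: expose each update step through a generator so a UI can render frames as they arrive. The `denoise_step` function here is a hypothetical stand-in, not the actual model API.

```python
import numpy as np

def denoise_step(image, rng):
    # Hypothetical stand-in for one reverse-diffusion update step.
    return image * 0.5 + rng.random(image.shape) * 0.5

def generate_with_intermediates(num_steps=8, size=16, seed=0):
    """Yield the partial image after every update step."""
    rng = np.random.default_rng(seed)
    image = rng.random((size, size))  # placeholder initial noise
    for step in range(num_steps):
        image = denoise_step(image, rng)
        yield step, image.copy()

# A UI loop can consume intermediates as they are produced:
frames = [img for _, img in generate_with_intermediates()]
```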
Using the term 'reverse diffusion' might have caused some confusion with what I was asking.
This model is not like glide or VQGAN+CLIP.
DALL-E works on an entirely different principle. The image is generated with tiny squares (tokens), square by square, from left to right and top to bottom. It does not change the whole image at every iteration like diffusion models do. At every iteration it just fills in another tiny bit of the empty area with a completely finished tiny portion of the final image.
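To illustrate the fill order being described, here's a toy sketch of raster-scan autoregressive generation over a small token grid (the real model uses a 32×32 grid of image tokens; the token values here are placeholders):

```python
def raster_fill(rows=4, cols=4):
    """Fill a token grid one cell at a time, left to right, top to bottom."""
    grid = [[None] * cols for _ in range(rows)]
    order = []
    for i in range(rows * cols):
        r, c = divmod(i, cols)  # raster-scan position of the i-th token
        grid[r][c] = i          # the finished token for this cell
        order.append((r, c))
    return grid, order

grid, order = raster_fill()
# order[0] == (0, 0), order[-1] == (3, 3)
```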
Ah good point @iScriptLex, I made assumptions about the model architecture. Even if it's outputting autoregressively, tokens can still be streamed out to incrementally update a canvas a piece at a time. The main use-case here is to show intermediate results to the user, as waiting kills the UX.
It might be possible to generate the images each time a row of tokens is decoded, and use some kind of blank token for the missing rows
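The blank-token idea above could look something like this sketch, where `decode` is a hypothetical stand-in for the VQGAN detokenizer and `BLANK` is an assumed placeholder token id, neither taken from the actual codebase:

```python
import numpy as np

BLANK = 0        # assumed placeholder token id
ROWS, COLS = 16, 16

def decode(token_grid):
    # Stand-in detokenizer: maps each token id to a gray value.
    return np.asarray(token_grid, dtype=float) / 255.0

def partial_image(token_rows):
    """Pad the missing rows with BLANK tokens and decode the full grid."""
    padding = [[BLANK] * COLS] * (ROWS - len(token_rows))
    return decode(list(token_rows) + padding)

# e.g. 4 of 16 rows decoded so far:
img = partial_image([[100] * COLS] * 4)
```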
@kuprel yes exactly. Also would it be more efficient just to stream rows of tokens and have the client handle everything else? Want to minimize latency that streaming may add.
this would still look cool while it's loading, but I worry about latency and bandwidth; wouldn't a loading bar or something work just as well?
Ok, I got it working in the colab; now I just have to figure out how to get it on replicate. An intermediate image count of 8 only adds a couple of seconds to the overall decoding time on the P100.
I merged it. You can try it in the colab. Hopefully will get it onto replicate by tomorrow
Ok it's live on replicate now