PixArt-alpha/PixArt-sigma

Question about VAE adaptation stage

piddnad opened this issue · 7 comments

Thank you for sharing such impressive work!

I am particularly interested in the VAE adaptation stage mentioned in the paper, which is said to have been conducted at 256x256 resolution.

I'm wondering, was this done by loading the PixArt-alpha pretrained weights from the high-aesthetics stage, and then using the 33M internal-sigma data for adaptation? Is my understanding correct?

I have tried training from the PixArt-alpha 256-SAM weights, replacing the SD1.5 VAE with the SDXL VAE and training on SAM data, but it seems difficult to converge in the short term (10k steps so far). Do you know what the problem might be?
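
Concretely, the swap I tried looks roughly like this (a minimal sketch; the checkpoint path is a placeholder for my local 256-SAM weights, and I assume the diffusers `Transformer2DModel`/`AutoencoderKL` classes):

```python
import torch
from diffusers import AutoencoderKL, Transformer2DModel

# Load the pretrained PixArt-alpha transformer from the 256-SAM stage
# ("path/to/pixart-256-sam" is a placeholder for my local checkpoint).
transformer = Transformer2DModel.from_pretrained(
    "path/to/pixart-256-sam", subfolder="transformer"
)

# Replace the SD1.5 VAE with the SDXL VAE before continuing training.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
vae.requires_grad_(False)  # only the transformer is being adapted

# The two VAEs use different latent scaling factors (0.18215 vs 0.13025),
# so the factor applied when encoding images to latents must change too.
print(vae.config.scaling_factor)
```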

Thank you in advance for your response.

> I'm wondering, was this done by loading the PixArt-alpha pretrained weights from the high-aesthetics stage, and then using the 33M internal-sigma data for adaptation? Is my understanding correct?

This is correct.
Please share some results of your training here.

Here are some of my results. The generated images show some blocking artifacts and blurring:

prompt: A lovely young lady, with a smile on her face...
[image]

prompt: city skyline at night...
[image]

Which training and test code are you using?

> Which training and test code are you using?

I'm using training and validation code based on PixArt-alpha's train_diffusers.py.

Check your VAE and scale_factor. BTW, training with diffusers is not stable; that's why we haven't migrated the whole codebase to diffusers.
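
For reference, a rough sanity check of the scale_factor point (a sketch assuming the diffusers `AutoencoderKL` API): the SDXL VAE's scaling_factor is 0.13025 while SD1.5's is 0.18215, so a hard-coded old value mis-scales every latent fed to the transformer, which would match slow and blurry convergence.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").eval()

@torch.no_grad()
def encode(pixels: torch.Tensor) -> torch.Tensor:
    # pixels: (B, 3, H, W), scaled to [-1, 1]
    latents = vae.encode(pixels).latent_dist.sample()
    # Read the factor from the VAE config rather than hard-coding it:
    # 0.13025 for the SDXL VAE, not SD1.5's 0.18215.
    return latents * vae.config.scaling_factor

@torch.no_grad()
def decode(latents: torch.Tensor) -> torch.Tensor:
    # Invert the same factor on the way out.
    return vae.decode(latents / vae.config.scaling_factor).sample
```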

Hello, I have carefully reviewed the VAE code and switched to the original training code, but the results have not changed.

However, I have conducted a few more adaptation experiments with the SDXL VAE, and there are some interesting findings to share.

The experimental setups were as follows (also summarized in the sketch after the list):

  1. Loading PixArt-256-SAM and training on SAM data (the initial experiment)
  2. Loading PixArt-256-AES and training on SAM data
  3. Loading PixArt-256-SAM and training on data similar to JourneyDB with high aesthetic scores
  4. Loading PixArt-256-AES and training on data similar to JourneyDB with high aesthetic scores
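
In other words, the four runs form a 2x2 grid over initialization weights and training data. A small bookkeeping sketch (checkpoint and dataset names are placeholders for my local setup):

```python
# 2x2 grid: initialization weights x training data (names are placeholders).
EXPERIMENTS = {
    1: {"init": "PixArt-256-SAM", "data": "SAM"},
    2: {"init": "PixArt-256-AES", "data": "SAM"},
    3: {"init": "PixArt-256-SAM", "data": "JourneyDB-like, high aesthetics"},
    4: {"init": "PixArt-256-AES", "data": "JourneyDB-like, high aesthetics"},
}
```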

Through these experiments, I observed an interesting pattern: in terms of final generation quality after the adaptation, 4 > 3 ≈ 2 >> 1. I therefore speculate that both a high-aesthetic pretrained model and high-aesthetic data are beneficial to the VAE adaptation.

Below are some visual examples (from left to right: setups 4, 2, 3, 1):
[image]

Cool. Pretty interesting results. Thanks a lot for sharing.