Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

ComfyUI-sudo-latent-upscale

This project takes heavy inspiration from city96/SD-Latent-Upscaler and Ttl/ComfyUi_NNLatentUpscale. It upscales directly inside the latent space. Some models are for SD 1.5 and some are for SDXL. All models are trained on drawn content. I might add new architectures or update models at some point. I recommend the SwinFIR or DRCT models.

1.5 comparison: comparison

SDXL comparison: comparison

First row: the upscaled RGB image before VAE encoding (for RGB models) or the VAE-decoded image (for latent models). Second row: the final output after the second KSampler.
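To make "upscaling inside the latent space" concrete, here is a minimal sketch of the tensor shapes involved. It assumes the usual Stable Diffusion VAE layout (4 latent channels at 1/8 of the pixel resolution) and uses a nearest-neighbour repeat as a stand-in. The function name `naive_latent_upscale` and the 2x factor are illustrative and not taken from this repo, where a trained network does this step instead.

```python
import numpy as np

# Assumption: SD 1.5 / SDXL VAEs give 4-channel latents at 1/8 of the pixel size.
VAE_DOWNSCALE = 8
LATENT_CHANNELS = 4


def naive_latent_upscale(latent: np.ndarray, scale: int = 2) -> np.ndarray:
    """Nearest-neighbour upscale of an NCHW latent (placeholder for a trained model)."""
    return latent.repeat(scale, axis=2).repeat(scale, axis=3)


# A 512x512 image corresponds to a 64x64 latent.
side = 512 // VAE_DOWNSCALE
latent = np.random.randn(1, LATENT_CHANNELS, side, side).astype(np.float32)

upscaled = naive_latent_upscale(latent, scale=2)
print(upscaled.shape)  # (1, 4, 128, 128), i.e. 1024x1024 pixels after VAE decode
```

The upscaled latent is what would be passed to the second KSampler, so no VAE decode and re-encode is needed between the two sampling passes.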

Training Details

I took promising networks from existing papers and trained them with more exotic loss functions.

Further Ideas

Ideas I might test in the future:

  • Huber
  • Different Conv2D (for example MBConv)
  • Dropout prior to final conv
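Assuming the first idea refers to the Huber loss, a minimal NumPy sketch of it follows. It is quadratic for small errors and linear for large ones, which is the same definition `torch.nn.HuberLoss` uses. The function name and the `delta=1.0` default are illustrative choices, not settings from this repo.

```python
import numpy as np


def huber_loss(pred: np.ndarray, target: np.ndarray, delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    error = np.abs(pred - target)
    quadratic = 0.5 * error**2
    linear = delta * (error - 0.5 * delta)
    return float(np.mean(np.where(error <= delta, quadratic, linear)))


pred = np.array([0.5, 3.0])
target = np.array([0.0, 0.0])
print(huber_loss(pred, target))  # (0.125 + 2.5) / 2 = 1.3125
```

Compared with plain L2, the linear branch limits the influence of large outliers, and compared with plain L1 the quadratic branch gives smoother gradients near zero.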

Failure cases

  • Any kind of SSIM introduces instability. I tried 4-channel SSIM and MS-SSIM, as well as SSIM on the VAE-decoded image, and nothing worked. nonnegative_ssim=True does not seem to help either. Avoid SSIM to retain stability.
  • Three mistakes while creating the dataset: using vae.config.scaling_factor = 0.13025 (do not set a scaling factor; nnlatent used one and city96 did not, and I do not recommend it), feeding images in the range 0 to 1 (the image tensor must be in the range -1 to 1 before encoding with the VAE), and not using torch.inference_mode(). A combination of these can make training much less stable. Even if the loss goes down and seemingly converges, the final model won't be able to generate properly. Here is a correct example:
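import cv2
import torch
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_single_file("vae.pt").to(device)
vae.eval()

# f is the path to one training image
with torch.inference_mode():
  image = cv2.imread(f)  # BGR, uint8, HxWx3
  image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
  image = (
      torch.from_numpy(image.transpose(2, 0, 1))
      .float()
      .unsqueeze(0)
      .to(device)
      / 255.0
  )  # 1x3xHxW in the range 0 to 1
  # Rescale to -1..1 before encoding; no scaling factor is applied to the latent
  image = vae.encode(image*2.0-1.0).latent_dist.sample()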
  • DITN and OmniSR produced liquid-looking output at their official sizes. I do not recommend small or efficient networks.

  • HAT looked promising, but seemingly always had some kind of blur effect. I didn't manage to get a proper model yet.

hat

  • I tried to use Fourier as the first and last conv in DAT, but I haven't managed to train it properly yet. Getting the loss to converge seems hard.

fourier

  • GRL did not converge.

grl

  • SwinFIR with Prodigy 1 and Prodigy 0.1 caused massive instability. The images are from my attempt with Prodigy 1, L1 loss and EfficientNetV2-B0.

graphs

swinfir_prodigy
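As a sanity check for the input-range failure case listed above, the following sketch converts a uint8 RGB image to the -1 to 1 range in NumPy, so the conversion can be tested without loading a VAE. The helper name `to_vae_range` is hypothetical; it mirrors the `/ 255.0` and `*2.0-1.0` steps of the correct example.

```python
import numpy as np


def to_vae_range(image_rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to a 1x3xHxW float32 array in [-1, 1]."""
    if image_rgb.dtype != np.uint8:
        raise TypeError("expected a uint8 image")
    chw = image_rgb.transpose(2, 0, 1).astype(np.float32) / 255.0  # now in [0, 1]
    return chw[None] * 2.0 - 1.0  # add batch dim, rescale to [-1, 1]


# A single RGB pixel covering the darkest, a middle and the brightest value.
pixel = np.array([[[0, 128, 255]]], dtype=np.uint8)
batch = to_vae_range(pixel)
print(batch.shape)  # (1, 3, 1, 1)
```

A batch whose minimum is never below 0 across a whole dataset is a hint that the `*2.0-1.0` step was skipped.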