Super resolution is a fundamental problem in computer vision that aims at generating high-resolution images from low-resolution inputs. In recent years, deep learning approaches have shown promising results in addressing this challenge. In this project, we present a comprehensive review of deep learning-based super resolution methods, with a particular focus on GAN-based architectures such as SRGAN and ESRGAN. We discuss the key components of state-of-the-art super resolution networks, including feature extraction, feature fusion, and reconstruction. We also tackle the DIV2K dataset using these approaches while exploring the different training strategies and loss functions that have been proposed to enhance the performance of super resolution models. Furthermore, we provide an overview of benchmark datasets and evaluation metrics for our super resolution models.
The primary objective is to train a generating function $G$ that estimates, for a given low-resolution input image, its corresponding high-resolution counterpart.
We introduce a discriminator network $D$ that is trained in alternation with $G$ to distinguish super-resolved images from real high-resolution images.
The main idea of using such a formulation is to train a generative model $G$ with the goal of fooling a differentiable discriminator $D$, encouraging $G$ to produce solutions that are hard to distinguish from natural high-resolution images.
The main contribution introduced by the authors of the SRGAN paper is the use of the GAN architecture and the introduction of a novel perceptual loss function that consists of an adversarial loss and a content loss based on VGG feature maps. This loss function is shown to be much better than MSE at capturing perceptually relevant differences.
$$l^{SR} = \underbrace{6 \cdot 10^{-3}\, l_{VGG/5,4}^{SR}}_{\text{content loss}} + \underbrace{10^{-3}\, l_{Gen}^{SR}}_{\text{adversarial loss}}$$
where the generative (adversarial) loss $l_{Gen}^{SR}$ is defined based on the probabilities the discriminator $D_{\theta_D}$ assigns to the super-resolved samples $G_{\theta_G}(I^{LR})$ over all $N$ training samples:
$$l_{Gen}^{SR} = -\sum_{n=1}^{N} \log D_{\theta_D}(G_{\theta_G}(I^{LR}))$$
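To make these terms concrete, the following is a minimal PyTorch sketch of how the VGG-based content loss and the adversarial generator loss can be computed. The slice index used to truncate the VGG-19 feature extractor and the helper names (`content_loss`, `adversarial_loss`) are our own illustrative choices, not the exact implementation used in the paper, and the adversarial term is averaged rather than summed over the batch.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Feature extractor truncated after the activation of conv5_4
# (the "VGG 5,4" features); the slice index is an assumption about
# torchvision's vgg19 layer layout (requires torchvision >= 0.13).
vgg = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in vgg.parameters():
    p.requires_grad = False

mse = nn.MSELoss()

def content_loss(sr, hr):
    # MSE between VGG feature maps of the super-resolved and ground-truth images.
    return mse(vgg(sr), vgg(hr))

def adversarial_loss(d_sr):
    # Non-saturating generator loss -log D(G(I_LR)), assuming d_sr is the
    # discriminator's sigmoid output on the super-resolved batch.
    return -torch.log(d_sr + 1e-8).mean()
```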
Our approach consisted of implementing the SRGAN architecture proposed by the authors while making changes to the perceptual loss function and the training process.
Modified perceptual loss function for SRGAN:
Through intuition and experimentation, we introduce the following perceptual loss function:
$$l_{modified}^{SR} = \underbrace{l_{MSE}^{SR}}_{\text{pixel loss}} + \underbrace{6 \cdot 10^{-3}\, l_{VGG/5,4}^{SR}}_{\text{content loss}} + \underbrace{10^{-3}\, l_{Gen}^{SR}}_{\text{adversarial loss}} + \underbrace{2 \cdot 10^{-8}\, l_{TV}^{SR}}_{\text{total variation}}$$
Intuition:
- We introduce the MSE loss to penalize differences in pixel space between $I^{SR}$ and $I^{HR}$. By minimizing this difference, the super-resolved image achieves more accurate color fidelity with respect to the ground-truth image. In addition to the adversarial and content losses, the MSE loss provides more direct control over image quality and is computationally cheap. Its introduction therefore leads to improved image quality and greater pixel-wise fidelity between the super-resolved image and the ground truth.
- Inspired by style-transfer GANs, we introduce the TV loss to reduce noise in $I^{SR}$. The TV loss acts as a regularization term that measures the overall variation of intensities in the image, promoting spatial smoothness and reducing noise. In the context of SISR, it can improve the quality of the super-resolved image by suppressing noise that may arise from the low-resolution input. However, since the TV loss can also smooth out textures, we assign its component a very low weight to avoid crushing textures during super-resolution. This preserves high-frequency details in the super-resolved image while still benefiting from the noise reduction the TV loss provides.
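As a minimal sketch of the formula above, the snippet below implements the total variation term and the weighted combination; it reuses the hypothetical `content_loss` and `adversarial_loss` helpers from the earlier sketch and assumes tensors in NCHW layout.

```python
import torch
import torch.nn.functional as F

def tv_loss(img):
    # Anisotropic total variation: mean absolute difference between
    # neighbouring pixels along height and width.
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return dh + dw

def modified_perceptual_loss(sr, hr, d_sr):
    # Weighted sum of pixel, content, adversarial and TV terms,
    # using the coefficients from the formula above.
    return (F.mse_loss(sr, hr)
            + 6e-3 * content_loss(sr, hr)
            + 1e-3 * adversarial_loss(d_sr)
            + 2e-8 * tv_loss(sr))
```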
Instead of pre-training the generator network using MSE loss alone as the authors did, we experimented with a weighted MSE combined with additional loss terms during the pre-training stage.
- LR_CROPPED_SIZE = 24: The size of the low-resolution cropped input image used for training.
- UPSCALE = 4: The upscaling factor used to generate the high-resolution output image from the low-resolution input image.
- HR_CROPPED_SIZE = UPSCALE * LR_CROPPED_SIZE: The size of the high-resolution cropped output image used for training.
- BATCH_SIZE = 16: The number of image samples in each mini-batch during training.
- EPOCHS = 50: The number of epochs (i.e., complete passes through the training dataset) used for training.
- LR = 0.0001: The learning rate used for training the model.
- BETAS = (0.5, 0.9): The values of the beta1 and beta2 hyperparameters used by the Adam optimizer during training.
- adversarial_loss_coef = 0.001: The coefficient for the adversarial loss used during training.
- vgg_loss_coef = 0.006: The coefficient for the VGG loss used during training.
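For concreteness, here is a minimal sketch of how these hyperparameters might be wired into the two Adam optimizers; the `generator` and `discriminator` modules shown are placeholders, not the actual SRGAN networks.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the SRGAN generator and discriminator.
generator = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
discriminator = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1))

LR = 1e-4
BETAS = (0.5, 0.9)

# One Adam optimizer per network, both sharing the learning rate and betas above;
# BATCH_SIZE, EPOCHS and the crop sizes govern the data loader and training loop.
optim_g = torch.optim.Adam(generator.parameters(), lr=LR, betas=BETAS)
optim_d = torch.optim.Adam(discriminator.parameters(), lr=LR, betas=BETAS)
```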
The models were trained on the DIV2K dataset, which contains:
- 1600 training images of varying sizes, split into 800 high-resolution images and their corresponding ×4 bicubic-downscaled low-resolution counterparts.
- 200 test images, split into 100 high-resolution images and their corresponding low-resolution counterparts.
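As a sketch of how paired samples can be drawn from these folders, the `Dataset` below yields aligned LR/HR center crops (24 and 96 pixels respectively, matching the crop sizes above); the directory layout and class name are assumptions, not the exact pipeline we used.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor

class DIV2KPairs(Dataset):
    """Yields aligned (LR, HR) center crops of size 24 and 96 respectively."""

    def __init__(self, lr_dir, hr_dir, lr_crop=24, upscale=4):
        self.lr_paths = sorted(Path(lr_dir).glob("*.png"))
        self.hr_paths = sorted(Path(hr_dir).glob("*.png"))
        self.lr_crop, self.upscale = lr_crop, upscale

    def __len__(self):
        return len(self.lr_paths)

    def __getitem__(self, idx):
        lr = Image.open(self.lr_paths[idx]).convert("RGB")
        hr = Image.open(self.hr_paths[idx]).convert("RGB")
        # Center crop only (no further augmentation), with the HR crop
        # covering exactly the x4-scaled region of the LR crop.
        x = (lr.width - self.lr_crop) // 2
        y = (lr.height - self.lr_crop) // 2
        lr = lr.crop((x, y, x + self.lr_crop, y + self.lr_crop))
        s = self.upscale
        hr = hr.crop((x * s, y * s, (x + self.lr_crop) * s, (y + self.lr_crop) * s))
        return to_tensor(lr), to_tensor(hr)
```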
In addition to DIV2K's test set, we also evaluated our models on the Set5 and Set14 benchmark datasets. The performance of the models was evaluated using quantitative metrics, namely peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), as well as visual quality.
Using our own implementation of the discussed quality metrics, we evaluate our SRGAN and compare it with the results of the original SRGAN paper and with our ESRGAN.
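For reference, the snippet below is a minimal sketch of the two metrics, assuming single-channel (e.g. luminance) float arrays in [0, 1]; the SSIM here is a simplified single-scale variant with a uniform window rather than the Gaussian-weighted one.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def psnr(hr, sr, max_val=1.0):
    # Peak signal-to-noise ratio in dB.
    mse = np.mean((hr - sr) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

def ssim(hr, sr, max_val=1.0, win=7):
    # Simplified single-scale SSIM with a uniform filter window.
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_x = uniform_filter(hr, win)
    mu_y = uniform_filter(sr, win)
    var_x = uniform_filter(hr * hr, win) - mu_x ** 2
    var_y = uniform_filter(sr * sr, win) - mu_y ** 2
    cov = uniform_filter(hr * sr, win) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return np.mean(num / den)
```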
From the quantitative results displayed in the table below, we can see that our SRGAN model compares fairly well with the results of the authors of the original SRGAN paper, who used a much larger training set (350k images sampled from ImageNet) and trained their model for much longer. We can also see the superiority of the ESRGAN model in each case. Indeed, the enhanced architecture and the subtle but important technical modifications the authors introduced seem to improve the results quantitatively. However, it remains unclear whether this improvement over previous methods is mostly due to the subtle architectural and theoretical changes or to the sheer fact that the generator network of ESRGAN is much deeper and involves many more convolutions.\
Note that for ground-truth HR images, PSNR is infinite and SSIM equals 1 by definition.
| | SRGAN (ours) | SRGAN (Ledig et al.) | ESRGAN |
|---|---|---|---|
| DIV2K | | | |
| PSNR | 25.373 | - | 28.174 |
| SSIM | 0.706 | - | 0.775 |
| Set5 | | | |
| PSNR | 24.945 | 29.40 | 30.474 |
| SSIM | 0.718 | 0.8472 | 0.851 |
| Set14 | | | |
| PSNR | 23.770 | 26.02 | 26.614 |
| SSIM | 0.636 | 0.7397 | 0.713 |
Qualitative results are in accordance with the results obtained from the analytical quality measures. Our qualitative results demonstrate the ability of SRGAN to retrieve fine details and textures that were missing in the original low-resolution images. The high-resolution images produced by SRGAN show sharp, well-defined edges and feature relatively smooth transitions in color and texture. Overall, our results confirm the effectiveness of the proposed perceptual loss and pre-training procedure. However, the results are not perfect, nor as good as ESRGAN's. This is expected, since we only trained our model for 850 epochs on a relatively small dataset. \ On the other hand, ESRGAN exhibits very good performance in producing super-resolved images with rich details and textures. The generated images have high perceptual quality (from a human's point of view) and surpass the ones produced by SRGAN. The enhanced performance of ESRGAN can be attributed to its more robust discriminator, which uses a relativistic approach to evaluate the realism of the generated images, to the subtle changes made in the loss function and architecture, or to the fact that it is a much deeper network.
We assessed the effectiveness of SRGAN and ESRGAN (mostly SRGAN) on the DIV2K dataset and other benchmark sets. Both networks showed improvements over traditional interpolation methods as well as previous state-of-the-art super-resolution methods. The use of perceptual loss and deep residual learning in both SRGAN and ESRGAN has proven effective for super-resolving images, and the use of a pre-trained VGG network in SRGAN has proven effective at capturing high-level feature similarities between the super-resolved and ground-truth images. However, the comparison between the two networks has shown that ESRGAN performs better than SRGAN in terms of both quantitative metrics (PSNR, SSIM) and visual quality.
Moreover, using a modified version of the perceptual loss function proposed by Ledig et al., our SRGAN model performs reasonably well on the benchmark datasets, and so does ESRGAN, especially given the relatively low number of epochs we used; SR methods and GAN-based models in general require a lot of training time. We intend to continue training ESRGAN on DIV2K and to re-train SRGAN for longer and with a heavier augmentation pipeline, as no augmentations were applied in this project except a center crop. Exploring different datasets could also prove beneficial. We also intend to explore multi-frame super resolution methods like the one presented in Google's burst super-resolution work. These methods rely on fusing images of the same scene, generated by hand movements or intentional camera hardware movements, and merging their information to obtain a higher-resolution image.