Real-World Image Super Resolution via Unsupervised Bi-directional Cycle Domain Transfer Learning-Based Generative Adversarial Network
Important Disclaimer Note
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Deep convolutional neural networks have exhibited impressive performance on image super-resolution by reconstructing a high-resolution (HR) image from a low-resolution (LR) image. However, most state-of-the-art methods rely heavily on two limiting assumptions: the training LR and HR images are paired, and the LR images are artificially produced by a known degradation kernel (i.e., bicubic downsampling), so that the networks can be trained in a fully supervised fashion. As a result, existing methods fail on real-world super-resolution tasks, since paired LR and HR images are typically unavailable in real-world scenes and the images are degraded by complicated and unknown kernels. To remove these restrictions, in this paper we propose the Unsupervised Bi-directional Cycle Domain Transfer Learning-based Generative Adversarial Network (UBCDT-GAN), which is able to super-resolve an HR image from a real-world LR image with complex and inevitable sensor noise in an unsupervised manner. Our proposed method consists of an Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN) and a Semantic Encoder guided Super Resolution Network (SESRN). First, the UBCDTN produces an approximated real-like LR image by transferring the LR image from the artificially degraded domain to the real-world LR image domain with natural characteristics. Second, the SESRN takes the approximated real-like LR image as input and super-resolves it to a photo-realistic HR image. Extensive experiments on unpaired real-world image benchmark datasets demonstrate that the proposed method achieves promising performance compared to state-of-the-art methods. The overview of our method is shown in Fig. 1.
Figure 1. The overview of the proposed UBCDT-GAN: In the first stage (left), the green dotted rectangle represents the Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN). The red path indicates the forward cycle module, given the input HR image
The main contributions of our proposed method can be summarized as follows:
- We propose a novel bi-directional cycle domain transfer network, UBCDTN. Following the domain transfer learning scheme, the designed bi-directional cycle architecture is able to eliminate the domain gap between generated real-like LR images and real-world LR images in an unsupervised manner.
- We further impose auxiliary constraints on the UBCDTN by incorporating an adversarial loss, an identity loss, and a perceptual loss, which guarantee that the real-like LR images share the same style as real-world images.
- We adopt the previously proposed SESRN as a deep super-resolution network to generate visually pleasing SR results under supervised learning settings.
- Benefiting from the collaborative training strategy, the proposed UBCDT-GAN can be trained in an end-to-end fashion, which eases the entire training procedure and strengthens the robustness of the model.
In this section, we describe all the details of the proposed UBCDT-GAN, which mainly consists of two networks. The first network, the Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN), performs the domain translation operation between two different image domains. It contains a forward cycle module and a backward cycle module; the pipeline is shown in Fig. 1. The second network, the SESRN, consists of a Semantic Encoder (SE), a Joint Discriminator, and a generator $G_{SR}$.
Figure 2. The proposed SESRN and its components: Semantic Encoder, Generator, and Joint Discriminator.
In the unsupervised super resolution problem, given a set of HR images and unpaired real-world LR images with unknown degradation, we denote the HR images as $I^{HR}$ and the real-world LR images as $I^{LR}_{real}$.
In the first stage, the proposed UBCDTN aims to transfer the domain of artificially degraded images to the real-world domain, which ensures a realistic LR pattern in the generated LR images. As shown in Fig. 1, the red path indicates the forward-cycle module and the blue path represents the backward-cycle module. The forward-cycle module comprises the generators $G_{A}$ and $G_{B}$, the discriminator $D_{B}$, and the feature extractor $FE_{A}$.
We first design an Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN). The specific network architectures of the generators, discriminators, and feature extractors are presented in a later section.
The UBCDTN simultaneously trains two generators, which translate in both directions and are inverses of each other. We involve adversarial learning in both modules, where $D_{B}$ serves as the discriminator for the forward cycle and $D_{A}$ for the backward cycle.
The pipeline of the forward-cycle module is shown as the red path of the UBCDTN in Figure 1. The forward-cycle module contains a generator $G_{A}$, which translates the artificially degraded LR image $I^{LR}_{degraded}$ to the real-like LR image $\widehat{I}^{LR}_{real}$, and a generator $G_{B}$, which maps it back so that the input can be reconstructed:
$\widehat{I}_{recon}^{LR} = G_{B}(G_{A}((I^{LR}_{degraded})_{i}))$
$L^{cyc}_{G_{B}}(G_{A},G_{B},I^{LR}_{degraded}) = \frac{1}{N}\sum_{i}^{N}||(\widehat{I}^{LR}_{recon})_{i} - (I^{LR}_{degraded})_{i}||_{1}$
where $\widehat{I}_{recon}^{LR}$ represents the reconstructed LR images generated by $G_{B}$. As the cycle-consistency loss above shows, with the help of this constraint, $G_{B}$ is encouraged to map the output of $G_{A}$ back to the original degraded input.
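To make the cycle constraint concrete, the following minimal PyTorch sketch implements $L^{cyc}_{G_{B}}$; the module and variable names are illustrative and not taken from a released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the forward cycle-consistency loss L^cyc_{G_B}, assuming
# G_A (degraded LR -> real-like LR) and G_B (real-like LR -> degraded LR)
# are nn.Module generators.
l1 = nn.L1Loss()  # averaged ||.||_1 over the batch

def forward_cycle_loss(G_A: nn.Module, G_B: nn.Module,
                       lr_degraded: torch.Tensor) -> torch.Tensor:
    lr_recon = G_B(G_A(lr_degraded))   # \hat{I}^{LR}_{recon}
    return l1(lr_recon, lr_degraded)   # mean absolute reconstruction error
```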
$X_{real} = (I^{LR}_{real},\widehat{I}_{real}^{LR})$
$X_{fake} = (\widehat{I}_{real}^{LR},I^{LR}_{real})$
where $X_{real}$ and $X_{fake}$ denote the input tuples fed to the discriminator $D_{B}$, pairing the real-world LR image $I^{LR}_{real}$ with the generated real-like LR image $\widehat{I}^{LR}_{real}$ in opposite orders. The adversarial losses are then defined as:
$L^{adv}_{G_{A}}(G_{A},D_{B},I^{LR}_{real},\widehat{I}_{real}^{LR}) = -\mathbb{E}_{I^{LR}_{real}\sim p(I^{LR}_{real})}[\log(1-D_{B}(X_{real}))] - \mathbb{E}_{\widehat{I}_{real}^{LR}\sim p(\widehat{I}_{real}^{LR})}[\log(D_{B}(X_{fake}))]$
$L^{adv}_{D_{B}}(G_{A},D_{B},I^{LR}_{real},\widehat{I}_{real}^{LR}) = -\mathbb{E}_{I^{LR}_{real}\sim p(I^{LR}_{real})}[\log(D_{B}(X_{real}))] - \mathbb{E}_{\widehat{I}_{real}^{LR}\sim p(\widehat{I}_{real}^{LR})}[\log(1-D_{B}(X_{fake}))]$
where $I^{LR}_{real}\sim p(I^{LR}_{real})$ and $\widehat{I}^{LR}_{real}\sim p(\widehat{I}^{LR}_{real})$ indicate the distributions of the real image $I^{LR}_{real}$ and the generated image $\widehat{I}^{LR}_{real}$, respectively.
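A hedged sketch of the two adversarial terms above is given below. How the tuples are presented to $D_{B}$ is not specified in this section, so channel-wise concatenation and a sigmoid-output discriminator are assumptions.

```python
import torch

# Hedged sketch of the forward adversarial losses. The (a, b) tuples are
# modeled as channel-wise concatenation (an assumption), and D_B is assumed
# to end in a sigmoid so its output lies in (0, 1).
def make_tuples(lr_real, lr_fake):
    x_real = torch.cat([lr_real, lr_fake], dim=1)  # X_real = (I_real, I_fake)
    x_fake = torch.cat([lr_fake, lr_real], dim=1)  # X_fake = (I_fake, I_real)
    return x_real, x_fake

def adv_loss_G_A(D_B, lr_real, lr_fake, eps=1e-8):
    x_real, x_fake = make_tuples(lr_real, lr_fake)
    # -E[log(1 - D_B(X_real))] - E[log D_B(X_fake)]
    return (-torch.log(1.0 - D_B(x_real) + eps).mean()
            - torch.log(D_B(x_fake) + eps).mean())

def adv_loss_D_B(D_B, lr_real, lr_fake, eps=1e-8):
    # detach the generated image so only D_B is updated by this loss
    x_real, x_fake = make_tuples(lr_real, lr_fake.detach())
    # -E[log D_B(X_real)] - E[log(1 - D_B(X_fake))]
    return (-torch.log(D_B(x_real) + eps).mean()
            - torch.log(1.0 - D_B(x_fake) + eps).mean())
```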
$\widehat{I}_{idt}^{LR} = G_{B}((I^{LR}_{degraded})_{i})$
$L^{idt}_{degraded\_LR}(G_{B},I^{LR}_{degraded}) = \frac{1}{N}\sum_{i}^{N}||(\widehat{I}^{LR}_{idt})_{i} - (I^{LR}_{degraded})_{i}||_{1}$
Moreover, in order to minimize the perceptual divergence between $\widehat{I}^{LR}_{recon}$ and $I^{LR}_{degraded}$, we utilize the feature extractor $FE_{A}$ to compute a perceptual loss:
$L^{percep}_{FE_{A}}(FE_{A},G_{A},G_{B},I^{LR}_{degraded}) = \frac{1}{N}\sum_{i}^{N}||FE_{q,r}(G_{B}(G_{A}((I^{LR}_{degraded})_{i}))) - FE_{q,r}((I^{LR}_{degraded})_{i})||_{2}$
where $FE_{q,r}$ denotes the feature maps extracted by the feature extractor $FE_{A}$.
$L^{Forward}_{total}(G_{A},G_{B},D_{B},FE_{A},I^{LR}_{degraded},I^{LR}_{real}) = \omega_{1}L^{adv}_{G_{A}}+\omega_{2}L^{cyc}_{G_{B}}+\omega_{3}L^{idt}_{degraded\_LR}+\omega_{4}L^{percep}_{FE_{A}}$
where the hyper-parameters $\omega_{1}$, $\omega_{2}$, $\omega_{3}$, and $\omega_{4}$ balance the relative importance of the adversarial, cycle-consistency, identity, and perceptual terms.
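The sketch below illustrates the perceptual term and the weighted forward objective; using frozen VGG19 features for $FE_{q,r}$ and the particular weight values are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Sketch of the perceptual term and the weighted forward objective. Using
# frozen VGG19 features as the extractor FE_{q,r} is an assumption.
feature_extractor = nn.Sequential(*list(vgg19(pretrained=True).features)[:35]).eval()
for p in feature_extractor.parameters():
    p.requires_grad = False  # FE_A stays fixed during training

def percep_loss(G_A, G_B, lr_degraded):
    recon = G_B(G_A(lr_degraded))
    return nn.functional.mse_loss(feature_extractor(recon),
                                  feature_extractor(lr_degraded))  # ||.||_2

def forward_total(l_adv, l_cyc, l_idt, l_percep,
                  w=(1.0, 10.0, 5.0, 1.0)):  # omega_1..omega_4: placeholders
    return w[0] * l_adv + w[1] * l_cyc + w[2] * l_idt + w[3] * l_percep
```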
To transfer images from the target domain back to the source domain, i.e., $I^{LR}_{real} \rightarrow I^{LR}_{syn}$, we specifically construct the backward-cycle module, in which the generator $G_{B}$ maps real-world LR images to the synthetic degraded domain and $G_{A}$ maps them back:
$I_{recon}^{LR} = G_{A}(G_{B}((I^{LR}_{real})_{i}))$
$L^{cyc}_{G_{A}}(G_{A},G_{B},I^{LR}_{real}) = \frac{1}{N}\sum_{i}^{N}||(I^{LR}_{recon})_{i} - (I^{LR}_{real})_{i}||_{1}$
where $I^{LR}_{recon}$ represents the reconstructed LR images generated by $G_{A}$. Analogously to the forward module, the discriminator $D_{A}$ receives the following input tuples:
$X_{real} = (I^{LR}_{degraded},I_{syn}^{LR})$

$X_{fake} = (I_{syn}^{LR},I^{LR}_{degraded})$

where $X_{real}$ and $X_{fake}$ denote the input tuples fed to the discriminator $D_{A}$, mirroring the tuple construction of the forward-cycle module. The backward adversarial losses are defined as:
$L^{adv}_{G_{B}}(G_{B},D_{A},I^{LR}_{real},I^{LR}_{syn}) = -\mathbb{E}_{I^{LR}_{degraded}\sim p(I^{LR}_{degraded})}[\log(1-D_{A}(X_{real}))] - \mathbb{E}_{I_{syn}^{LR}\sim p(I_{syn}^{LR})}[\log(D_{A}(X_{fake}))]$
$L^{adv}_{D_{A}}(G_{B},D_{A},I^{LR}_{real},I^{LR}_{syn}) = -\mathbb{E}_{I^{LR}_{degraded}\sim p(I^{LR}_{degraded})}[\log(D_{A}(X_{real}))] - \mathbb{E}_{I_{syn}^{LR}\sim p(I_{syn}^{LR})}[\log(1-D_{A}(X_{fake}))]$
where $I^{LR}_{degraded}\sim p(I^{LR}_{degraded})$ and $I_{syn}^{LR}\sim p(I_{syn}^{LR})$ indicate the distributions of the real image $I^{LR}_{degraded}$ and the fake image $I^{LR}_{syn}$, respectively. According to the two equations above, $G_{B}$ and $D_{A}$ are optimized against each other in an adversarial manner.
$I_{idt}^{LR} = G_{A}((I^{LR}_{real})_{i})$
$L^{idt}_{real\_LR}(G_{A},I^{LR}_{real}) = \frac{1}{N}\sum_{i}^{N}||(I^{LR}_{idt})_{i} - (I^{LR}_{real})_{i}||_{1}$
Moreover, the backward perceptual loss $L^{percep}_{FE_{B}}$ is calculated to recover visually pleasing details. We utilize the feature extractor $FE_{B}$ as follows:
$L^{percep}_{FE_{B}}(FE_{B},G_{A},G_{B},I^{LR}_{real}) = \frac{1}{N}\sum_{i}^{N}||FE_{q,r}(G_{A}(G_{B}((I^{LR}_{real})_{i}))) - FE_{q,r}((I^{LR}_{real})_{i})||_{2}$
where $FE_{q,r}$ denotes the feature maps extracted by the feature extractor $FE_{B}$.
$L^{Backward}_{total}(G_{A},G_{B},D_{A},FE_{B},I^{LR}_{real},I^{LR}_{degraded}) = \lambda_{1}L^{adv}_{G_{B}}+\lambda_{2}L^{cyc}_{G_{A}}+\lambda_{3}L^{idt}_{real\_LR}+\lambda_{4}L^{percep}_{FE_{B}}$
where the hyper-parameters $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ balance the relative importance of each loss term.
The full optimization objective for the UBCDTN consists of all the losses presented in the sections above. It is the sum of the forward cycle module loss $L^{Forward}_{total}$ and the backward cycle module loss $L^{Backward}_{total}$ and can be represented as follows:
$L^{UBCDTN}_{total} = L^{Forward}_{total} + L^{Backward}_{total}$
Finally, we adopt the
In this section, we first present the architecture of
Figure 3. The architecture of
The architecture of
Figure 4. The architecture of discriminator
The architecture of
Figure 5. The architecture of
In the UBCDTN, we design two feature extractors, $FE_{A}$ and $FE_{B}$.
In this section, we demonstrate how to generate the desired
Figure 6. Red dotted rectangle: The architecture of the Generator. Blue dotted rectangle: The architecture of the Joint Discriminator.
As shown at the top of Figure 6, the generator $G_{SR}$ takes the real-like LR image as input and reconstructs the super-resolved output through a stack of Dense Nested Blocks (DNBs).
Figure 7. Top: The architecture of Dense Nested Block (DNB). It consists of multiple RIDBs. Bottom: The architecture of proposed Residual in Internal Dense Block (RIDB).
As mentioned above, the novel RIDB architecture is proposed for the generator and is used to form the DNB (as shown in Figure 7). The RIDB utilized in the SESRN is similar to the RIDB in SEGA-FURN. However, for the DNB, in order to handle the real-world problem, we enhance its feature extraction ability by increasing the number of RIDBs from 3 to 4 in each DNB, strengthening the flow of hierarchical features through deep DNBs. The details of the RIDB can be found in our previous work on SEGA-FURN. Overall, thanks to the DNB and RIDB, the generator of the SESRN is able to extract hierarchical features from the input LR image.
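A plausible PyTorch sketch of a DNB built from four RIDBs follows; since the exact RIDB layout is defined in SEGA-FURN and not reproduced here, the dense 3×3 convolutions, growth rate, and residual connections below are assumptions.

```python
import torch
import torch.nn as nn

# Plausible sketch of a DNB built from four RIDBs. The internal layout of
# the RIDB (dense convs, growth rate, residual scaling) is assumed here.
class RIDB(nn.Module):
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList()
        c = channels
        for _ in range(layers):          # densely connected internal convs
            self.convs.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1), nn.LeakyReLU(0.2, True)))
            c += growth
        self.fuse = nn.Conv2d(c, channels, 1)  # fuse concatenated features

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # local residual

class DNB(nn.Module):
    def __init__(self, channels=64, n_ridb=4):   # 4 RIDBs per DNB (was 3)
        super().__init__()
        self.body = nn.Sequential(*[RIDB(channels) for _ in range(n_ridb)])

    def forward(self, x):
        return x + self.body(x)                  # block-level residual
```

The 1×1 fusion convolution keeps the channel count constant, so blocks can be stacked to arbitrary depth while preserving the residual pathway.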
In the SESRN, we also introduce the Semantic Encoder to extract embedded semantics (as shown in Figure 2), which is used to project the input visual information (LR, HR) back to the latent space. The experimental results show that the semantic encoder plays an important role in the SESRN: it is able to capture the useful semantic attributes reflected by the input image, which benefits the discriminative process.
Classical GAN-based SR models such as URDGN \cite{URDGN}, SRGAN \cite{SRGAN}, and ESRGAN \cite{ESRGAN_Wang} lack the ability to invert visual image information (LR, HR) into a semantic latent representation \cite{BiGAN}, even though they are good at mapping latent representations to image data distributions. Thus, we argue that the critical missing property of these methods is that they exploit only visual information (LR, HR) as the input to the discriminative procedure, ignoring the high-level semantic information reflected by the latent representation. Previous GAN work \cite{BiGAN,ALI} has shown that embedded semantics learned from the data distribution helps the discriminator distinguish real from fake samples. Therefore, in the SESRN, we introduce the semantic encoder to inversely map the real-world image distributions (HR and LR images) back into the latent representation. As in SEGA-FURN, we name the semantic latent representation extracted by the semantic encoder embedded semantics; it is able to reflect image structures and attributes. The embedded semantics, along with the corresponding visual information (HR and LR images), is fed into the joint discriminator as a joint input tuple and can be seen as the "label" for the corresponding images. We utilize the VGG16 network pre-trained on ImageNet as the semantic encoder. To accommodate the different dimensions of HR and LR images, we adopt two semantic encoders with the same structure but different input dimensions (as shown in Figure 2) to obtain embedded semantics from different convolutional layers, respectively. The embedded semantics helps optimize the adversarial process between the generator and the joint discriminator, which drives the SESRN to reconstruct the details of the super-resolved image accurately.
As shown in Figure 6, the tuple incorporating both visual information and embedded semantics is fed into the joint discriminator as the input, where the Embedded Semantics-Level Discriminative Sub-Net (ESLDSN) identifies whether the embedded semantics comes from the HR images, while the Image-Level Discriminative Sub-Net distinguishes whether the input image comes from the HR dataset or the generator. Next, in the Fully Connected Module (FCM), we concatenate the two resulting vectors and predict the final probability. Thanks to this property, the joint discriminator is capable of learning the joint probability distribution of image data ($I^{HR}$, $I^{SR}$) and the corresponding embedded semantics.
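The following sketch illustrates this joint design; all layer sizes, the semantic embedding dimension, and the concatenation scheme are assumptions, since the exact configuration is not reproduced here.

```python
import torch
import torch.nn as nn

# Hedged sketch of the joint discriminator: an image-level sub-net and an
# embedded-semantics-level sub-net (ESLDSN) whose feature vectors are
# concatenated in a fully connected module (FCM). Sizes are assumptions.
class JointDiscriminator(nn.Module):
    def __init__(self, img_channels=3, sem_dim=512, feat_dim=256):
        super().__init__()
        self.image_net = nn.Sequential(              # Image-Level Sub-Net
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim))
        self.semantic_net = nn.Sequential(            # ESLDSN
            nn.Linear(sem_dim, feat_dim), nn.LeakyReLU(0.2, True))
        self.fcm = nn.Sequential(                     # FCM: joint decision
            nn.Linear(2 * feat_dim, 128), nn.LeakyReLU(0.2, True),
            nn.Linear(128, 1))

    def forward(self, image, semantics):
        joint = torch.cat([self.image_net(image),
                           self.semantic_net(semantics)], dim=1)
        return self.fcm(joint)    # raw score C(X); fed to the RaLS loss
```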
where $X_{real}$ and $X_{fake}$ denote the joint input tuples built from the HR image $I^{HR}$ and the super-resolved image $I^{SR}$ together with their embedded semantics. The relativistic average least-squares (RaLS) adversarial losses are then defined as:
$L_{D}^{RaLS}=\mathbb{E}_{I^{HR}\sim p(I^{HR})}[(\tilde{C}(X_{real})-1)^{2}]+\mathbb{E}_{I^{SR}\sim p(I^{SR})}[(\tilde{C}(X_{fake})+1)^{2}]$
$L_{G}^{RaLS}=\mathbb{E}_{I^{SR}\sim p(I^{SR})}[(\tilde{C}(X_{fake})-1)^{2}]+\mathbb{E}_{I^{HR}\sim p(I^{HR})}[(\tilde{C}(X_{real})+1)^{2}]$
where $\tilde{C}(X_{real}) = C(X_{real}) - \mathbb{E}[C(X_{fake})]$ and $\tilde{C}(X_{fake}) = C(X_{fake}) - \mathbb{E}[C(X_{real})]$ denote the relativistic average outputs, with $C$ the raw output of the joint discriminator.
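A compact sketch of the two RaLS terms, assuming $\tilde{C}$ is the relativistic average score defined above:

```python
import torch

# Sketch of the RaLS losses; c_real and c_fake are batches of raw
# joint-discriminator scores C(X_real) and C(X_fake).
def rals_d_loss(c_real: torch.Tensor, c_fake: torch.Tensor) -> torch.Tensor:
    c_real_rel = c_real - c_fake.mean()  # \tilde{C}(X_real)
    c_fake_rel = c_fake - c_real.mean()  # \tilde{C}(X_fake)
    return ((c_real_rel - 1) ** 2).mean() + ((c_fake_rel + 1) ** 2).mean()

def rals_g_loss(c_real: torch.Tensor, c_fake: torch.Tensor) -> torch.Tensor:
    c_real_rel = c_real - c_fake.mean()
    c_fake_rel = c_fake - c_real.mean()
    return ((c_fake_rel - 1) ** 2).mean() + ((c_real_rel + 1) ** 2).mean()
```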
In the SESRN, we further leverage the pre-trained VGG19 network as the content extractor for computing the content loss.
We introduce the content loss $L_{content}$, computed on these feature maps, as described below.
Content Loss
$I^{SR} = G_{SR}((\widehat{I}_{real}^{LR})_{i})$
where $G_{SR}$ denotes the generator of the SESRN and $(\widehat{I}^{LR}_{real})_{i}$ is the $i$-th approximated real-like LR image.
Pixel-wise Loss
The pixel-wise loss is widely applied to optimize image super-resolution tasks. In our method, we involve a pixel-wise loss to enforce the intensity similarity between the super-resolved image $I^{SR}$ and the ground-truth HR image $I^{HR}$:
$L_{pixel} = \frac{1}{N}\sum_{i}^{N}||G_{SR}((\widehat{I}^{LR}_{real})_{i}) - (I^{HR})_{i}||_{2}$

which, substituting $I^{SR} = G_{SR}(\widehat{I}^{LR}_{real})$, can be written equivalently as

$L_{pixel} = \frac{1}{N}\sum_{i}^{N}||(I^{SR})_{i} - (I^{HR})_{i}||_{2}$
where $(I^{SR})_{i}$ and $(I^{HR})_{i}$ denote the $i$-th super-resolved image and ground-truth HR image, and $N$ is the number of training samples.
Total Loss
Finally, we obtain the total loss function $L^{SESRN}_{total}$ for the SESRN as follows:
$L^{SESRN}_{total} = \lambda_{con}L_{content} + \lambda_{adv}L_{G}^{RaLS} + \lambda_{pixel}L_{pixel}$
where $\lambda_{con}$, $\lambda_{adv}$, and $\lambda_{pixel}$ are trade-off weights for the content, adversarial, and pixel-wise terms, respectively.
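Assembling the SESRN objective is then a weighted sum, as in this sketch; the weight values are not specified here, so the defaults below are placeholders.

```python
# Sketch of L^{SESRN}_{total}; lam_con, lam_adv, lam_pixel are placeholder
# values standing in for the unspecified trade-off weights.
def sesrn_total(l_content, l_g_rals, l_pixel,
                lam_con=1.0, lam_adv=5e-3, lam_pixel=1e-2):
    return lam_con * l_content + lam_adv * l_g_rals + lam_pixel * l_pixel
```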
Full objective loss for UBCDT-GAN
Finally, we formulate the full objective loss for the UBCDT-GAN, which is the combination of $L^{UBCDTN}_{total}$ and $L^{SESRN}_{total}$. Incorporating the losses defined previously, the final objective is:
$L^{UBCDT-GAN}_{total} = L^{UBCDTN}_{total} + L^{SESRN}_{total}$
The complete objective loss
In this section, we first present the datasets and details used for our experiments. Second, we evaluate the quantitative and qualitative performance of the proposed UBCDT-GAN by comparing with several state-of-the-art SISR methods.
Figure 8. The sample images of NTIRE_2020_T1 validation dataset. The top row presents HR images (256 × 256 pixels) and the bottom row shows corresponding LR images (64 × 64 pixels).
In the training stage, in order to enrich our training dataset, we conducted experiments on the DF2K dataset \cite{DIV2K, EDSR}, which merges the DIV2K and Flickr2K datasets. The DIV2K dataset contains 800 high-quality (2K resolution) images with a large diversity of content and was used for the NTIRE 2017 and NTIRE 2018 super-resolution challenges. The Flickr2K dataset includes 2650 2K images collected from the Flickr website. Specifically, for the LR images, we introduce the real-world LR images from the DIV2K NTIRE 2017 unknown-degradation 4× dataset, where all LR images are degraded with unknown degradation, resulting in sensor noise, compression artifacts, etc. The Flickr2K LR images come from the NTIRE 2020 Real World Track 1 training source dataset. All the LR images are corrupted with an unknown degradation kernel and downsampled 4× by an unknown operator so as to match real-world conditions. Since the goal of our method is to solve the unsupervised super-resolution problem without paired LR-HR images, we select the first 1725 images (numbers 1-1725) from the DF2K HR dataset as our HR training dataset, and the LR training dataset is formed by the other 1725 images (numbers 1726-3450) obtained from the DF2K real-world LR dataset. Overall, our method is trained on this unpaired real-world LR-HR dataset.
To evaluate the proposed method on real-world data, in the testing stage we use the validation dataset from the NTIRE 2020 Real World SR challenge Track 1. This dataset contains 100 testing LR images (scaling factor: 4×), where all LR images are processed with an unknown degradation operation to simulate realistic artifacts and natural characteristics. As shown in Figure 8, we present some sample images from the NTIRE_2020_T1 validation dataset. In order to compare the qualitative and quantitative results fairly, we use the same validation dataset for all experiments.
At the training stage, instead of randomly initializing model weights, we pre-train the UBCDTN and SESRN in the first and second steps, and then we jointly train the whole proposed method in an end-to-end manner. The training procedure is thus divided into three steps. We first train the UBCDTN with unpaired artificially degraded images $I^{LR}_{degraded}$ and real-world images $I^{LR}_{real}$, which aims to transfer the LR images from the artificially degraded LR domain to the real-world LR domain. Second, we pre-train the SESRN using the approximated real-like images $\widehat{I}^{LR}_{real}$ and their HR counterparts $I^{HR}$ to generate realistic super-resolved images $I^{SR}$. For pre-training both the UBCDTN and the SESRN, we use the same optimization strategy: the Adam optimizer \cite{Adam} is applied to train both networks with $\beta_{1}=0.9$.
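The three-step schedule can be summarized by the hedged sketch below; module, loss, and loader names are illustrative placeholders rather than released code, and the learning rate and $\beta_{2}$ are assumptions (only $\beta_{1}=0.9$ is stated above).

```python
import itertools
import torch

def pretrain_ubcdtn(G_A, G_B, loader, ubcdtn_loss, epochs=100):
    # Step 1: train the transfer network on unpaired degraded / real LR batches.
    opt = torch.optim.Adam(
        itertools.chain(G_A.parameters(), G_B.parameters()),
        lr=1e-4, betas=(0.9, 0.999))
    for _ in range(epochs):
        for lr_degraded, lr_real in loader:
            opt.zero_grad()
            ubcdtn_loss(lr_degraded, lr_real).backward()
            opt.step()

def pretrain_sesrn(G_SR, G_A, degrade, loader, sesrn_loss, epochs=100):
    # Step 2: train the SR network on (real-like LR, HR) pairs; the UBCDTN
    # generator G_A is frozen here via detach().
    opt = torch.optim.Adam(G_SR.parameters(), lr=1e-4, betas=(0.9, 0.999))
    for _ in range(epochs):
        for hr in loader:
            lr_real_like = G_A(degrade(hr)).detach()
            opt.zero_grad()
            sesrn_loss(G_SR(lr_real_like), hr).backward()
            opt.step()

# Step 3 fine-tunes both networks jointly end-to-end (discriminator updates
# and the joint loop are omitted for brevity).
```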
To quantitatively compare the performance of different methods, we utilize the mainstream distortion based metrics Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) to evaluate the quantitative results.
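For reference, a minimal evaluation sketch using scikit-image is shown below; whether the official protocol crops borders or evaluates on the Y channel is not stated here, so both metrics are computed on RGB directly.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(sr_images, hr_images):
    """Average PSNR (dB) and SSIM over paired lists of uint8 HxWx3 arrays."""
    psnrs, ssims = [], []
    for sr, hr in zip(sr_images, hr_images):
        psnrs.append(peak_signal_noise_ratio(hr, sr, data_range=255))
        ssims.append(structural_similarity(hr, sr, channel_axis=2,
                                           data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```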
We compare our method with other real-world super-resolution methods on the NTIRE 2020 Track 1 real-world dataset, both quantitatively and qualitatively. Figures 9 and 10 present the qualitative comparisons of the proposed method with other state-of-the-art methods. In Table 1 and Table 2, we further provide the quantitative evaluation of our method against the compared methods. We emphasize that the quantitative figures for the state-of-the-art methods are cited from their officially published papers. For the qualitative comparison, we directly download the released code and pre-trained models of the compared methods and then carefully evaluate them on the same testing dataset.
Table 1. Quantitative comparison on NTIRE 2020 Real World Super-Resolution Challenge Track 1 validation dataset of the proposed method against participating methods, in terms of average PSNR (dB) and SSIM for upscale factor 4×. The bold results indicate the best performance.
Table 2. Quantitative comparison on NTIRE 2020 Real World Super-Resolution Challenge Track 1 validation dataset of the proposed method against state-of-the-art methods. The bold results indicate the best performance.
To validate our proposed method, we use the image quality criteria PSNR and SSIM in the experiments. For a fair comparison, we evaluate all of the compared methods on the real-world images from the NTIRE 2020 Track 1 validation dataset, where all the validation images are corrupted with unknown degradation, resulting in sensor noise and image processing artifacts. The quantitative results of all methods are reported in Table 1 and Table 2. In Table 1, we show the quantitative comparison between our method and other methods participating in the NTIRE 2020 Challenge. In Table 2, we list the results obtained from several methods trained on bicubic images in a supervised manner. Specifically, as shown in Table 1, our method shows promising superiority, achieving the highest PSNR/SSIM values of 26.83dB/0.789. The BMIPL-UNIST-YH-1 Team places second best with 26.73dB/0.752, meaning our method improves PSNR/SSIM by 0.1dB/0.037 over theirs. Although the method provided by the BMIPL-UNIST-YH-1 Team involves a cycle mapping scheme, the same basic idea as ours, they only introduce a cycle constraint without additional constraints such as an adversarial loss and a perceptual loss, resulting in a weaker cycle learning mechanism. In addition, they simply use RCAN \cite{RCNN} as the super-resolution model to generate SR images. In short, benefiting from incorporating the UBCDTN and SESRN, our method has a powerful super-resolution ability to generate high-quality SR images, thereby achieving better quantitative results than their method. It is also noticeable that our method achieves a large performance improvement over the other participating methods.
As shown in Table 2, we further compare with widely referenced state-of-the-art methods trained on bicubic data. In this case, our method still achieves the best performance. In addition, it is noticeable that our method outperforms the second-best method SRFBN \cite{SRFBN} by a large margin of 1.46dB/0.147 in terms of PSNR/SSIM. EDSR \cite{EDSR} and SRFBN \cite{SRFBN} cannot handle real-world SR tasks well since they are trained merely on simply degraded images. Moreover, we found that ESRGAN \cite{ESRGAN_Wang} and SRGAN \cite{SRGAN} have the worst PSNR/SSIM performance among all compared methods. Besides, in Figures 9 and 10, the visual results of ESRGAN \cite{ESRGAN_Wang} and SRGAN \cite{SRGAN} also show over-smoothed textures and unrealistic artifacts. Interestingly, this phenomenon has also been reported in \cite{CinCGAN,ZSSR,USISResNet}. The underlying reason is that these methods ignore the domain distribution difference caused by bicubic degradation and only take the simple and clean LR images produced by bicubic downsampling as input during the training phase. This analysis demonstrates the necessity of real-world SR methods for practical conditions without clean LR images. Thanks to the proposed UBCDTN and SESRN, our method is able to solve the domain distribution shift problem when dealing with real-world LR images in real-world scenes. Overall, from the above analyses, it is clear that our method brings a substantial performance improvement over the compared methods, indicating its effectiveness.
The visual comparisons are provided in Figures 9 and 10. To comprehensively evaluate the performance of the proposed method, we compare with various SR methods: Bicubic, Nearest Neighbor, SRGAN \cite{SRGAN}, ESRGAN \cite{ESRGAN_Wang}, CycleGAN \cite{CycleGAN}, and ZSSR \cite{ZSSR}.
To be specific, we first compare our method with the traditional interpolation-based SR methods, Bicubic and Nearest Neighbor, which utilize mathematical techniques to recover the HR image from the LR image. Moreover, we also introduce two NTIRE 2020 Challenge baseline GAN-based SR methods, SRGAN and ESRGAN, which are designed for the supervised setting and trained on bicubically downsampled images. In addition, we further explore the latest representative unsupervised methods, CycleGAN and ZSSR. CycleGAN is an unpaired image translation method, which can translate LR images from the source domain to the target HR domain. ZSSR merely takes a single LR image as input during the training and testing stages and learns to exploit internal image information to reconstruct the given LR image.
Figure 9. Qualitative comparison of visual results with state-of-the-art methods on NTIRE 2020 Real World Track 1 images "0887", "0822", "0821". Our method produces photo-realistic results.
Figure 10. Qualitative comparison of visual results with state-of-the-art methods on NTIRE 2020 Real World Track 1 images "0891", "0820", "0892". Our method produces photo-realistic results.
As shown in Figures 9 and 10, we present several SR results on validation images from the NTIRE 2020 dataset. Since these LR images are degraded by unknown kernels to simulate real-world conditions, the LR images contain sensor noise and are severely blurry and unrealistic. As for the two traditional methods, it is obvious that the results of Bicubic and Nearest Neighbor lack high-frequency content, resulting in overly smooth edges and coarse textures. Regarding SRGAN and ESRGAN, these two methods remove undesirable noise slightly better than Bicubic and Nearest Neighbor. However, SRGAN still fails to alleviate the blurring on the lines and edges of the SR results, and the results of ESRGAN still suffer from apparent broken artifacts and dramatic degradation problems, which are unfaithful to human perception. As for the unsupervised methods CycleGAN and ZSSR, the SR results are only slightly improved and are still far from the ground truth. Although the SR images of CycleGAN present better shapes than the previously compared methods, the results still retain unnatural edges and distortions, leading to poor visual quality. Besides, the blind method ZSSR was also evaluated, but it fails to reduce visible corruption to a sufficient degree, since over-smoothed textures and noise-like characteristics still exist in the images.
Compared with the aforementioned methods, the SR results of our method outperform all others. It is noticeable that our method is able to produce visually pleasing SR images with sharper edges and finer textures. The traditional methods Bicubic and Nearest Neighbor have limited ability to deal with complex real-world SR problems. Our SR results are more realistic than those of SRGAN and ESRGAN, because these two methods are trained only on simply degraded data (e.g., bicubically downsampled images) without any of the complicated noise and artifacts of real-world images, while our method trains on approximated real-like LR images that share the characteristics of real-world LR images. The unsupervised method CycleGAN is less effective at super-resolving unclear LR images: although it involves a cycle translation model, it lacks a powerful super-resolution network such as the SESRN used in our method. The other unsupervised method ZSSR also does not achieve the expected results, since it does not take into account the domain gap between noise-free LR images and real-world images. In contrast, benefiting from the domain transfer network (UBCDTN), our method is able to successfully bridge the domain gap and produce real-like LR images comprising real-world patterns. Overall, the SR results verify the powerful unsupervised learning strategy used in the proposed method for super-resolving photo-realistic SR images.
In this section, we conduct the ablation study to further investigate the components of the proposed method and demonstrate the advantages of the UBCDT-GAN. The list of compared variants of our method is presented in Table 3. We provide visual SR results of different variants as shown in Figure 11 and 12. The quantitative comparison with the several variants is presented in Table 4.
Figure 11. Qualitative comparisons of different variants in our ablation study. The visual results on image "0829", "0896", "0824" from NTIRE 2020 Track 1 testing dataset with scale factor ×4. The best results are highlighted.
Figure 12. Qualitative comparisons of different variants in our ablation study. The visual results on image "0803", "0836", "0861" from NTIRE 2020 Track 1 testing dataset with scale factor ×4. The best results are highlighted.
Table 3. The compared variants of the proposed method in the ablation study and the descriptions of the proposed components. The tick indicates that this variant includes this component
Table 4. Quantitative results of ablation study with different variants on NTIRE 2020 validation T1 dataset, in terms of average PSNR (dB) and SSIM for upscale factor 4×. The bold results indicate the best performance.
Description of different variants of the proposed method
In the ablation studies, we design several variants consisting of different proposed components. Note that since the advantages of the components in the SESRN have been verified in SEGA-FURN, we focus on investigating the elements used in the UBCDTN. Thus, we adopt the SESRN as the baseline variant in the following experiments. To comply with the single-variable principle, we gradually add one component of the proposed method at a time to the baseline variant. We first describe the details of the designed variants, which are specified as follows:
- VariantA: VariantA is designed as the baseline variant, which only contains the SESRN. As shown in Table 3, VariantA can be considered as the ultimate proposed method with all UBCDTN components removed. In the following variants, we successively add each of the components to VariantA.
- VariantB: In VariantB, we introduce $G_{A}$ and $D_{B}$, while $G_{B}$, $D_{A}$, and both feature extractors $FE_{A}$, $FE_{B}$ are removed. Because $G_{A}$ and $D_{B}$ are essential components of the forward cycle module in UBCDTN, VariantB can be considered as the baseline model plus the forward cycle module of UBCDTN, with the backward cycle module removed.
- VariantC: Besides the baseline model, it consists of the two generators $G_{A}$, $G_{B}$ and the two feature extractors $FE_{A}$, $FE_{B}$ of UBCDTN, eliminating the discriminators $D_{A}$ and $D_{B}$ involved in UBCDTN.
- VariantD: It is constructed from the four components $G_{A}$, $G_{B}$, $D_{B}$, and $D_{A}$ of UBCDTN, while removing the feature extractors $FE_{A}$ and $FE_{B}$.
- VariantE (Proposed): VariantE represents the ultimate proposed method, comprising the baseline model and all components of UBCDTN.
Next, in order to verify the effectiveness of the designed variants and the proposed components, we present the following comparative analyses.
Effect of UBCDTN
This experiment is conducted with VariantA and VariantE. Because the UBCDTN is removed, VariantA is trained directly on bicubically downsampled LR images, while VariantE takes the real-like LR images obtained by the UBCDTN as its training LR inputs. By analyzing the performance gap between VariantA and VariantE, we can demonstrate the advantages originating from the UBCDTN. As shown in Figures 11 and 12, VariantA produces over-smoothed SR images missing high-frequency details, while the SR results of VariantE contain naturally desired edges and textures. In addition, as Table 4 shows, the quantitative results drop dramatically from 26.83dB/0.789 to 25.97dB/0.757 after removing the UBCDTN. The reason is that VariantA trains only on bicubic data, ignoring the domain distribution difference between bicubic data and real-world data when solving the real-world SR task. Incorporating the UBCDTN brings a noteworthy improvement in both qualitative and quantitative performance, which verifies that the UBCDTN plays an important role in the super-resolution procedure and demonstrates its effectiveness and necessity.
Effect of the Backward-Cycle Module

In this experiment, we compare VariantB and VariantE to verify the effect of the backward-cycle module, which is absent from VariantB.
Effect of the Discriminators $D_{A}$ and $D_{B}$

In this experiment, we aim to demonstrate the contribution of the adversarial discriminators $D_{A}$ and $D_{B}$ by comparing VariantC with VariantE.
Effect of the Feature Extractors $FE_{A}$ and $FE_{B}$

We introduce VariantD and VariantE in this experiment to verify the effectiveness of the feature extractors $FE_{A}$ and $FE_{B}$ and the associated perceptual losses.
Effect of the Ultimate Proposed Method
VariantE can be considered as the ultimate proposed method, which includes all of the proposed components. Compared with the other variants, the ultimate proposed method greatly improves the quantitative performance and clearly enhances the quality of the visual results. Thus, we can demonstrate the effectiveness of the proposed method as well as all of its components.
In this paper, we presented an unsupervised super-resolution method, UBCDT-GAN, for real-world scenarios, which neither involves any paired image data nor assumes a pre-defined degradation operation. The proposed method comprises two networks, the UBCDTN and the SESRN. First, the UBCDTN transfers an artificially degraded image to a real-like image with real-world artifacts and characteristics. Next, the SESRN reconstructs the approximated real-like LR image into a visually pleasing super-resolved image with realistic details and textures. Furthermore, we employed several objective losses (i.e., cycle-consistency loss, adversarial loss, identity loss, pixel-wise loss, and perceptual loss) in the super-resolution process to optimize the proposed method. Owing to the designed framework and the applied optimization constraints, the proposed UBCDT-GAN is able to improve real-world super-resolution performance. The quantitative and qualitative experiments on the NTIRE 2020 T1 real-world SR dataset validate the effectiveness of our method and show superior SR performance compared to existing state-of-the-art methods.