Real-World Image Super Resolution via Unsupervised Bi-directional Cycle Domain Transfer Learning-Based Generative Adversarial Network
Important Disclaimer Note
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Deep convolutional neural networks have exhibited impressive performance on image super-resolution by reconstructing a high-resolution (HR) image from a low-resolution (LR) image. However, most state-of-the-art methods rely heavily on two limiting assumptions: the training LR and HR images are paired, and the LR images are artificially produced by a known degradation kernel (i.e., bicubic downsampling), so that the networks can be trained in a fully supervised fashion. As a result, existing methods fail on real-world super-resolution tasks, since paired LR and HR images are typically unavailable in real-world scenes and the images are degraded by complicated and unknown kernels. To remove these restrictions, in this paper we propose the Unsupervised Bi-directional Cycle Domain Transfer Learning-based Generative Adversarial Network (UBCDT-GAN), which is able to super-resolve an HR image from a real-world LR image with complex and inevitable sensor noise in an unsupervised manner. Our proposed method consists of an Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN) and a Semantic Encoder guided Super Resolution Network (SESRN). First, the UBCDTN produces an approximated real-like LR image by transferring the LR image from the artificially degraded domain to the real-world LR image domain with natural characteristics. Second, the SESRN takes the approximated real-like LR image as input and super-resolves it to a photo-realistic HR image. Extensive experiments on unpaired real-world image benchmark datasets demonstrate that the proposed method achieves promising performance compared to state-of-the-art methods. The overview of our method is shown in Fig. 1.
Figure 1. The overview of the proposed UBCDT-GAN: In the first stage (left), the green dotted rectangle represents the Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN). The red path indicates the forward cycle module, given the input HR image
The main contributions of our proposed method can be summarized as follows:
- We propose a novel bi-directional cycle domain transfer network, UBCDTN. Following the domain transfer learning scheme, the designed bi-directional cycle architecture is able to eliminate the domain gap between generated real-like LR images and real-world LR images in an unsupervised manner.
- We further impose auxiliary constraints on the UBCDTN by incorporating an adversarial loss, an identity loss, and a perceptual loss, which guarantee that the real-like LR images share the same style as real-world images.
- We adopt the previously proposed SESRN as a deep super-resolution network to generate visually pleasing SR results under supervised learning settings.
- Benefiting from the collaborative training strategy, the proposed UBCDT-GAN can be trained in an end-to-end fashion, which eases the entire training procedure and strengthens the robustness of the model.
In this section, we describe all the details of the proposed UBCDT-GAN, which mainly consists of two networks. The first network, the Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN), performs the domain translation operation between two different image domains. It contains a forward cycle module and a backward cycle module; the pipeline is shown in Fig. 1. The second network, the SESRN, consists of a Semantic Encoder (SE), a Joint Discriminator, and a generator $G_{SR}$.
Figure 2. The proposed SESRN and its components: Semantic Encoder, Generator, and Joint Discriminator.
In the unsupervised super resolution problem, given a set of HR images and unpaired real-world LR images with unknown degradation, we denote the HR images as $I^{HR}$ and the real-world LR images as $I^{LR}_{real}$.
In the first stage, the proposed UBCDTN aims to transfer the domain of artificially degraded images to the real-world domain, which ensures a realistic LR pattern in the generated LR images. As shown in Fig. 1, the red path indicates the forward-cycle module and the blue path represents the backward-cycle module. The forward-cycle module comprises the generators $G_{A}$ and $G_{B}$, the discriminator $D_{B}$, and the feature extractor $FE_{A}$.
We first design an Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN). The specific network architectures of the generators, discriminators, and feature extractors are presented in a later section.
The UBCDTN simultaneously trains two generators, which translate in both directions and are inverses of each other. We involve adversarial learning in both modules, where $D_{B}$ serves as the discriminator for the forward cycle and $D_{A}$ for the backward cycle.
The pipeline of the forward-cycle module is shown as the red path of the UBCDTN in Figure 1. The forward-cycle module contains a generator $G_{A}$, which translates the artificially degraded LR image $I^{LR}_{degraded}$ to the real-like LR image $\widehat{I}^{LR}_{real}$, and a generator $G_{B}$, which maps it back so that the input can be reconstructed:
$\widehat{I}_{recon}^{LR} = G_{B}(G_{A}((I^{LR}_{degraded})_{i}))$
$L^{cyc}_{G_{B}}(G_{A},G_{B},I^{LR}_{degraded}) = \frac{1}{N}\sum_{i}^{N}||(\widehat{I}^{LR}_{recon})_{i} - (I^{LR}_{degraded})_{i}||_{1}$
where $\widehat{I}_{recon}^{LR}$ represents the reconstructed LR images generated by $G_{B}$. As the cycle-consistency loss above shows, with the help of this constraint, $G_{B}$ is encouraged to map the output of $G_{A}$ back to the original degraded input.
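To make the cycle constraint concrete, the following minimal PyTorch sketch implements $L^{cyc}_{G_{B}}$; the module and variable names are illustrative and not taken from a released implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the forward cycle-consistency loss L^cyc_{G_B}, assuming
# G_A (degraded LR -> real-like LR) and G_B (real-like LR -> degraded LR)
# are nn.Module generators.
l1 = nn.L1Loss()  # averaged ||.||_1 over the batch

def forward_cycle_loss(G_A: nn.Module, G_B: nn.Module,
                       lr_degraded: torch.Tensor) -> torch.Tensor:
    lr_recon = G_B(G_A(lr_degraded))   # \hat{I}^{LR}_{recon}
    return l1(lr_recon, lr_degraded)   # mean absolute reconstruction error
```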
$X_{real} = (I^{LR}_{real},\widehat{I}_{real}^{LR})$
$X_{fake} = (\widehat{I}_{real}^{LR},I^{LR}_{real})$
where $X_{real}$ and $X_{fake}$ denote the input tuples fed to the discriminator $D_{B}$, pairing the real-world LR image $I^{LR}_{real}$ with the generated real-like LR image $\widehat{I}^{LR}_{real}$ in opposite orders. The adversarial losses are then defined as:
$L^{adv}_{G_{A}}(G_{A},D_{B},I^{LR}_{real},\widehat{I}_{real}^{LR}) = -\mathbb{E}_{I^{LR}_{real}\sim p(I^{LR}_{real})}[\log(1-D_{B}(X_{real}))] - \mathbb{E}_{\widehat{I}_{real}^{LR}\sim p(\widehat{I}_{real}^{LR})}[\log(D_{B}(X_{fake}))]$
$L^{adv}_{D_{B}}(G_{A},D_{B},I^{LR}_{real},\widehat{I}_{real}^{LR}) = -\mathbb{E}_{I^{LR}_{real}\sim p(I^{LR}_{real})}[\log(D_{B}(X_{real}))] - \mathbb{E}_{\widehat{I}_{real}^{LR}\sim p(\widehat{I}_{real}^{LR})}[\log(1-D_{B}(X_{fake}))]$
where $I^{LR}_{real}\sim p(I^{LR}_{real})$ and $\widehat{I}^{LR}_{real}\sim p(\widehat{I}^{LR}_{real})$ indicate the distributions of the real image $I^{LR}_{real}$ and the generated image $\widehat{I}^{LR}_{real}$, respectively.
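A hedged sketch of the two adversarial terms above is given below. How the tuples are presented to $D_{B}$ is not specified in this section, so channel-wise concatenation and a sigmoid-output discriminator are assumptions.

```python
import torch

# Hedged sketch of the forward adversarial losses. The (a, b) tuples are
# modeled as channel-wise concatenation (an assumption), and D_B is assumed
# to end in a sigmoid so its output lies in (0, 1).
def make_tuples(lr_real, lr_fake):
    x_real = torch.cat([lr_real, lr_fake], dim=1)  # X_real = (I_real, I_fake)
    x_fake = torch.cat([lr_fake, lr_real], dim=1)  # X_fake = (I_fake, I_real)
    return x_real, x_fake

def adv_loss_G_A(D_B, lr_real, lr_fake, eps=1e-8):
    x_real, x_fake = make_tuples(lr_real, lr_fake)
    # -E[log(1 - D_B(X_real))] - E[log D_B(X_fake)]
    return (-torch.log(1.0 - D_B(x_real) + eps).mean()
            - torch.log(D_B(x_fake) + eps).mean())

def adv_loss_D_B(D_B, lr_real, lr_fake, eps=1e-8):
    # detach the generated image so only D_B is updated by this loss
    x_real, x_fake = make_tuples(lr_real, lr_fake.detach())
    # -E[log D_B(X_real)] - E[log(1 - D_B(X_fake))]
    return (-torch.log(D_B(x_real) + eps).mean()
            - torch.log(1.0 - D_B(x_fake) + eps).mean())
```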
$\widehat{I}_{idt}^{LR} = G_{B}((I^{LR}_{degraded})_{i})$
$L^{idt}_{degraded\_LR}(G_{B},I^{LR}_{degraded}) = \frac{1}{N}\sum_{i}^{N}||(\widehat{I}^{LR}_{idt})_{i} - (I^{LR}_{degraded})_{i}||_{1}$
Moreover, in order to minimize the perceptual divergence between $\widehat{I}^{LR}_{recon}$ and $I^{LR}_{degraded}$, we utilize the feature extractor $FE_{A}$ to compute a perceptual loss:
$L^{percep}_{FE_{A}}(FE_{A},G_{A},G_{B},I^{LR}_{degraded}) = \frac{1}{N}\sum_{i}^{N}||FE_{q,r}(G_{B}(G_{A}((I^{LR}_{degraded})_{i}))) - FE_{q,r}((I^{LR}_{degraded})_{i})||_{2}$
where $FE_{q,r}$ denotes the feature maps extracted by the feature extractor $FE_{A}$.
$L^{Forward}_{total}(G_{A},G_{B},D_{B},FE_{A},I^{LR}_{degraded},I^{LR}_{real}) = \omega_{1}L^{adv}_{G_{A}}+\omega_{2}L^{cyc}_{G_{B}}+\omega_{3}L^{idt}_{degraded\_LR}+\omega_{4}L^{percep}_{FE_{A}}$
where the hyper-parameters $\omega_{1}$, $\omega_{2}$, $\omega_{3}$, and $\omega_{4}$ balance the relative importance of the adversarial, cycle-consistency, identity, and perceptual terms.
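The sketch below illustrates the perceptual term and the weighted forward objective; using frozen VGG19 features for $FE_{q,r}$ and the particular weight values are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Sketch of the perceptual term and the weighted forward objective. Using
# frozen VGG19 features as the extractor FE_{q,r} is an assumption.
feature_extractor = nn.Sequential(*list(vgg19(pretrained=True).features)[:35]).eval()
for p in feature_extractor.parameters():
    p.requires_grad = False  # FE_A stays fixed during training

def percep_loss(G_A, G_B, lr_degraded):
    recon = G_B(G_A(lr_degraded))
    return nn.functional.mse_loss(feature_extractor(recon),
                                  feature_extractor(lr_degraded))  # ||.||_2

def forward_total(l_adv, l_cyc, l_idt, l_percep,
                  w=(1.0, 10.0, 5.0, 1.0)):  # omega_1..omega_4: placeholders
    return w[0] * l_adv + w[1] * l_cyc + w[2] * l_idt + w[3] * l_percep
```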
To transfer images from the target domain back to the source domain, i.e., $I^{LR}_{real} \rightarrow I^{LR}_{syn}$, we specifically construct the backward-cycle module, in which the generator $G_{B}$ maps real-world LR images to the synthetic degraded domain and $G_{A}$ maps them back:
$I_{recon}^{LR} = G_{A}(G_{B}((I^{LR}_{real})_{i}))$
$L^{cyc}_{G_{A}}(G_{A},G_{B},I^{LR}_{real}) = \frac{1}{N}\sum_{i}^{N}||(I^{LR}_{recon})_{i} - (I^{LR}_{real})_{i}||_{1}$
where $I^{LR}_{recon}$ represents the reconstructed LR images generated by $G_{A}$. Analogously to the forward module, the discriminator $D_{A}$ receives the following input tuples:
$X_{real} = (I^{LR}_{degraded},I_{syn}^{LR})$

$X_{fake} = (I_{syn}^{LR},I^{LR}_{degraded})$

where $X_{real}$ and $X_{fake}$ denote the input tuples fed to the discriminator $D_{A}$, mirroring the tuple construction of the forward-cycle module. The backward adversarial losses are defined as:
$L^{adv}_{G_{B}}(G_{B},D_{A},I^{LR}_{real},I^{LR}_{syn}) = -\mathbb{E}_{I^{LR}_{degraded}\sim p(I^{LR}_{degraded})}[\log(1-D_{A}(X_{real}))] - \mathbb{E}_{I_{syn}^{LR}\sim p(I_{syn}^{LR})}[\log(D_{A}(X_{fake}))]$
$L^{adv}_{D_{A}}(G_{B},D_{A},I^{LR}_{real},I^{LR}_{syn}) = -\mathbb{E}_{I^{LR}_{degraded}\sim p(I^{LR}_{degraded})}[\log(D_{A}(X_{real}))] - \mathbb{E}_{I_{syn}^{LR}\sim p(I_{syn}^{LR})}[\log(1-D_{A}(X_{fake}))]$
where $I^{LR}_{degraded}\sim p(I^{LR}_{degraded})$ and $I_{syn}^{LR}\sim p(I_{syn}^{LR})$ indicate the distributions of the real image $I^{LR}_{degraded}$ and the fake image $I^{LR}_{syn}$, respectively. According to the two equations above, $G_{B}$ and $D_{A}$ are optimized against each other in an adversarial manner.
$I_{idt}^{LR} = G_{A}((I^{LR}_{real})_{i})$
$L^{idt}_{real\_LR}(G_{A},I^{LR}_{real}) = \frac{1}{N}\sum_{i}^{N}||(I^{LR}_{idt})_{i} - (I^{LR}_{real})_{i}||_{1}$
Moreover, the backward perceptual loss $L^{percep}_{FE_{B}}$ is calculated to recover visually pleasing details. We utilize the feature extractor $FE_{B}$ as follows:
$L^{percep}_{FE_{B}}(FE_{B},G_{A},G_{B},I^{LR}_{real}) = \frac{1}{N}\sum_{i}^{N}||FE_{q,r}(G_{A}(G_{B}((I^{LR}_{real})_{i}))) - FE_{q,r}((I^{LR}_{real})_{i})||_{2}$
where $FE_{q,r}$ denotes the feature maps extracted by the feature extractor $FE_{B}$.
$L^{Backward}_{total}(G_{A},G_{B},D_{A},FE_{B},I^{LR}_{real},I^{LR}_{degraded}) = \lambda_{1}L^{adv}_{G_{B}}+\lambda_{2}L^{cyc}_{G_{A}}+\lambda_{3}L^{idt}_{real\_LR}+\lambda_{4}L^{percep}_{FE_{B}}$
where the hyper-parameters $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ balance the relative importance of each loss term.
The full optimization objective for the UBCDTN consists of all the losses presented in the sections above. It is the sum of the forward cycle module loss $L^{Forward}_{total}$ and the backward cycle module loss $L^{Backward}_{total}$ and can be represented as follows:
$L^{UBCDTN}_{total} = L^{Forward}_{total} + L^{Backward}_{total}$
Finally, we adopt the
In this section, we first present the architecture of
Figure 3. The architecture of
The architecture of
Figure 4. The architecture of discriminator
The architecture of
Figure 5. The architecture of
In the UBCDTN, we design two feature extractors, $FE_{A}$ and $FE_{B}$.
In this section, we demonstrate how to generate the desired
Figure 6. Red dotted rectangle: The architecture of the Generator. Blue dotted rectangle: The architecture of the Joint Discriminator.
As shown at the top of Figure 6, the generator $G_{SR}$ takes the real-like LR image as input and reconstructs the super-resolved output through a stack of Dense Nested Blocks (DNBs).
Figure 7. Top: The architecture of Dense Nested Block (DNB). It consists of multiple RIDBs. Bottom: The architecture of proposed Residual in Internal Dense Block (RIDB).
As mentioned above, the novel RIDB architecture is proposed for the generator and is used to form the DNB (as shown in Figure 7). The RIDB utilized in the SESRN is similar to the RIDB in SEGA-FURN. However, for the DNB, in order to handle the real-world problem, we enhance its feature extraction ability by increasing the number of RIDBs from 3 to 4 in each DNB, strengthening the flow of hierarchical features through deep DNBs. The details of the RIDB can be found in our previous work on SEGA-FURN. Overall, thanks to the DNB and RIDB, the generator of the SESRN is able to extract hierarchical features from the input LR image.
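A plausible PyTorch sketch of a DNB built from four RIDBs follows; since the exact RIDB layout is defined in SEGA-FURN and not reproduced here, the dense 3×3 convolutions, growth rate, and residual connections below are assumptions.

```python
import torch
import torch.nn as nn

# Plausible sketch of a DNB built from four RIDBs. The internal layout of
# the RIDB (dense convs, growth rate, residual scaling) is assumed here.
class RIDB(nn.Module):
    def __init__(self, channels=64, growth=32, layers=4):
        super().__init__()
        self.convs = nn.ModuleList()
        c = channels
        for _ in range(layers):          # densely connected internal convs
            self.convs.append(nn.Sequential(
                nn.Conv2d(c, growth, 3, padding=1), nn.LeakyReLU(0.2, True)))
            c += growth
        self.fuse = nn.Conv2d(c, channels, 1)  # fuse concatenated features

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # local residual

class DNB(nn.Module):
    def __init__(self, channels=64, n_ridb=4):   # 4 RIDBs per DNB (was 3)
        super().__init__()
        self.body = nn.Sequential(*[RIDB(channels) for _ in range(n_ridb)])

    def forward(self, x):
        return x + self.body(x)                  # block-level residual
```

The 1×1 fusion convolution keeps the channel count constant, so blocks can be stacked to arbitrary depth while preserving the residual pathway.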
In the SESRN, we also introduce the Semantic Encoder to extract embedded semantics (as shown in Figure 2), which is used to project the input visual information (LR, HR) back to the latent space. The experimental results show that the semantic encoder plays an important role in the SESRN: it is able to capture the useful semantic attributes reflected by the input image, which benefits the discriminative process.
Classical GAN-based SR models such as URDGN \cite{URDGN}, SRGAN \cite{SRGAN}, and ESRGAN \cite{ESRGAN_Wang} lack the ability to invert visual image information (LR, HR) into a semantic latent representation \cite{BiGAN}, even though they are good at mapping latent representations to image data distributions. Thus, we argue that the critical missing property of these methods is that they exploit only visual information (LR, HR) as the input to the discriminative procedure, ignoring the high-level semantic information reflected by the latent representation. Previous GAN work \cite{BiGAN,ALI} has shown that embedded semantics learned from the data distribution helps the discriminator distinguish real from fake samples. Therefore, in the SESRN, we introduce the semantic encoder to inversely map the real-world image distributions (HR and LR images) back into the latent representation. As in SEGA-FURN, we name the semantic latent representation extracted by the semantic encoder embedded semantics; it is able to reflect image structures and attributes. The embedded semantics, along with the corresponding visual information (HR and LR images), is fed into the joint discriminator as a joint input tuple and can be seen as the "label" for the corresponding images. We utilize the VGG16 network pre-trained on ImageNet as the semantic encoder. To accommodate the different dimensions of HR and LR images, we adopt two semantic encoders with the same structure but different input dimensions (as shown in Figure 2) to obtain embedded semantics from different convolutional layers, respectively. The embedded semantics helps optimize the adversarial process between the generator and the joint discriminator, which drives the SESRN to reconstruct the details of the super-resolved image accurately.
As shown in Figure 6, the tuple incorporating both visual information and embedded semantics is fed into the joint discriminator as the input, where the Embedded Semantics-Level Discriminative Sub-Net (ESLDSN) identifies whether the embedded semantics comes from the HR images, while the Image-Level Discriminative Sub-Net distinguishes whether the input image comes from the HR dataset or the generator. Next, in the Fully Connected Module (FCM), we concatenate the two resulting vectors and predict the final probability. Thanks to this property, the joint discriminator is capable of learning the joint probability distribution of image data ($I^{HR}$, $I^{SR}$) and the corresponding embedded semantics.
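The following sketch illustrates this joint design; all layer sizes, the semantic embedding dimension, and the concatenation scheme are assumptions, since the exact configuration is not reproduced here.

```python
import torch
import torch.nn as nn

# Hedged sketch of the joint discriminator: an image-level sub-net and an
# embedded-semantics-level sub-net (ESLDSN) whose feature vectors are
# concatenated in a fully connected module (FCM). Sizes are assumptions.
class JointDiscriminator(nn.Module):
    def __init__(self, img_channels=3, sem_dim=512, feat_dim=256):
        super().__init__()
        self.image_net = nn.Sequential(              # Image-Level Sub-Net
            nn.Conv2d(img_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim))
        self.semantic_net = nn.Sequential(            # ESLDSN
            nn.Linear(sem_dim, feat_dim), nn.LeakyReLU(0.2, True))
        self.fcm = nn.Sequential(                     # FCM: joint decision
            nn.Linear(2 * feat_dim, 128), nn.LeakyReLU(0.2, True),
            nn.Linear(128, 1))

    def forward(self, image, semantics):
        joint = torch.cat([self.image_net(image),
                           self.semantic_net(semantics)], dim=1)
        return self.fcm(joint)    # raw score C(X); fed to the RaLS loss
```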
where $X_{real}$ and $X_{fake}$ denote the joint input tuples built from the HR image $I^{HR}$ and the super-resolved image $I^{SR}$ together with their embedded semantics. The relativistic average least-squares (RaLS) adversarial losses are then defined as:
$L_{D}^{RaLS}=\mathbb{E}_{I^{HR}\sim p(I^{HR})}[(\tilde{C}(X_{real})-1)^{2}]+\mathbb{E}_{I^{SR}\sim p(I^{SR})}[(\tilde{C}(X_{fake})+1)^{2}]$
$L_{G}^{RaLS}=\mathbb{E}_{I^{SR}\sim p(I^{SR})}[(\tilde{C}(X_{fake})-1)^{2}]+\mathbb{E}_{I^{HR}\sim p(I^{HR})}[(\tilde{C}(X_{real})+1)^{2}]$
where $\tilde{C}(X_{real}) = C(X_{real}) - \mathbb{E}[C(X_{fake})]$ and $\tilde{C}(X_{fake}) = C(X_{fake}) - \mathbb{E}[C(X_{real})]$ denote the relativistic average outputs, with $C$ the raw output of the joint discriminator.
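A compact sketch of the two RaLS terms, assuming $\tilde{C}$ is the relativistic average score defined above:

```python
import torch

# Sketch of the RaLS losses; c_real and c_fake are batches of raw
# joint-discriminator scores C(X_real) and C(X_fake).
def rals_d_loss(c_real: torch.Tensor, c_fake: torch.Tensor) -> torch.Tensor:
    c_real_rel = c_real - c_fake.mean()  # \tilde{C}(X_real)
    c_fake_rel = c_fake - c_real.mean()  # \tilde{C}(X_fake)
    return ((c_real_rel - 1) ** 2).mean() + ((c_fake_rel + 1) ** 2).mean()

def rals_g_loss(c_real: torch.Tensor, c_fake: torch.Tensor) -> torch.Tensor:
    c_real_rel = c_real - c_fake.mean()
    c_fake_rel = c_fake - c_real.mean()
    return ((c_fake_rel - 1) ** 2).mean() + ((c_real_rel + 1) ** 2).mean()
```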
In the SESRN, we further leverage the pre-trained VGG19 network as the content extractor for computing the content loss.
We introduce the content loss $L_{content}$, computed on these feature maps, as described below.
Content Loss
$I^{SR} = G_{SR}((\widehat{I}_{real}^{LR})_{i})$
where $G_{SR}$ denotes the generator of the SESRN and $(\widehat{I}^{LR}_{real})_{i}$ is the $i$-th approximated real-like LR image.
Pixel-wise Loss
The pixel-wise loss is widely applied to optimize image super-resolution tasks. In our method, we involve a pixel-wise loss to enforce the intensity similarity between the super-resolved image $I^{SR}$ and the ground-truth HR image $I^{HR}$:
$L_{pixel} = \frac{1}{N}\sum_{i}^{N}||G_{SR}((\widehat{I}^{LR}_{real})_{i}) - (I^{HR})_{i}||_{2}$

which, substituting $I^{SR} = G_{SR}(\widehat{I}^{LR}_{real})$, can be written equivalently as

$L_{pixel} = \frac{1}{N}\sum_{i}^{N}||(I^{SR})_{i} - (I^{HR})_{i}||_{2}$
where $(I^{SR})_{i}$ and $(I^{HR})_{i}$ denote the $i$-th super-resolved image and ground-truth HR image, and $N$ is the number of training samples.
Total Loss
Finally, we obtain the total loss function $L^{SESRN}_{total}$ for the SESRN as follows:
$L^{SESRN}_{total} = \lambda_{con}L_{content} + \lambda_{adv}L_{G}^{RaLS} + \lambda_{pixel}L_{pixel}$
where $\lambda_{con}$, $\lambda_{adv}$, and $\lambda_{pixel}$ are trade-off weights for the content, adversarial, and pixel-wise terms, respectively.
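Assembling the SESRN objective is then a weighted sum, as in this sketch; the weight values are not specified here, so the defaults below are placeholders.

```python
# Sketch of L^{SESRN}_{total}; lam_con, lam_adv, lam_pixel are placeholder
# values standing in for the unspecified trade-off weights.
def sesrn_total(l_content, l_g_rals, l_pixel,
                lam_con=1.0, lam_adv=5e-3, lam_pixel=1e-2):
    return lam_con * l_content + lam_adv * l_g_rals + lam_pixel * l_pixel
```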
Full objective loss for UBCDT-GAN
Finally, we formulate the full objective loss for the UBCDT-GAN, which is the combination of $L^{UBCDTN}_{total}$ and $L^{SESRN}_{total}$. Incorporating the losses defined previously, the final objective is:
$L^{UBCDT-GAN}_{total} = L^{UBCDTN}_{total} + L^{SESRN}_{total}$
The complete objective loss
In this section, we first present the datasets and details used for our experiments. Second, we evaluate the quantitative and qualitative performance of the proposed UBCDT-GAN by comparing with several state-of-the-art SISR methods.
Figure 8. The sample images of NTIRE_2020_T1 validation dataset. The top row presents HR images (256 × 256 pixels) and the bottom row shows corresponding LR images (64 × 64 pixels).
In the training stage, in order to enrich our training dataset, we conducted experiments on the DF2K dataset \cite{DIV2K, EDSR}, which merges the DIV2K and Flickr2K datasets. The DIV2K dataset contains 800 high-quality (2K resolution) images with a large diversity of content and was used for the NTIRE 2017 and NTIRE 2018 super-resolution challenges. The Flickr2K dataset includes 2650 2K images collected from the Flickr website. Specifically, for the LR images, we introduce the real-world LR images from the DIV2K NTIRE 2017 unknown-degradation 4× dataset, where all LR images are degraded with unknown degradation, resulting in sensor noise, compression artifacts, etc. The Flickr2K LR images come from the NTIRE 2020 Real World Track 1 training source dataset. All the LR images are corrupted with an unknown degradation kernel and downsampled 4× by an unknown operator so as to match real-world conditions. Since the goal of our method is to solve the unsupervised super-resolution problem without paired LR-HR images, we select the first 1725 images (numbers 1-1725) from the DF2K HR dataset as our HR training dataset, and the LR training dataset is formed by the other 1725 images (numbers 1726-3450) obtained from the DF2K real-world LR dataset. Overall, our method is trained on this unpaired real-world LR-HR dataset.
To evaluate the proposed method on real-world data, in the testing stage we use the validation dataset from the NTIRE 2020 Real World SR challenge Track 1. This dataset contains 100 testing LR images (scaling factor: 4×), where all LR images are processed with an unknown degradation operation to simulate realistic artifacts and natural characteristics. As shown in Figure 8, we present some sample images from the NTIRE_2020_T1 validation dataset. In order to compare the qualitative and quantitative results fairly, we use the same validation dataset for all experiments.
At the training stage, instead of randomly initializing model weights, we pre-train the UBCDTN and SESRN in the first and second steps, and then we jointly train the whole proposed method in an end-to-end manner. The training procedure is thus divided into three steps. We first train the UBCDTN with unpaired artificially degraded images $I^{LR}_{degraded}$ and real-world images $I^{LR}_{real}$, which aims to transfer the LR images from the artificially degraded LR domain to the real-world LR domain. Second, we pre-train the SESRN using the approximated real-like images $\widehat{I}^{LR}_{real}$ and their HR counterparts $I^{HR}$ to generate realistic super-resolved images $I^{SR}$. For pre-training both the UBCDTN and the SESRN, we use the same optimization strategy: the Adam optimizer \cite{Adam} is applied to train both networks with $\beta_{1}=0.9$.
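The three-step schedule can be summarized by the hedged sketch below; module, loss, and loader names are illustrative placeholders rather than released code, and the learning rate and $\beta_{2}$ are assumptions (only $\beta_{1}=0.9$ is stated above).

```python
import itertools
import torch

def pretrain_ubcdtn(G_A, G_B, loader, ubcdtn_loss, epochs=100):
    # Step 1: train the transfer network on unpaired degraded / real LR batches.
    opt = torch.optim.Adam(
        itertools.chain(G_A.parameters(), G_B.parameters()),
        lr=1e-4, betas=(0.9, 0.999))
    for _ in range(epochs):
        for lr_degraded, lr_real in loader:
            opt.zero_grad()
            ubcdtn_loss(lr_degraded, lr_real).backward()
            opt.step()

def pretrain_sesrn(G_SR, G_A, degrade, loader, sesrn_loss, epochs=100):
    # Step 2: train the SR network on (real-like LR, HR) pairs; the UBCDTN
    # generator G_A is frozen here via detach().
    opt = torch.optim.Adam(G_SR.parameters(), lr=1e-4, betas=(0.9, 0.999))
    for _ in range(epochs):
        for hr in loader:
            lr_real_like = G_A(degrade(hr)).detach()
            opt.zero_grad()
            sesrn_loss(G_SR(lr_real_like), hr).backward()
            opt.step()

# Step 3 fine-tunes both networks jointly end-to-end (discriminator updates
# and the joint loop are omitted for brevity).
```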
To quantitatively compare the performance of different methods, we utilize the mainstream distortion based metrics Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) to evaluate the quantitative results.
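For reference, a minimal evaluation sketch using scikit-image is shown below; whether the official protocol crops borders or evaluates on the Y channel is not stated here, so both metrics are computed on RGB directly.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(sr_images, hr_images):
    """Average PSNR (dB) and SSIM over paired lists of uint8 HxWx3 arrays."""
    psnrs, ssims = [], []
    for sr, hr in zip(sr_images, hr_images):
        psnrs.append(peak_signal_noise_ratio(hr, sr, data_range=255))
        ssims.append(structural_similarity(hr, sr, channel_axis=2,
                                           data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```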
We compare our method with other real-world super-resolution methods on the NTIRE 2020 Track 1 real-world dataset, both quantitatively and qualitatively. Figures 9 and 10 present the qualitative comparisons of the proposed method with other state-of-the-art methods. In Table 1 and Table 2, we further provide the quantitative evaluation of our method against the compared methods. We emphasize that the quantitative figures for the state-of-the-art methods are cited from their officially published papers. For the qualitative comparison, we directly download the released code and pre-trained models of the compared methods and then carefully evaluate them on the same testing dataset.
Table 1. Quantitative comparison on NTIRE 2020 Real World Super-Resolution Challenge Track 1 validation dataset of the proposed method against participating methods, in terms of average PSNR (dB) and SSIM for upscale factor 4×. The bold results indicate the best performance.
Table 2. Quantitative comparison on NTIRE 2020 Real World Super-Resolution Challenge Track 1 validation dataset of the proposed method against state-of-the-art methods. The bold results indicate the best performance.
To validate our proposed method, we use the image quality criteria PSNR and SSIM in the experiments. For a fair comparison, we evaluate all of the compared methods on the real-world images from the NTIRE 2020 Track 1 validation dataset, where all the validation images are corrupted with unknown degradation, resulting in sensor noise and image processing artifacts. The quantitative results of all methods are reported in Table 1 and Table 2. In Table 1, we show the quantitative comparison between our method and other methods participating in the NTIRE 2020 Challenge. In Table 2, we list the results obtained from several methods trained on bicubic images in a supervised manner. Specifically, as shown in Table 1, our method shows promising superiority, achieving the highest PSNR/SSIM values of 26.83dB/0.789. The BMIPL-UNIST-YH-1 Team places second best with 26.73dB/0.752, meaning our method improves PSNR/SSIM by 0.1dB/0.037 over theirs. Although the method provided by the BMIPL-UNIST-YH-1 Team involves a cycle mapping scheme, the same basic idea as ours, they only introduce a cycle constraint without additional constraints such as an adversarial loss and a perceptual loss, resulting in a weaker cycle learning mechanism. In addition, they simply use RCAN \cite{RCNN} as the super-resolution model to generate SR images. In short, benefiting from incorporating the UBCDTN and SESRN, our method has a powerful super-resolution ability to generate high-quality SR images, thereby achieving better quantitative results than their method. It is also noticeable that our method achieves a large performance improvement over the other participating methods.
As shown in Table 2, we further compare with widely referenced state-of-the-art methods trained on bicubic data. In this case, our method still achieves the best performance. In addition, it is noticeable that our method outperforms the second-best method SRFBN \cite{SRFBN} by a large margin of 1.46dB/0.147 in terms of PSNR/SSIM. EDSR \cite{EDSR} and SRFBN \cite{SRFBN} cannot handle real-world SR tasks well since they are trained merely on simply degraded images. Moreover, we found that ESRGAN \cite{ESRGAN_Wang} and SRGAN \cite{SRGAN} have the worst PSNR/SSIM performance among all compared methods. Besides, in Figures 9 and 10, the visual results of ESRGAN \cite{ESRGAN_Wang} and SRGAN \cite{SRGAN} also show over-smoothed textures and unrealistic artifacts. Interestingly, this phenomenon has also been reported in \cite{CinCGAN,ZSSR,USISResNet}. The underlying reason is that these methods ignore the domain distribution difference caused by bicubic degradation and only take the simple and clean LR images produced by bicubic downsampling as input during the training phase. This analysis demonstrates the necessity of real-world SR methods for practical conditions without clean LR images. Thanks to the proposed UBCDTN and SESRN, our method is able to solve the domain distribution shift problem when dealing with real-world LR images in real-world scenes. Overall, from the above analyses, it is clear that our method brings a substantial performance improvement over the compared methods, indicating its effectiveness.
The visual comparisons are provided in Figures 9 and 10. To comprehensively evaluate the performance of the proposed method, we compare with various SR methods: Bicubic, Nearest Neighbor, SRGAN \cite{SRGAN}, ESRGAN \cite{ESRGAN_Wang}, CycleGAN \cite{CycleGAN}, and ZSSR \cite{ZSSR}.
To be specific, we first compare our method with the traditional interpolation-based SR methods, Bicubic and Nearest Neighbor, which utilize mathematical techniques to recover the HR image from the LR image. Moreover, we also introduce two NTIRE 2020 Challenge baseline GAN-based SR methods, SRGAN and ESRGAN, which are designed for the supervised setting and trained on bicubically downsampled images. In addition, we further explore the latest representative unsupervised methods, CycleGAN and ZSSR. CycleGAN is an unpaired image translation method, which can translate LR images from the source domain to the target HR domain. ZSSR merely takes a single LR image as input during the training and testing stages and learns to exploit internal image information to reconstruct the given LR image.
Figure 9. Qualitative comparison of visual results with state-of-the-art methods on NTIRE 2020 Real World Track 1 images "0887", "0822", "0821". Our method produces photo-realistic results.
Figure 10. Qualitative comparison of visual results with state-of-the-art methods on NTIRE 2020 Real World Track 1 images "0891", "0820", "0892". Our method produces photo-realistic results.
As shown in Figures 9 and 10, we present several SR results on validation images from the NTIRE 2020 dataset. Since these LR images are degraded by unknown kernels to simulate real-world conditions, the LR images contain sensor noise and are severely blurry and unrealistic. As for the two traditional methods, it is obvious that the results of Bicubic and Nearest Neighbor lack high-frequency content, resulting in overly smooth edges and coarse textures. Regarding SRGAN and ESRGAN, these two methods remove undesirable noise slightly better than Bicubic and Nearest Neighbor. However, SRGAN still fails to alleviate the blurring on the lines and edges of the SR results, and the results of ESRGAN still suffer from apparent broken artifacts and dramatic degradation problems, which are unfaithful to human perception. As for the unsupervised methods CycleGAN and ZSSR, the SR results are only slightly improved and are still far from the ground truth. Although the SR images of CycleGAN present better shapes than the previously compared methods, the results still retain unnatural edges and distortions, leading to poor visual quality. Besides, the blind method ZSSR was also evaluated, but it fails to reduce visible corruption to a sufficient degree, since over-smoothed textures and noise-like characteristics still exist in the images.
Compared with the aforementioned methods, the SR results of our method outperform all others. It is noticeable that our method is able to produce visually pleasing SR images with sharper edges and finer textures. The traditional methods Bicubic and Nearest Neighbor have limited ability to deal with complex real-world SR problems. Our SR results are more realistic than those of SRGAN and ESRGAN, because these two methods are trained only on simply degraded data (e.g., bicubically downsampled images) without any of the complicated noise and artifacts of real-world images, while our method trains on approximated real-like LR images that share the characteristics of real-world LR images. The unsupervised method CycleGAN is less effective at super-resolving unclear LR images: although it involves a cycle translation model, it lacks a powerful super-resolution network such as the SESRN used in our method. The other unsupervised method ZSSR also does not achieve the expected results, since it does not take into account the domain gap between noise-free LR images and real-world images. In contrast, benefiting from the domain transfer network (UBCDTN), our method is able to successfully bridge the domain gap and produce real-like LR images comprising real-world patterns. Overall, the SR results verify the powerful unsupervised learning strategy used in the proposed method for super-resolving photo-realistic SR images.
In this section, we conduct the ablation study to further investigate the components of the proposed method and demonstrate the advantages of the UBCDT-GAN. The list of compared variants of our method is presented in Table 3. We provide visual SR results of different variants as shown in Figure 11 and 12. The quantitative comparison with the several variants is presented in Table 4.
Figure 11. Qualitative comparisons of different variants in our ablation study. The visual results on image "0829", "0896", "0824" from NTIRE 2020 Track 1 testing dataset with scale factor ×4. The best results are highlighted.
Figure 12. Qualitative comparisons of different variants in our ablation study. The visual results on image "0803", "0836", "0861" from NTIRE 2020 Track 1 testing dataset with scale factor ×4. The best results are highlighted.
Table 3. The compared variants of the proposed method in the ablation study and the descriptions of the proposed components. The tick indicates that this variant includes this component
Table 4. Quantitative results of ablation study with different variants on NTIRE 2020 validation T1 dataset, in terms of average PSNR (dB) and SSIM for upscale factor 4×. The bold results indicate the best performance.
Description of different variants of the proposed method
In the ablation studies, we design several variants consisting of different proposed components. Note that since the advantages of the components in the SESRN have been verified in SEGA-FURN, we focus on investigating the elements used in the UBCDTN. Thus, we adopt the SESRN as the baseline variant in the following experiments. To comply with the single-variable principle, we gradually add one component of the proposed method at a time to the baseline variant. We first describe the details of the designed variants, which are specified as follows:
- VariantA: VariantA is designed as the baseline variant, which only contains the SESRN. As shown in Table 3, VariantA can be considered as the ultimate proposed method with all UBCDTN components removed. In the following variants, we successively add each of the components to VariantA.
- VariantB: In VariantB, we introduce $G_{A}$ and $D_{B}$, while $G_{B}$, $D_{A}$, and both feature extractors $FE_{A}$, $FE_{B}$ are removed. Because $G_{A}$ and $D_{B}$ are essential components of the forward cycle module in UBCDTN, VariantB can be considered as the baseline model plus the forward cycle module of UBCDTN, with the backward cycle module removed.
- VariantC: Besides the baseline model, it consists of the two generators $G_{A}$, $G_{B}$ and the two feature extractors $FE_{A}$, $FE_{B}$ of UBCDTN, eliminating the discriminators $D_{A}$ and $D_{B}$ involved in UBCDTN.
- VariantD: It is constructed from the four components $G_{A}$, $G_{B}$, $D_{B}$, and $D_{A}$ of UBCDTN, while removing the feature extractors $FE_{A}$ and $FE_{B}$.
- VariantE (Proposed): VariantE represents the ultimate proposed method, comprising the baseline model and all components of UBCDTN.
Next, in order to verify the effectiveness of the designed variants and the proposed components, we present the following comparative analyses.
Effect of UBCDTN
This experiment is conducted with VariantA and VariantE. Because the UBCDTN is removed, VariantA is trained directly on bicubically downsampled LR images, while VariantE takes the real-like LR images obtained by the UBCDTN as its training LR inputs. By analyzing the performance gap between VariantA and VariantE, we can demonstrate the advantages originating from the UBCDTN. As shown in Figures 11 and 12, VariantA produces over-smoothed SR images missing high-frequency details, while the SR results of VariantE contain naturally desired edges and textures. In addition, as Table 4 shows, the quantitative results drop dramatically from 26.83dB/0.789 to 25.97dB/0.757 after removing the UBCDTN. The reason is that VariantA trains only on bicubic data, ignoring the domain distribution difference between bicubic data and real-world data when solving the real-world SR task. Incorporating the UBCDTN brings a noteworthy improvement in both qualitative and quantitative performance, which verifies that the UBCDTN plays an important role in the super-resolution procedure and demonstrates its effectiveness and necessity.
Effect of the Backward-Cycle Module

In this experiment, we compare VariantB and VariantE to verify the effect of the backward-cycle module, which is absent from VariantB.
Effect of the Discriminators $D_{A}$ and $D_{B}$

In this experiment, we aim to demonstrate the contribution of the adversarial discriminators $D_{A}$ and $D_{B}$ by comparing VariantC with VariantE.
Effect of the Feature Extractors $FE_{A}$ and $FE_{B}$

We introduce VariantD and VariantE in this experiment to verify the effectiveness of the feature extractors $FE_{A}$ and $FE_{B}$ and the associated perceptual losses.
Effect of the Ultimate Proposed Method
VariantE can be considered as the ultimate proposed method, which includes all of the proposed components. Compared with the other variants, the ultimate proposed method greatly improves the quantitative performance and clearly enhances the quality of the visual results. Thus, we can demonstrate the effectiveness of the proposed method as well as all of its components.
In this paper, we presented an unsupervised super-resolution method, UBCDT-GAN, for real-world scenarios, which neither involves any paired image data nor assumes a pre-defined degradation operation. The proposed method comprises two networks, the UBCDTN and the SESRN. First, the UBCDTN transfers an artificially degraded image to a real-like image with real-world artifacts and characteristics. Next, the SESRN reconstructs the approximated real-like LR image into a visually pleasing super-resolved image with realistic details and textures. Furthermore, we employed several objective losses (i.e., cycle-consistency loss, adversarial loss, identity loss, pixel-wise loss, and perceptual loss) in the super-resolution process to optimize the proposed method. Owing to the designed framework and the applied optimization constraints, the proposed UBCDT-GAN is able to improve real-world super-resolution performance. The quantitative and qualitative experiments on the NTIRE 2020 T1 real-world SR dataset validate the effectiveness of our method and show superior SR performance compared to existing state-of-the-art methods.