pnavarre/mgbpv2

In which part of the network did you train on all the SR factors?

Closed this issue · 7 comments

Hello,

Is it possible to share your training code? If not, could you please tell me which part of the network was used to train on all the different factors? For example, L1(SR2, HR2) + L1(SR4, HR4) + L1(SR8, HR8).

The second question: could you share your training strategy regarding the sampling of the different factors while training one network? Did you train twice with different sampling, or was the training done on only one downsampled version to estimate all the different factors?

I will refer to MGBPv2 as in https://arxiv.org/pdf/1909.12983.pdf (note that MGBPv1 was different and more complicated in this sense):
Q1. There is only one network, one input, and one output per patch. The input domain is upscaled with a bicubic upscaler to have the same resolution as the output domain. The loss function (e.g. eq.1 or eq.3 in the paper) downscales the output patch with different factors.
Q2. Based on the answer to Q1, the sampling strategy is straightforward and ignores the different downscaling factors in the loss function. It randomly crops patches of the same size from the input and output images.
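
To make Q1 concrete, here is a minimal sketch (not the authors' code) of a multi-scale L1 loss of the kind described above: one high-resolution output is downscaled with several factors and compared to the equally downscaled ground truth. The set of factors, the inclusion of a full-resolution term, and the use of `F.interpolate` as the downscaler are assumptions for illustration.

```python
import torch.nn.functional as F

def multiscale_l1_loss(sr, hr, factors=(2, 4, 8)):
    """Sketch of a multi-scale L1 loss: `sr` and `hr` are (N, C, H, W) tensors
    at the same high resolution; each term compares downscaled versions."""
    loss = F.l1_loss(sr, hr)  # full-resolution term (assumed)
    for s in factors:
        sr_s = F.interpolate(sr, scale_factor=1.0 / s, mode='bicubic', align_corners=False)
        hr_s = F.interpolate(hr, scale_factor=1.0 / s, mode='bicubic', align_corners=False)
        loss = loss + F.l1_loss(sr_s, hr_s)
    return loss
```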

Thanks for your answer. I think this is what was not clear in MGBPv1 and MGBPv2.
In MGBPv1 the loss was on all the upscaling factors, such as L1(SR2, HR2) + L1(SR4, HR4) + L1(SR8, HR8), because each set of latent features was converted to 3 channels before entering the second level. But in MGBPv2 this part wasn't clear, so the loss function in eq.1 and eq.2 was understood as a loss on the upsampled factors.

I think what is missing here and in MGBPv1 is the loss on the downsampled versions. On which part of the network was the loss applied? And what was the downsampling procedure? Is it one of the downsampling modules, or did you use a bicubic downsampler?

My second question is moot for this version, since you have only a single input and a single output, but can you please answer the same question for MGBPv1?

For MGBPv2 the loss is applied to the only output of the network, always at high resolution. We used a bicubic downsampler module (implemented using a single convolutional layer). So, an output patch gets downscaled and then compared to the downscaled original patch.
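
As an illustration only (not code from this repository), a fixed bicubic downsampler can be written as a single depthwise strided convolution with a non-learnable Keys (a = -0.5) kernel; the kernel size and padding chosen here are my assumptions:

```python
import torch
import torch.nn as nn

def bicubic_kernel_1d(scale):
    """Keys bicubic kernel (a = -0.5) sampled on a grid wide enough for `scale`."""
    a = -0.5
    size = 4 * scale                        # support of 2 on each side, in LR units
    x = (torch.arange(size, dtype=torch.float32) - (size - 1) / 2) / scale
    ax = x.abs()
    k = torch.where(ax <= 1,
                    (a + 2) * ax ** 3 - (a + 3) * ax ** 2 + 1,
                    torch.where(ax < 2,
                                a * ax ** 3 - 5 * a * ax ** 2 + 8 * a * ax - 4 * a,
                                torch.zeros_like(ax)))
    return k / k.sum()

class BicubicDownsample(nn.Module):
    """Fixed (non-learnable) bicubic downsampler as one depthwise strided conv."""
    def __init__(self, scale=2, channels=3):
        super().__init__()
        k1d = bicubic_kernel_1d(scale)
        k2d = k1d[:, None] * k1d[None, :]   # separable 2-D kernel
        size = k2d.shape[0]
        self.conv = nn.Conv2d(channels, channels, size, stride=scale,
                              padding=(size - scale) // 2, groups=channels, bias=False)
        with torch.no_grad():
            self.conv.weight.copy_(k2d.expand(channels, 1, size, size))
        self.conv.weight.requires_grad_(False)  # fixed parameters, not trained

    def forward(self, x):
        # For H, W divisible by `scale`, returns a (H/scale) x (W/scale) image.
        return self.conv(x)
```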

In MGBPv1 the network has one input and several outputs (2x, 4x, 8x, etc.). Here, we closely followed the training strategy and loss function of MSLapSR. Sampling is more challenging: for one input patch, we need to find the corresponding patches at 2x, 4x, 8x, which requires careful handling of the cropping coordinates. Alternatively, you can just crop the high-resolution patch from the original image, downscale it, and then compare it to the corresponding output of the network. The latter might lead to border issues.

Thanks again for your explanation.
Does this mean that you built a sequence of bicubic downsampler modules, or just one that downsamples by 16x? Was the module trainable during training, or were its parameters fixed?

The latter might lead to border issues.

This is very true. But I'm sorry, I think I still can't get the way you did it. Let's assume you have HR, so you downsample by x2 and x4; this gives you LR_2 and LR_4.
If we upscale LR_4 by x2 to something, let's call it intermediate, what is the GT for this output? Also, downsampling HR -> LR_2 -> LR_4 is not the same as HR -> LR_2 and HR -> LR_4.
I can't find an explanation for this procedure. Could you please help me here?

Yes, you need a series of downscaling modules with fixed parameters (not learnable).
Again, here we refer to MGBPv1 (one input and several outputs), not MGBPv2. Border issues come from the fact that downscaling a complete HR image and then cropping it is not the same (in general) as cropping the HR image and then downscaling it. So, here the correct way is to downscale first and crop second. For that, I would recommend writing a tailor-made sampler module.
The GT for the intermediate outputs is always a downscaled version of the original HR image (direct downscaling, not two-step downscaling, as you correctly pointed out), assuming a standard downscaling degradation (typically bicubic).
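
A minimal sketch of such a "downscale first, crop second" sampler, written from the description above rather than taken from the repository; the function name, patch size, and the use of `F.interpolate` as the downscaler are illustrative assumptions:

```python
import random

import torch.nn.functional as F

def sample_multiscale_patches(hr, patch_lr=32, factors=(1, 2, 4, 8)):
    """Crop corresponding patches from direct downscales of an HR image.

    `hr` is a (C, H, W) tensor whose height and width are divisible by
    max(factors). Returns {factor: patch}, where every patch covers the same
    image region; factor 1 is the HR patch itself.
    """
    _, h, w = hr.shape
    fmax = max(factors)
    # Pick the crop on the coarsest grid so every factor gets integer coordinates.
    top = random.randrange(h // fmax - patch_lr + 1) * fmax
    left = random.randrange(w // fmax - patch_lr + 1) * fmax
    size_hr = patch_lr * fmax

    patches = {}
    for f in factors:
        if f == 1:
            img = hr
        else:
            # Direct HR -> 1/f downscale (one step, not a chain), as stated above.
            img = F.interpolate(hr.unsqueeze(0), scale_factor=1.0 / f,
                                mode='bicubic', align_corners=False).squeeze(0)
        patches[f] = img[:, top // f:(top + size_hr) // f,
                         left // f:(left + size_hr) // f]
    return patches
```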

I didn't get the idea of down -> crop vs. crop -> down.
I'll use notation to make it simpler.
We have a model to upsample by x2 and by x4. It consists of M1 for x2 and M2, also x2; the output of M1 is the input of M2, which makes the final output x4.

We have an input image (HR); we create LR_2 using HR -> LR_2 and LR_4 using HR -> LR_4.

The training is the following:
(1) LR_4 -> M1 -> intermediate_SR_2
(2) intermediate_SR_2 -> M2 -> SR_4
The loss is:
L = L(intermediate_SR_2, LR_2) + L(SR_4, HR)
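
Spelled out as code, my understanding of this training step is the following sketch (M1 and M2 are placeholders for the two x2 stages; this is not the authors' code):

```python
import torch.nn.functional as F

def training_step(M1, M2, lr_4, lr_2, hr):
    """lr_2 and lr_4 are direct bicubic downscales of hr (factors x2 and x4)."""
    intermediate_sr_2 = M1(lr_4)           # (1) x4-downscaled input -> x2 scale
    sr_4 = M2(intermediate_sr_2)           # (2) x2 scale -> full resolution
    loss = F.l1_loss(intermediate_sr_2, lr_2) + F.l1_loss(sr_4, hr)
    return loss
```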

Here I see the problem: we are training M1 to take an image downscaled by x4 and output one downscaled by x2. Then this unnatural intermediate is taken as input to produce the SR result.
There is a missing link here between LR_2 -> HR; instead, we are doing intermediate_SR_2 -> HR. What I mean is that we are not training on factor x2: the GT for factor x2 is an unnatural (downsampled) image.

Could you please share the data loader where you assemble the image pairs, and the bicubic downsampling convolution kernel?