- For style, the earlier layers provide a more "localized" representation. This is the opposite of the content model, where the later layers capture a more "global" structure.
- The `features` portion of VGG19 contains the CNN layers
- Setting `requires_grad = False` keeps the parameters frozen during gradient descent and backpropagation
- Use the `torch.device` function to select the GPU
- Move the model onto the device
- `unsqueeze` adds an extra dimension (just like `np.expand_dims`), giving shape (batch size, channel, H, W)
- The image is resized so its largest dimension does not exceed a maximum (400 pixels in this case)
- The image is normalized with mean = 0.5 and std = 0.5
- Image is converted to a tensor
- Convert the image into 'RGB' format
- Load the images and convert them into tensors
- Pass in the shape parameter so that the two images end up the same size
- Tensor passed in is in shape (batch_size, color channel, H, W)
- Clone the tensor before converting it to a NumPy array
- Then squeeze out the batch_size dimension
- Transpose the array into shape (H, W, color channel) for matplotlib
- Denormalize the image
- Clip the image so that it lies in the 0 to 1 range
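The conversion steps above can be sketched as a small helper (the name `im_convert` is an assumption):

```python
import numpy as np

def im_convert(tensor):
    """Turn a (batch, C, H, W) tensor back into a displayable (H, W, C) array."""
    image = tensor.to("cpu").clone().detach()  # clone first, so the original tensor is untouched
    image = image.numpy().squeeze(0)           # squeeze out the batch_size dimension
    image = image.transpose(1, 2, 0)           # (C, H, W) -> (H, W, C) for matplotlib
    image = image * 0.5 + 0.5                  # denormalize (undo mean=0.5, std=0.5)
    return image.clip(0, 1)                    # clip into the [0, 1] range
```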
- Initializing the layers that will be used for content and style output
- Create a dictionary to store the output of the image at each layer
- Create a feature dictionary for content image
- Create a feature dictionary for style image
- Create a gram matrix function
- Loop over each output of the style features and create a gram matrix dictionary.
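A hedged sketch of the feature-extraction step described above. The index-to-name mapping below is the usual one for VGG19's `features` module, and the function name `get_features` is an assumption:

```python
# Hypothetical helper: collect the outputs of chosen VGG19 layers for one image.
def get_features(image, model, layers=None):
    if layers is None:
        layers = {'0': 'conv1_1', '5': 'conv2_1', '10': 'conv3_1',
                  '19': 'conv4_1', '21': 'conv4_2',   # conv4_2 -> content representation
                  '28': 'conv5_1'}
    features = {}
    x = image
    # Run the image through each module in order, keeping the chosen outputs
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features
```

Calling this once with the content image and once with the style image builds the two feature dictionaries; the gram matrix dictionary is then built by looping over the style features.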
Gram Matrix :
- Input shape: (C, H, W) for a single feature map
- It is reshaped into a matrix X of shape (H*W, C)
- Compute X^T X (multiply its transpose with it)
- Resulting shape: (C, C)
- This makes the network lose the spatial information and keep only the style features of the style image
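The gram matrix computation above can be sketched as follows (here the spatial dimensions are flattened into a (C, H*W) matrix, which is the transpose of the notes' (H*W, C) layout, so the product is the same (C, C) matrix):

```python
import torch

def gram_matrix(tensor):
    """Gram matrix of a (batch, C, H, W) feature map; returns shape (C, C)."""
    _, c, h, w = tensor.shape
    x = tensor.view(c, h * w)   # flatten the spatial dimensions
    # Channel-to-channel correlations: spatial layout is discarded,
    # only the style statistics remain
    return x @ x.t()
```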
- In style features, the earlier layers are more important than the later layers, the opposite of content extraction
- A weight (ratio) is set for each of the style layer outputs
- Also, alpha/beta, the ratio of content_weight/style_weight, can have a big effect on the final image transfer
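A sketch of the weighting described above; the exact values are assumptions (typical choices, not necessarily the ones used in these notes):

```python
# Per-layer style weights: earlier layers weighted more heavily
style_weights = {'conv1_1': 1.0,
                 'conv2_1': 0.75,
                 'conv3_1': 0.2,
                 'conv4_1': 0.2,
                 'conv5_1': 0.2}

content_weight = 1     # alpha
style_weight = 1e6     # beta; the alpha/beta ratio strongly shapes the result
```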
- Initializing the Input Image
- This could be random noise, but we will take a clone of the content image
- Use `requires_grad_(True)` so the input image can be optimized with respect to the style and content images.
- show_every = 300, to show our style transfer progress at every 300 steps.
- Initializing the optimizer to optimize the input image
- Run the optimization for 300 steps to get decent results
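The initialization above can be sketched as follows (the random tensor is just a stand-in for the loaded content image):

```python
import torch
import torch.optim as optim

content = torch.rand(1, 3, 64, 64)          # stand-in for the real content-image tensor

# Clone the content image (random noise would also work) and let
# gradients flow into the image itself
target = content.clone().requires_grad_(True)

optimizer = optim.Adam([target], lr=0.003)  # optimizes the image, not model weights
show_every = 300                            # report progress every 300 steps
steps = 300                                 # at least 300 steps for decent results
```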
- Create a loop to optimize for "steps" iterations.
- Calculate the features of the input image using the VGG model.
- Calculate the content loss using MSE.
$$\text{Content Loss} = \text{mean}\big((\text{target} - \text{input})^2\big)$$
- Here the target is the feature output computed from the content image using the pre-trained weights and biases (the image whose content we want to copy, e.g. an image of the cricketer MS Dhoni here).
- The input is the randomly initialized image; starting from random values, we minimize the loss with respect to the content image (the target) so that good features from the content image are captured/copied into the randomly initialized image.
- Calculate the style loss
- Calculate the gram matrix for the input image and the target image at each of the chosen convolutions (i.e. for the image being optimized and for the style image)
- At each step, calculate the difference between the two, take the MSE (calling the style loss function), and sum it up over all conv blocks
$$\text{Style Loss} = \text{mean}\big((\text{gram\_matrix}(y) - \text{gram\_matrix}(t))^2\big)$$
- Calculate the total loss and apply the weights
- Compute the gradients with respect to the input image and update it
- The input image is optimized to become similar to the content image at the chosen convolution (the 5th out of 16 here in this example; it can be any layer), and the style is matched to the style image by minimizing the style loss at each of the chosen convolutions.
- Here the loss is calculated after each convolution using the style loss function.
- The gradients are the derivatives of the loss with respect to the input image, not with respect to parameters like the W and b of a neural network.
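Putting the steps above together, a hedged sketch of the optimization loop. Names such as `get_features`, `gram_matrix`, `content_features`, `style_grams`, and the layer choices are assumptions carried over from the notes, passed in as parameters here:

```python
import torch
import torch.optim as optim

def run_style_transfer(target, model, get_features, gram_matrix,
                       content_features, style_grams, style_weights,
                       content_weight=1, style_weight=1e6,
                       steps=300, show_every=300, lr=0.003):
    optimizer = optim.Adam([target], lr=lr)      # the optimizer acts on the image
    for step in range(1, steps + 1):
        # 1. features of the current input image through the VGG model
        target_features = get_features(target, model)
        # 2. content loss: mean((target - input) ** 2) at the content layer
        content_loss = torch.mean(
            (target_features['conv4_2'] - content_features['conv4_2']) ** 2)
        # 3. style loss: weighted MSE between Gram matrices, summed over conv blocks
        style_loss = 0.0
        for layer, weight in style_weights.items():
            target_gram = gram_matrix(target_features[layer])
            style_loss = style_loss + weight * torch.mean(
                (target_gram - style_grams[layer]) ** 2)
        # 4. total loss with the content/style weights (alpha and beta)
        total_loss = content_weight * content_loss + style_weight * style_loss
        # 5. gradients w.r.t. the input image, not the network parameters
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        if step % show_every == 0:
            print(f"step {step}: total loss = {total_loss.item():.4f}")
    return target
```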