- For style, the earlier layers provide a more "localized" representation. This is the opposite of the content model, where the later layers capture a more "global" structure.
- The `features` portion of VGG19 contains the CNN layers
- Setting `requires_grad = False` keeps the parameters frozen during gradient descent and backpropagation
- Use the `torch.device` function to select the GPU
- Move the model onto the device
- `unsqueeze` adds an extra dimension (just like `np.expand_dims`), giving shape (batch size, channel, H, W)
- The image is resized so its largest dimension does not exceed a maximum (400 pixels in this case)
- The image is normalized with mean = 0.5 and std = 0.5
- Image is converted to a tensor
- Convert the image into 'RGB' format
- Load the images and convert them into tensors
- Pass in the shape parameter so that the two images end up the same size
- Tensor passed in is in shape (batch_size, color channel, H, W)
- Clone the tensor before converting it to a NumPy array
- Then squeeze out the batch_size dimension
- Transpose the array into shape (H, W, color channel) for matplotlib
- Denormalize the image
- Clip the image so that it lies in the 0 to 1 range
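The conversion steps above can be sketched as a small helper (the name `im_convert` is an assumption):

```python
import numpy as np

def im_convert(tensor):
    """Turn a (batch, C, H, W) tensor back into a displayable (H, W, C) array."""
    image = tensor.to("cpu").clone().detach()  # clone first, so the original tensor is untouched
    image = image.numpy().squeeze(0)           # squeeze out the batch_size dimension
    image = image.transpose(1, 2, 0)           # (C, H, W) -> (H, W, C) for matplotlib
    image = image * 0.5 + 0.5                  # denormalize (undo mean=0.5, std=0.5)
    return image.clip(0, 1)                    # clip into the [0, 1] range
```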
- Initializing the layers that will be used for content and style output
- Create a dictionary to store the output of the image at each layer
- Create a feature dictionary for content image
- Create a feature dictionary for style image
- Create a gram matrix function
- Loop over each output of the style features and create a gram matrix dictionary.
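A hedged sketch of the feature-extraction step described above. The index-to-name mapping below is the usual one for VGG19's `features` module, and the function name `get_features` is an assumption:

```python
# Hypothetical helper: collect the outputs of chosen VGG19 layers for one image.
def get_features(image, model, layers=None):
    if layers is None:
        layers = {'0': 'conv1_1', '5': 'conv2_1', '10': 'conv3_1',
                  '19': 'conv4_1', '21': 'conv4_2',   # conv4_2 -> content representation
                  '28': 'conv5_1'}
    features = {}
    x = image
    # Run the image through each module in order, keeping the chosen outputs
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features
```

Calling this once with the content image and once with the style image builds the two feature dictionaries; the gram matrix dictionary is then built by looping over the style features.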
Gram Matrix :
- Input shape: (C, H, W) for a single feature map
- It is reshaped into a matrix X of shape (H*W, C)
- Compute X^T X (multiply its transpose with it)
- Resulting shape: (C, C)
- This makes the network lose the spatial information and keep only the style features of the style image
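The gram matrix computation above can be sketched as follows (here the spatial dimensions are flattened into a (C, H*W) matrix, which is the transpose of the notes' (H*W, C) layout, so the product is the same (C, C) matrix):

```python
import torch

def gram_matrix(tensor):
    """Gram matrix of a (batch, C, H, W) feature map; returns shape (C, C)."""
    _, c, h, w = tensor.shape
    x = tensor.view(c, h * w)   # flatten the spatial dimensions
    # Channel-to-channel correlations: spatial layout is discarded,
    # only the style statistics remain
    return x @ x.t()
```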
- In style features, the earlier layers are more important than the later layers, the opposite of content extraction
- A weight (ratio) is set for each of the style layer outputs
- Also, alpha/beta, the ratio of content_weight/style_weight, can have a big effect on the final image transfer
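A sketch of the weighting described above; the exact values are assumptions (typical choices, not necessarily the ones used in these notes):

```python
# Per-layer style weights: earlier layers weighted more heavily
style_weights = {'conv1_1': 1.0,
                 'conv2_1': 0.75,
                 'conv3_1': 0.2,
                 'conv4_1': 0.2,
                 'conv5_1': 0.2}

content_weight = 1     # alpha
style_weight = 1e6     # beta; the alpha/beta ratio strongly shapes the result
```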
- Initializing the Input Image
- This could be random noise, but we will take a clone of the content image
- Use `requires_grad_(True)` so the input image can be optimized with respect to the style and content images.
- show_every = 300, to show our style transfer progress at every 300 steps.
- Initializing the optimizer to optimize the input image
- Run the optimization for 300 steps to get decent results
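The initialization above can be sketched as follows (the random tensor is just a stand-in for the loaded content image):

```python
import torch
import torch.optim as optim

content = torch.rand(1, 3, 64, 64)          # stand-in for the real content-image tensor

# Clone the content image (random noise would also work) and let
# gradients flow into the image itself
target = content.clone().requires_grad_(True)

optimizer = optim.Adam([target], lr=0.003)  # optimizes the image, not model weights
show_every = 300                            # report progress every 300 steps
steps = 300                                 # at least 300 steps for decent results
```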
- Create a loop to optimize for "steps" iterations.
- Calculate the features of the input image using the VGG model.
- Calculate the content loss using MSE.
$$\text{Content Loss} = \text{mean}\big((\text{target} - \text{input})^2\big)$$
- Here the target is the feature output computed from the content image using the pre-trained weights and biases (the image whose content we want to copy, e.g. an image of the cricketer MS Dhoni here).
- The input is the randomly initialized image; starting from random values, we minimize the loss with respect to the content image (the target) so that good features from the content image are captured/copied into the randomly initialized image.
- Calculate the style loss
- Calculate the gram matrix for the input image and the target image at each of the chosen convolutions (i.e. for the image being optimized and for the style image)
- At each step, calculate the difference between the two, take the MSE (calling the style loss function), and sum it up over all conv blocks
$$\text{Style Loss} = \text{mean}\big((\text{gram\_matrix}(y) - \text{gram\_matrix}(t))^2\big)$$
- Calculate the total loss and apply the weights
- Compute the gradients with respect to the input image and update it
- The input image is optimized to become similar to the content image at the chosen convolution (the 5th out of 16 here in this example; it can be any layer), and the style is matched to the style image by minimizing the style loss at each of the chosen convolutions.
- Here the loss is calculated after each convolution using the style loss function.
- The gradients are the derivatives of the loss with respect to the input image, not with respect to parameters like the W and b of a neural network.
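Putting the steps above together, a hedged sketch of the optimization loop. Names such as `get_features`, `gram_matrix`, `content_features`, `style_grams`, and the layer choices are assumptions carried over from the notes, passed in as parameters here:

```python
import torch
import torch.optim as optim

def run_style_transfer(target, model, get_features, gram_matrix,
                       content_features, style_grams, style_weights,
                       content_weight=1, style_weight=1e6,
                       steps=300, show_every=300, lr=0.003):
    optimizer = optim.Adam([target], lr=lr)      # the optimizer acts on the image
    for step in range(1, steps + 1):
        # 1. features of the current input image through the VGG model
        target_features = get_features(target, model)
        # 2. content loss: mean((target - input) ** 2) at the content layer
        content_loss = torch.mean(
            (target_features['conv4_2'] - content_features['conv4_2']) ** 2)
        # 3. style loss: weighted MSE between Gram matrices, summed over conv blocks
        style_loss = 0.0
        for layer, weight in style_weights.items():
            target_gram = gram_matrix(target_features[layer])
            style_loss = style_loss + weight * torch.mean(
                (target_gram - style_grams[layer]) ** 2)
        # 4. total loss with the content/style weights (alpha and beta)
        total_loss = content_weight * content_loss + style_weight * style_loss
        # 5. gradients w.r.t. the input image, not the network parameters
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        if step % show_every == 0:
            print(f"step {step}: total loss = {total_loss.item():.4f}")
    return target
```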