Homework Assignment 4 : Neural Style Transfer


Table of Contents

  • Introduction
  • Approach
  • Results
  • Platform
  • Installation guidelines
  • References
  • Contributors

Introduction


In the past, creating an image in a particular style required a specialised artist along with a great deal of time and money. Over the last few decades, computer researchers have taken a keen interest in this problem and have developed techniques such as non-photorealistic rendering (NPR), which can only produce a limited number of styles, and image analogies, which learn a style transfer from a pair of unstylised and stylised images. The most recent and most prominent method is neural style transfer using convolutional neural networks. Neural Style Transfer (NST) builds on the idea that a CNN trained on a computer-vision task can separate the style representation and the content representation within an image. It takes two input images, a style image and a content image, and forms an output image that has the content of the content image rendered in the style of the style image.

Approach


To make running the models fast and easy, eager execution is enabled: operations are evaluated immediately instead of first being compiled into a graph, which saves time when running the models and also makes the code faster to debug.
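A minimal sketch of enabling it, assuming TensorFlow 1.x (in TensorFlow 2.x eager execution is already on by default and this call is unnecessary):

```python
import tensorflow as tf

# TF 1.x: eager execution must be switched on explicitly, before any
# other TensorFlow call; TF 2.x runs eagerly by default.
tf.enable_eager_execution()

print("Eager execution enabled:", tf.executing_eagerly())
```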

  • Layers used and details about layers
    For fetching the content and the style features from the input images, the outputs of the intermediate layers of the network are used. The feature maps become more complex and abstract as we move from the lower to the higher layers of the network.
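A sketch of fetching intermediate-layer outputs with tf.keras; the layer names below are the standard VGG-19 ones and are illustrative, not necessarily the exact layers used in this assignment:

```python
import tensorflow as tf

# Pre-trained backbone without the classifier head; we only need the
# convolutional feature maps.
vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
vgg.trainable = False

# Illustrative choices: one deep layer for content, the first layer of
# each convolutional block for style.
content_layers = ['block5_conv2']
style_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']

# Model that maps an input image to the selected intermediate outputs.
outputs = [vgg.get_layer(name).output for name in style_layers + content_layers]
feature_extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)
```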

  • Content Feature Extraction
    As we move deeper into the network, the features extracted from the images carry more information about the content of the image and less about the exact pixel values. So the features extracted from the higher layers are used as the content representation needed for forming the output image; usually a deep layer from the last block of convolutional layers is chosen for content extraction.
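For instance, loading a content image and reading off its deep-layer activations as the content representation might look like this (a sketch reusing the hypothetical feature_extractor above; the filename is illustrative):

```python
import numpy as np
import tensorflow as tf

def load_image(path, size=512):
    """Load an image and preprocess it the way VGG expects."""
    img = tf.keras.preprocessing.image.load_img(path, target_size=(size, size))
    img = tf.keras.preprocessing.image.img_to_array(img)
    img = np.expand_dims(img, axis=0)  # add the batch dimension
    return tf.keras.applications.vgg19.preprocess_input(img)

# The last output corresponds to the content layer ('block5_conv2' above).
content_features = feature_extractor(load_image('content.jpg'))[-1]
```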

  • Style Feature Extraction
    For the style representation used in forming the output image, correlations are computed among the different feature maps produced by the convolutional layers; these capture the texture information (local details) without capturing the global arrangement of the scene. These correlations are known as Gram matrices. Usually the first layer in every block of convolutional layers is used, with a different weight for each layer; varying these weights varies the strength of the style.
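A minimal sketch of computing the Gram matrix of one layer's feature map:

```python
import tensorflow as tf

def gram_matrix(feature_map):
    """Channel-to-channel correlations of one convolutional feature map."""
    # Flatten the spatial dimensions: (batch, h, w, c) -> (h*w, c).
    channels = int(feature_map.shape[-1])
    flattened = tf.reshape(feature_map, [-1, channels])
    positions = tf.shape(flattened)[0]
    # (c, c) matrix of inner products, normalised by the number of positions.
    gram = tf.matmul(flattened, flattened, transpose_a=True)
    return gram / tf.cast(positions, tf.float32)
```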

  • Architectures implemented
    To extract the feature maps for the style and content representations, a pre-trained model is used; its outputs are fetched from the layers chosen for content and style extraction. Here, we have tried implementing the following architectures (see the loading sketch after the list):

    • VGG-16 architecture
    • VGG-19 architecture
    • ResNet50 architecture
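A sketch of loading the three pre-trained backbones from tf.keras.applications (ImageNet weights, classifier head dropped):

```python
import tensorflow as tf

# include_top=False drops the fully connected classifier so that only
# the convolutional layers used for feature extraction remain.
backbones = {
    'vgg16': tf.keras.applications.VGG16(include_top=False, weights='imagenet'),
    'vgg19': tf.keras.applications.VGG19(include_top=False, weights='imagenet'),
    'resnet50': tf.keras.applications.ResNet50(include_top=False, weights='imagenet'),
}

for name, model in backbones.items():
    model.trainable = False  # the backbone stays frozen during style transfer
    print(name, '->', len(model.layers), 'layers')
```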
  • Loss function and optimisation

    • Content Loss Function
      To make sure that the difference between the content of the generated output image and that of the content image is minimised, the content loss function is defined using the mean squared error (MSE).
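A sketch of the content loss as the MSE between the content-layer activations of the two images:

```python
import tensorflow as tf

def content_loss(content_features, generated_features):
    """Mean squared error between the two content representations."""
    return tf.reduce_mean(tf.square(generated_features - content_features))
```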

    • Style Loss Function
      To make sure that the difference between the texture of the generated output image and that of the style image is minimised, the style loss function is defined using the concept of Gram matrices.
      A Gram matrix captures the correlation between the channels of the same convolutional layer used for style feature extraction; each entry indicates to what degree one channel is correlated with another.
      The Gram matrices of the style image and of the generated output image at the same layers are compared using a squared-difference function so that this loss can be minimised.
      As multiple layers are involved in extracting the style, a weight is assigned to the loss of every layer, and the weighted sum gives the final style loss function, written out below.
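In the notation of Gatys et al. [2], the Gram matrix of layer l, the per-layer loss between the generated image's Gram matrix G^l and the style image's A^l, and the weighted total are:

```latex
% Gram matrix of layer l: inner products of its vectorised feature maps
G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

% Per-layer loss; N_l feature maps, each with M_l spatial positions
E_{l} = \frac{1}{4 N_l^{2} M_l^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}

% Total style loss: weighted sum over the chosen style layers
\mathcal{L}_{\mathrm{style}} = \sum_{l} w_{l} \, E_{l}
```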

    • Complete Loss Function
      To make sure that the generated output image is similar to the content image in content and to the style image in style (rather than reproducing the style image wholesale), the loss function is split into two separate parts, a content loss and a style loss. With every iteration the goal is to minimise this overall loss function so that we obtain the desired output image.

      The parameters alpha and beta act as weights controlling how much of the content and style features is carried into the generated output image, as shown below.
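Written out (as in [2]), with C, S, and G denoting the content, style, and generated images:

```latex
\mathcal{L}_{\mathrm{total}}(C, S, G)
  = \alpha \, \mathcal{L}_{\mathrm{content}}(C, G)
  + \beta  \, \mathcal{L}_{\mathrm{style}}(S, G)
```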
    • Gradient Descent for optimisation
      Gradient descent is generally used to minimise the loss, which helps in generating a more faithful output image. Here we have used the Adam optimiser: the gradients of the loss are backpropagated to the generated image, whose pixel values are updated after every iteration, driving the loss function down. One optimisation step is sketched after this paragraph.
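A sketch of one optimisation step, assuming the tf.keras API; compute_loss is a hypothetical helper that evaluates the complete loss for the current image, and content_image is assumed to be already loaded:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)

# The generated image itself is the variable being optimised, typically
# initialised from a copy of the (assumed pre-loaded) content image.
generated_image = tf.Variable(content_image, dtype=tf.float32)

def train_step():
    with tf.GradientTape() as tape:
        loss = compute_loss(generated_image)  # hypothetical total-loss helper
    grad = tape.gradient(loss, generated_image)
    optimizer.apply_gradients([(grad, generated_image)])
    # Keep the pixel values in a valid range after the update.
    generated_image.assign(tf.clip_by_value(generated_image, 0.0, 255.0))
    return loss
```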

Results


The point of developing the neural style transfer method is that simply adding images together cannot combine the content and the texture (style) of different images into a single image; we need a model that learns the content and style features and applies them to create the desired output image. The difference can be seen below:

[Figure: neural style transfer result vs. simple addition of the images]
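For contrast, the naive pixel-wise blend that NST improves upon might look like this hypothetical sketch; it merely averages intensities and transfers no style:

```python
import numpy as np

def naive_blend(content_array, style_array, w=0.5):
    """Pixel-wise weighted average of two same-shaped uint8 images."""
    blended = w * content_array.astype(np.float32) \
        + (1.0 - w) * style_array.astype(np.float32)
    return blended.astype(np.uint8)
```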
  • Different style images
[Figure: content image, style image, and output image]
- For different style images, different textures are applied to the output image.
- If the style image has a dark texture, then even if the content image is lighter in shade, the output image acquires the darker texture while retaining the same content.
- If the texture of the style image is completely different from that of the content image, the output image still takes on the style image's texture.
  • Different resolution content images
    Comparing the output images for different sizes (128×128, 256×256, 512×512, 1024×1024) of the same content image gives the following results:
- As the content image size increases, more content features are extracted from the image, at the cost of an increase in running time.
- The output image contains more feature detail as the size increases (it goes from more blurred to less blurred).
- Boundary detection in the images is weakest at low resolution and strongest at high resolution.
- More style features are visible in the high-resolution image than in the low-resolution image.
  • Different algorithms
    The comparison of applying the different architectures to the same content and style images is:
[Figure: outputs produced by VGG-16, VGG-19, and ResNet-50]
The difference between the architecture complexities is as follows:
| Comparison       | VGG-16  | VGG-19  | ResNet-50 |
| ---------------- | ------- | ------- | --------- |
| Time complexity  | 0.0604s | 0.0679s | 0.0974s   |
| Space complexity | Low     | Low     | High      |
| Steps required   | Fewer   | Fewer   | More      |
- As VGG-19 has more layers, it produces a more robust output image than VGG-16, while ResNet-50 is not preferred because it fails to extract the style features properly.

Platform


  • Google Colab

Installation guidelines


  • To clone this repository
git clone https://github.com/DipikaPawar12/CV_Assignment6-7_Aanshi_Dipika.git
  • To install the requirements
pip install -r requirements.txt
  • To mount the drive
from google.colab import drive
drive.mount('/content/drive')
  • For content and style images (i.e. the source and target images),
    • either use the images in the images folder,
    • or use other images found online.

References


[1] Neural Style Transfer: Creating Art with Deep Learning using tf.keras and eager execution
[2] L. A. Gatys, A. S. Ecker, and M. Bethge, A Neural Algorithm of Artistic Style.
[3] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song, Neural Style Transfer: A Review.
[4] Y. Li, N. Wang, J. Liu, and X. Hou, Demystifying Neural Style Transfer.
[5] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, Universal Style Transfer via Feature Transforms.

Contributors


| Dipika Pawar | Aanshi Patwari |