One-shot learning for image classification using Siamese neural networks

One-shot learning with Siamese networks

Typical CNN classification methods involve a final fully-connected layer with one neuron per class. This is suboptimal when the number of classes is large or changes over time.

In Siamese CNNs, we extract features from an image and convert it into an n-dimensional vector. We compare this n-dimensional vector with that of another image, and the model is trained such that images of the same class will produce similar vectors.

By comparing an unknown image against samples of labelled images, we can determine which labelled image is most similar to the unknown image, and use its class as the classification result. This allows Siamese networks to learn classification tasks from few training samples, and to generalize to any number of classes.

Illustration of a Siamese network

Architecture

Much like a typical CNN, a Siamese CNN will have several convolutional layers, followed by fully-connected layers. The convolutional layers help to extract features from an image, before conversion into vectors for comparison.
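To make this concrete, here is a minimal sketch (not the repository's code) of such an embedding network in tf.keras; the layer sizes and the 64-dimensional output are illustrative assumptions.

```python
import tensorflow as tf

def build_embedding_net(input_shape=(28, 28, 1), embedding_dim=64):
    """Convolutional feature extractor followed by fully-connected layers
    that map each image to an n-dimensional vector for comparison."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(embedding_dim),  # the output vector used for comparison
    ])
```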

Training

When training a Siamese CNN, we input two images together with a binary label indicating whether the two images are of the same class. The last layer of the CNN is a fully-connected layer, which produces an n-dimensional vector. In the rest of this document, the terms output layer and output vector are used interchangeably; both refer to this layer and the vector it produces. Depending on the label, the model then tries to minimize (same class) or maximize (different class) the distance between the vectors produced by the two images.

Note that the network that both images pass through is the same. This means that the weights and biases applied to both images are identical throughout the training process.
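A sketch of this weight sharing, assuming the build_embedding_net helper from the architecture sketch above: the same model instance (and therefore the same weights) is applied to both images, and the distance between the two output vectors is what the loss acts on.

```python
import tensorflow as tf

embed = build_embedding_net()          # one set of weights, shared by both branches

img_a = tf.keras.Input(shape=(28, 28, 1))
img_b = tf.keras.Input(shape=(28, 28, 1))

vec_a = embed(img_a)                   # both images pass through the same layers
vec_b = embed(img_b)

# L2 distance between the two output vectors; the loss pushes it down for
# same-class pairs and up for different-class pairs.
dist = tf.norm(vec_a - vec_b, axis=1, keepdims=True)

siamese = tf.keras.Model(inputs=[img_a, img_b], outputs=dist)
```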

Loss

In this project we experiment with two different loss functions. The loss is calculated from the L1 or L2 distance between the outputs of the CNN (the fully-connected output layer) for the two images.

Loss with spring

The loss function shown below is described in Dimensionality Reduction by Learning an Invariant Mapping (Hadsell, Chopra & LeCun, 2006). The following GitHub project is used as a reference for the implementation of the loss function.

Siamese loss function
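As a rough sketch (not the repository's exact code), the contrastive or "spring" loss from that paper can be written as follows, assuming label = 1 for similar pairs and 0 for dissimilar pairs, dist as the L2 distance between the two output vectors, and a margin hyperparameter:

```python
import tensorflow as tf

def contrastive_loss(label, dist, margin=1.0):
    label = tf.cast(label, dist.dtype)
    # similar pairs are pulled together (quadratic "spring" on the distance)
    similar_term = label * tf.square(dist)
    # dissimilar pairs are pushed apart, but only while they are inside the margin
    dissimilar_term = (1.0 - label) * tf.square(tf.maximum(margin - dist, 0.0))
    return tf.reduce_mean(0.5 * (similar_term + dissimilar_term))
```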

Sigmoid loss

The sigmoid loss is used for one-shot image recognition on the Omniglot dataset in the paper Siamese Neural Networks for One-shot Image Recognition (Koch et al., 2015). The model architecture used in that paper is also the basis for the CNN used in our Omniglot task.
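A sketch of this idea, assuming tf.keras and illustrative variable names (not the repository's): the component-wise L1 distance between the two output vectors is passed through a single sigmoid unit, and the result is trained with binary cross-entropy against the same/different label.

```python
import tensorflow as tf

def sigmoid_pair_output(vec_a, vec_b):
    # component-wise L1 distance between the two output vectors
    l1 = tf.abs(vec_a - vec_b)
    # a single dense unit learns the per-component weighting; the sigmoid turns
    # the weighted distance into a probability that the pair is the same class
    return tf.keras.layers.Dense(1, activation="sigmoid")(l1)

# trained against the 0/1 pair label
loss_fn = tf.keras.losses.BinaryCrossentropy()
```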

MNIST

We start with MNIST to test our implementation. The model was trained with learning_rate=1e-4 over 20,000 iterations. The training results for several architectures are summarized below:

| commit_hash | conv. kernel size | accuracy | description |
|---|---|---|---|
| 983a8a8 | 3x3 | 0.9758 | 2-layer FC + 2-neuron output |
| df5d2b9 | 5x5 | 0.9844 | 2-layer conv + 2-layer FC + 2-neuron output |
| df5d2b9 | 3x3 | 0.9856 | 2-layer conv + 2-layer FC + 2-neuron output |
| 3757780 | 3x3 | 0.9890 | 2-layer conv + 2-layer FC (output) |

Transfer learning

We first train a CNN on an MNIST classification task, achieving 99.37% accuracy on the test set. We then transfer the weights from its convolutional layers to the Siamese CNN before training the Siamese model with learning_rate=1e-4 over 10,000 iterations. This achieved a test accuracy of 98.99%, higher than the best accuracy attained without transfer learning (0.9890 above).
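A minimal sketch of this weight transfer, assuming both models are tf.keras models with matching convolutional layer shapes (the repository's actual implementation may differ):

```python
import tensorflow as tf

def transfer_conv_weights(classifier, embedding_net):
    """Copy the pretrained convolutional weights from the MNIST classifier
    into the Siamese embedding network before Siamese training starts."""
    conv_src = [l for l in classifier.layers if isinstance(l, tf.keras.layers.Conv2D)]
    conv_dst = [l for l in embedding_net.layers if isinstance(l, tf.keras.layers.Conv2D)]
    for src, dst in zip(conv_src, conv_dst):
        dst.set_weights(src.get_weights())   # kernel and bias are copied as-is
```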

Testing

MNIST images for evaluation

For each of the ground-truth images above, we obtain its output vector from the model. Then, for each image being evaluated, we obtain its output vector as well and find the closest ground-truth vector using the L1 or L2 distance; the class of that ground-truth image is the prediction.
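A sketch of this nearest-vector classification step in NumPy, assuming ref_vecs and ref_labels hold the precomputed ground-truth vectors and their labels:

```python
import numpy as np

def classify(test_vec, ref_vecs, ref_labels, ord=2):
    """Return the label of the ground-truth vector closest to test_vec.
    ord=1 uses the L1 distance, ord=2 the L2 distance."""
    dists = np.linalg.norm(ref_vecs - test_vec, ord=ord, axis=1)
    return ref_labels[np.argmin(dists)]
```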

Omniglot

The Omniglot dataset is typically used for one-shot learning, as it contains a large number of classes, with few training samples per class.

While the training and testing classes were the same for MNIST, the Omniglot dataset allows us to test the model on classes completely different from those used in training.

A random seed of 0 was set for both Python's built-in random library and TensorFlow.
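For reference, the seeding would look roughly like this (the exact TensorFlow call depends on the version):

```python
import random
import tensorflow as tf

random.seed(0)
tf.random.set_seed(0)   # tf.set_random_seed(0) in TensorFlow 1.x
```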

Data

Training

Images in the images_background folder were used for training. For each class (e.g. Alphabet_of_the_Magi/character01), all possible pairs of images were appended to a list. For example, a class with 20 images yielded 20 choose 2 = 190 pairs.

n_samples pairs were then chosen at random from these possible pairs to form the training data for similar images. For each similar pair, we add a dissimilar pair by choosing two different classes at random and taking one image from each class. This ensures that the numbers of similar and dissimilar pairs are equal.
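A sketch of this pair-building step, assuming a hypothetical class_to_images mapping from class name to its list of image paths:

```python
import itertools
import random

def build_pairs(class_to_images, n_samples):
    # all within-class pairs, e.g. a class with 20 images yields 190 pairs
    similar = []
    for images in class_to_images.values():
        similar.extend(itertools.combinations(images, 2))
    similar = random.sample(similar, n_samples)

    # one dissimilar pair per similar pair: two random classes, one image each
    dissimilar = []
    for _ in similar:
        cls_a, cls_b = random.sample(list(class_to_images), 2)
        dissimilar.append((random.choice(class_to_images[cls_a]),
                           random.choice(class_to_images[cls_b])))
    return similar, dissimilar
```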

Testing

Images in the images_evaluation folder were used for testing. We use 20 classes (Angelic/character{01-20}) for testing, and measure accuracy as the fraction of correct predictions.

Results

Loss with spring

| model_name | n_samples | n_iterations | learning_rate | dist | accuracy |
|---|---|---|---|---|---|
| fc1 | 20,000 | 50,000 | 1e-5 | L1 | 0.4025 |
| fc1 | 20,000 | 50,000 | 1e-5 | L2 | 0.4150 |
| fc1 | 40,000 | 50,000 | 1e-5 | L1 | 0.4000 |
| fc1 | 40,000 | 50,000 | 1e-5 | L2 | 0.4000 |
| fc1_reg1 | 20,000 | 50,000 | 1e-5 | L1 | 0.2700 |
| fc1_reg1 | 20,000 | 50,000 | 1e-5 | L2 | 0.2725 |
| fc2 | 20,000 | 50,000 | 1e-5 | L1 | 0.2875 |
| fc2 | 20,000 | 50,000 | 1e-5 | L2 | 0.2800 |

fc1

Single fully-connected layer with 4096 neurons.

fc1_reg1

fc1 with regularization (scale 2e-4) applied to the convolutional layers.

fc2

Two fully-connected layers with 2048 neurons each, with dropout=0.5 between fc1 and fc2. The number of neurons was reduced because of out-of-memory (OOM) errors.
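For illustration, the fc2 head might look roughly like this in tf.keras (the activations and everything beyond the sizes given above are assumptions):

```python
import tensorflow as tf

fc2_head = tf.keras.Sequential([
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dropout(0.5),        # dropout between the two FC layers
    tf.keras.layers.Dense(2048, activation="relu"),
])
```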

References

Implementation

Reading

Dataset