This project is meant to provide a reference implementation of the Siamese network architecture described in this paper, as well as a novel re-organization and pre-processing method for the IAM handwriting dataset. It was completed as the final project for CU Boulder CSCI 5922: Neural Networks and Deep Learning.
Project Collaborators:
- Lawrence Hessburg (lawrence.hessburgiv@colorado.edu)
- Poorwa Hirve (poorwa.hirve@colorado.edu)
- Prathyusha Gayam (prathyusha.gayam@colorado.edu)
- Payoj Jain (payoj.jain@colorado.edu)
To get started training and testing the network yourself right away, jump to the How to run section.
A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. 'Identical' here means that they have the same configuration with the same parameters and weights; parameter updates are mirrored across the subnetworks. Such a network measures the similarity of its inputs by comparing their feature vectors.
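The idea of identical branches can be illustrated with a tiny NumPy sketch: one shared weight matrix encodes both inputs, and similarity is judged from the distance between the two feature vectors. This is only a conceptual illustration, not code from this project.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # ONE weight matrix shared by both branches

def encode(x):
    """Both inputs pass through the same parameters ('identical' branches)."""
    return np.tanh(W @ x)

a, b = rng.normal(size=8), rng.normal(size=8)
fa, fb = encode(a), encode(b)
# Similarity is judged by comparing the two feature vectors.
dist = np.linalg.norm(fa - fb)
print(fa.shape, dist >= 0.0)  # (4,) True
```

Because `W` is the only parameter set, any gradient update automatically affects both branches, which is what "parameter updating is mirrored" means in practice.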
As an application, we have implemented a Siamese convolutional neural network to determine whether two pieces of handwritten text were written by the same author. While implementing this project, we faced many challenges, from organizing the dataset to building a model that performs well on our problem. Among all the architectures we tried and trained, ResNetSiamese performed the best.
We also tested fake handwriting images, generated using CycleGANs, against the real ones on our model, and learned that with more images our model should perform well at differentiating fake samples from real ones.
The basic network architectures are adapted from the Stanford paper mentioned above. In the basic structure, two inputs (image A and image B) are fed to two identical CNNs. The output encodings of the two images are then concatenated, and the concatenated vector is fed to a fully connected layer to get the class scores.
We concatenate the two encodings f(A) and f(B) to get the vector [f(A), f(B)], expanding an n-dimensional encoding into a 2n-dimensional vector.
Each identical subnetwork of the baseline Siamese CNN consists of:
- Conv layer with 32 filters, followed by ReLU and MaxPool 2x2
- Conv layer with 64 filters, followed by ReLU and MaxPool 2x2
- Conv layer with 64 filters, followed by ReLU
- Fully connected layer with 400 hidden units
- Dropout with probability = 0.5
- Fully connected layer with 200 hidden units
- L2 Regularization
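The layer list above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's exact code: the class name `BaselineSiamese`, the 3x3 kernels, the 1x64x128 input size, and the 2-way classifier head are assumptions.

```python
import torch
import torch.nn as nn

class BaselineSiamese(nn.Module):
    """Sketch of the baseline Siamese CNN (kernel and input sizes assumed)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 32, 400),  # 400 hidden units
            nn.Dropout(p=0.5),
            nn.Linear(400, 200),           # 200-dim encoding
        )
        # The two encodings are concatenated and scored.
        self.classifier = nn.Linear(2 * 200, 2)

    def forward(self, a, b):
        # The SAME encoder (shared weights) processes both inputs.
        return self.classifier(
            torch.cat([self.encoder(a), self.encoder(b)], dim=1))

model = BaselineSiamese()
# L2 regularization is applied via the optimizer's weight_decay term.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scores = model(torch.randn(4, 1, 64, 128), torch.randn(4, 1, 64, 128))
print(scores.shape)  # torch.Size([4, 2])
```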
The ResNet variant is built from tiny blocks, each consisting of:
- Conv layer
- Batch normalization
- ReLU
Once the basic building block is built, the following architecture is created:
- One ResNet unit with 16 filters
- Two ResNet units with 16 filters
- Two ResNet units with 32 filters and stride 2
- Two ResNet units with 64 filters and stride 2
- A fully connected layer that brings the output down to 10 dimensions
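The block and unit list above can be sketched in PyTorch as follows. This is an illustrative sketch, not the project's TinyResnet code: the 3x3 kernels, the global average pooling before the final layer, and applying stride 2 only to the first unit at each width are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One 'tiny block': conv -> batch norm -> ReLU, with a skip connection
    (kernel size 3 is an assumption)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        # 1x1 projection so the shortcut matches shape when it changes.
        self.skip = (nn.Identity() if stride == 1 and c_in == c_out
                     else nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False))

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)) + self.skip(x))

# Stack matching the unit list above: 16-filter units, then 32-filter and
# 64-filter units with stride 2, then a 10-dimensional output layer.
encoder = nn.Sequential(
    ResBlock(1, 16),
    ResBlock(16, 16), ResBlock(16, 16),
    ResBlock(16, 32, stride=2), ResBlock(32, 32),
    ResBlock(32, 64, stride=2), ResBlock(64, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),  # brings the output to 10 dimensions
)

out = encoder(torch.randn(2, 1, 64, 128))
print(out.shape)  # torch.Size([2, 10])
```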
We used the IAM Handwriting Database (http://www.fki.inf.unibe.ch/databases/iam-handwriting-database) for training and testing. It contains handwriting samples in various formats from 657 writers across 1539 pages of scanned text. The original dataset is not well organized for authorship verification purposes, so we re-organized a subset of it to make author-based tasks much more efficient. This process is outlined below.
- Using `forms.txt` from the IAM Handwriting Database, download `lines.txt` and get the authors of each file. `forms.txt` contains information about each sample: author, sentence, etc.
- Restructure the dataset around the top 100 authors.
- Keep a width threshold (we used 1000 px) so that each sample carries a significant amount of data, i.e. lines instead of words.
- We do this because ours is a handwriting recognition problem: the more words in each sample, the better.
- We have stored this data on a public server as `Authors.zip`.
- Download it along with the dependencies by running `./install_dependencies.sh`
- UPDATE: The preprocessed data is no longer available.
- Once we get the top 100 authors, we generate the training data using `create_pairs.py`, which divides images randomly into pairs.
- Each pair is written as `author1 author2 1/0` --> `1 if author1 == author2 else 0`
- We now have our `train.txt` files for various sizes.
- Repeat steps 3 and 4 for the validation files: `valid.txt` for various sizes.
Now we have our train-test files according to the data preprocessing steps.
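The pairing step above can be sketched in plain Python as follows. The actual logic in `create_pairs.py` may differ; the function name, data layout, and 50/50 positive-negative split here are assumptions for illustration.

```python
import random

def make_pairs(images_by_author, n_pairs, seed=0):
    """Randomly form (image, image, label) pairs: label 1 when both images
    come from the same author, 0 otherwise (illustrative sketch)."""
    rng = random.Random(seed)
    authors = list(images_by_author)
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:  # same-author (positive) pair
            a = rng.choice(authors)
            x, y = rng.sample(images_by_author[a], 2)
            pairs.append((x, y, 1))
        else:           # different-author (negative) pair
            a, b = rng.sample(authors, 2)
            pairs.append((rng.choice(images_by_author[a]),
                          rng.choice(images_by_author[b]), 0))
    return pairs

data = {"a01": ["a01-000.png", "a01-001.png", "a01-002.png"],
        "a02": ["a02-000.png", "a02-001.png"]}
pairs = make_pairs(data, 4)
print(len(pairs))             # 4
print({p[2] for p in pairs})  # {0, 1}
```

Each resulting triple corresponds to one `author1 author2 1/0` line in `train.txt` or `valid.txt`.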
Our results can be seen in the results directory. These include graphs and figures from our models.
To generate a table for accuracy, true positive rate and true negative rate, run test scripts for either of the models.
Refer to Data Preprocessing for downloading data and dependencies.
All package requirements are in `requirements.txt` and can be installed by running `install_dependencies.sh`. The Authors dataset can be downloaded by running `download_data.sh` in the `Dataset` directory.
`Model_checkpoints/best` is a checkpoint from training for 20 epochs on 40,000 training pairs; it is available for testing or as a pretrained starting point.
In `train.py` there are several hyperparameters available to adjust, declared at the top of the script:
- `LEARNING_RATE`: the initial learning rate for Adam optimization
- `BATCH_SIZE`: the minibatch size used during training
- `THRESHOLD_VALUE`: the thresholding level for removing scanning artifacts. In a given input image, any pixel value higher than `THRESHOLD_VALUE` will be set to white.
- `CROP_SIZE`: the horizontal size to crop input images to. All input images start >= 1000 pixels wide, so `CROP_SIZE` must be < 1000. NOTE: Changing this value will require a change to the fully connected layer size of the TinyResnet model. Proceed with caution.
- `RANDOM_CROP`: flag determining whether to crop images from a random location, or starting at 0. For example, if the original image is 1500 pixels wide, `CROP_SIZE` is 700 and `RANDOM_CROP` is True, the result will be a 700-pixel-wide window of the original image at a random location.
- `DOWNSAMPLE_RATE`: downsampling ratio to speed up training. Must be between 0 and 1.
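How the thresholding and cropping parameters interact can be sketched with NumPy as follows. This is not the code from `train.py`; the function name and shapes are illustrative, and `DOWNSAMPLE_RATE` is omitted for brevity.

```python
import numpy as np

THRESHOLD_VALUE = 200   # pixels brighter than this become white
CROP_SIZE = 700         # horizontal crop width (< 1000)
RANDOM_CROP = True

def preprocess(img, rng):
    """Threshold scanning artifacts to white, then crop horizontally."""
    img = img.copy()
    img[img > THRESHOLD_VALUE] = 255          # remove light scanner noise
    start = (rng.integers(0, img.shape[1] - CROP_SIZE + 1)
             if RANDOM_CROP else 0)           # random vs. left-aligned window
    return img[:, start:start + CROP_SIZE]

rng = np.random.default_rng(0)
line = rng.integers(0, 256, size=(120, 1500), dtype=np.uint8)  # fake 1500-px line
out = preprocess(line, rng)
print(out.shape)  # (120, 700)
```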
After getting data in the appropriate paths, run the training scripts, either `train.py` or `trainBaseline.py`, with the following flags and arguments:
- `-c` or `--cuda` to run on GPU
- `-e EPOCHS`
- `--load_checkpoint PATH_TO_MODEL_CHECKPOINT`
- `data_path`: path to the .txt file containing training pairs
- e.g.: `python3 train.py Dataset/train_100.txt -c -e 50`
Once the model finishes training, checkpoints will be stored in the checkpoints directory. You can use them for testing with either `test.py` or `testBaseline.py` with the following flags and arguments:
- `-c` or `--cuda` to run on GPU
- `-e EPOCHS`
- `data_path`: path to the .txt file containing testing pairs
- `load_checkpoint`: path to the model checkpoint
- e.g.: `python3 valid.py Dataset/valid_100.txt Model_Checkpoints/epoch20 -c`
We are using CycleGAN. It can be trained and tested using the script `./do_gan_stuff` in the GAN directory.
Once the generated images are obtained, you can run them through the model using the same command mentioned in Validation.