samkit-jain/Handwriting-Recognition

Can you point me to the data set you used to create the models?

jashshah opened this issue · 2 comments

I am primarily interested in the exact data you used to train the letter recognition model. I am aware you used data from https://www.nist.gov/srd/nist-special-database-19, but can you point me

  1. to the exact database used? There are several (by class, by merge, etc.), and
  2. to the file structure you used to input data for training?

The English letters dataset was taken from NIST Special Database 19, Handprinted Forms and Characters, 2nd Edition. You can download the zip here and find more information here. SD19's images are 128x128 pixels. I converted them so that they are similar to MNIST.

Technique used,

The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. -source
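The two geometric steps in that description can be sketched in plain Python. This is only an illustration, not the code used in this repository: `fit_size` and `center_by_mass` are hypothetical helper names, images are assumed to be lists of rows with grey values 0-255, and the anti-aliased resampling itself is assumed to be done separately by an image library.

```python
def fit_size(width, height, box=20):
    """Return (new_width, new_height) that fits inside a box x box square
    while preserving the aspect ratio of the original image."""
    scale = box / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def center_by_mass(glyph, size=28):
    """Place a small greyscale glyph on a size x size canvas so that its
    center of mass lands (to the nearest pixel) on the canvas center.

    glyph is a list of rows; each value is a grey level (0 = background).
    """
    total = sum(sum(row) for row in glyph)
    if total == 0:
        raise ValueError("glyph is blank, center of mass is undefined")

    # Intensity-weighted mean row and column of the glyph.
    cy = sum(y * v for y, row in enumerate(glyph) for v in row) / total
    cx = sum(x * v for row in glyph for x, v in enumerate(row)) / total

    # Whole-pixel shift that moves the center of mass to the canvas center.
    off_y = round((size - 1) / 2 - cy)
    off_x = round((size - 1) / 2 - cx)

    canvas = [[0] * size for _ in range(size)]
    for y, row in enumerate(glyph):
        for x, v in enumerate(row):
            ty, tx = y + off_y, x + off_x
            if 0 <= ty < size and 0 <= tx < size:  # drop pixels shifted off the canvas
                canvas[ty][tx] = v
    return canvas
```

For a 128x128 SD19 image, `fit_size` gives the target dimensions for the resize, and `center_by_mass` is then applied to the resized glyph. The shift is rounded to whole pixels here for simplicity; a sub-pixel translation would need interpolation.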

First, I manually separated the images and moved each one into its respective folder. Each folder's name denotes the character whose images it contains. Then each converted image was saved as <character_name><integer>.png

Before conversion,

images
├── a
│   ├── 1.png
│   ├── 2.png
│   └── ...
├── b
│   ├── 1.png
│   ├── 2.png
│   └── ...
└── ...

After conversion,

fs2
├── a1.png
├── a2.png
├── a3.png
├── b1.png
└── ...
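The renaming from the per-character folders to the flat layout could look roughly like the sketch below. It is not the script from this repository: `flatten_dataset` is a hypothetical name, the function only copies and renames files (the image conversion is assumed to happen separately), and the integer suffix is assigned from the sorted file order, which is an assumption.

```python
import shutil
from pathlib import Path


def flatten_dataset(src, dst):
    """Copy src/<character>/<anything>.png to dst/<character><integer>.png.

    Returns the list of file names written, in the order they were created.
    """
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)

    written = []
    # One sub-folder per character; the folder name is the label.
    for char_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        images = sorted(char_dir.glob("*.png"))
        # Number the images of each character starting from 1.
        for index, image in enumerate(images, start=1):
            target = dst / f"{char_dir.name}{index}.png"
            shutil.copyfile(image, target)
            written.append(target.name)
    return written
```

Calling `flatten_dataset("images", "fs2")` on the layout shown under "Before conversion" would produce the file names shown under "After conversion".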

The codebase has been rewritten. There is now a script to automatically create the dataset.