Handwritten Character Disentanglement Benchmark compiled from EMNIST dataset
The EMNIST class reads in handwritten character images and allows flexible creation of word-recognition datasets from the EMNIST letters, available at https://www.nist.gov/itl/iad/image-group/emnist-dataset
See examples.py for usage. The code below replicates the dataset used in the paper: Ver Steeg et al., "Disentangled Representations via Synergy Minimization", https://arxiv.org/abs/1710.03839
all_words.txt contains words listed by occurrence frequency, taken from Peter Norvig's word counts for the Google Web Trillion Word Corpus (count_1w.txt at http://norvig.com/ngrams/).
from emnist import EMNIST
import numpy as np
length = 3
emnist = EMNIST()
top_words = emnist.top_words_of_length(length, max_words=300)
letters = emnist.top_letters_by_position(top_words, n = 8)
print('Length ', length, ' Words combined using letters: ')
print(letters)
train_words, test_words = emnist.valid_words_from_letters(letters)
x_train, y_train = emnist.get_data(train_words, data = 'train', per_word = 1,
                                   resample_letters = 'none', save_all_imgs = True)
# Here, we treat the test set as using the same letter images but consisting of invalid English words.
# Set data = 'test' to sample EMNIST test set images
x_test, y_test = emnist.get_data(test_words, data = 'train', per_word = 1, resample_letters = 'none')
-
top_words_of_length (length, max_words, file_path, return_probabilities)
Choose a word length and a maximum number of words. Returns a list of word strings and, optionally, relative occurrence probabilities normalized among the chosen words. These may be fed to "get_data" directly to pull images, or used to choose commonly occurring letters in each character position.
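The selection described above can be sketched without the EMNIST class. This is an illustration only, not the library's implementation: the function name top_words_sketch is hypothetical, and the input is assumed to be lines of "word<TAB>count" sorted by descending count, as in Norvig's count_1w.txt (the actual layout of all_words.txt may differ).

```python
def top_words_sketch(lines, length, max_words):
    """Hypothetical helper: pick the most frequent words of a given length.

    Assumes `lines` holds 'word<TAB>count' strings sorted by descending count.
    """
    words, counts = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        word, count = parts[0], int(parts[1])
        if len(word) == length and word.isalpha():
            words.append(word)
            counts.append(count)
            if len(words) == max_words:
                break
    # Normalize counts among the chosen words only.
    total = sum(counts)
    probs = [c / total for c in counts]
    return words, probs


sample = ["the\t100", "of\t80", "and\t60", "to\t50", "for\t40"]
words, probs = top_words_sketch(sample, length=3, max_words=2)
print(words, probs)
```

Because the probabilities are normalized among the chosen words, they sum to 1 regardless of how many words were skipped.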
-
top_letters_by_position (words, n)
Choose the top n letters occurring in each of the i positions in the words list. Returns a list (length = word length) of lists of letters (each of length n).
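A rough sketch of the per-position counting, assuming every word counts equally (the library may instead weight letters by word frequency). The name top_letters_sketch is hypothetical.

```python
from collections import Counter


def top_letters_sketch(words, n):
    """Hypothetical helper: the n most common letters at each position."""
    length = len(words[0])  # assumes all words share the same length
    result = []
    for i in range(length):
        counts = Counter(word[i] for word in words)
        result.append([letter for letter, _ in counts.most_common(n)])
    return result


letters = top_letters_sketch(["the", "and", "for", "are", "any"], n=2)
print(letters)
```

The result has one inner list per character position, which is the shape valid_words_from_letters expects.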
-
valid_words_from_letters (letters)
Given a list of lists of letters to go in each position, split all combinations of letters into valid English words (train) and invalid ones (test). Can be used with the output of top_letters_by_position or with custom letter choices by position.
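The train/test split can be pictured as a Cartesian product over positions followed by a vocabulary lookup. The sketch below assumes a small in-memory vocabulary set; the function name and the vocabulary are hypothetical, and the library presumably checks against all_words.txt instead.

```python
from itertools import product


def split_combinations_sketch(letters, vocabulary):
    """Hypothetical helper: split all letter combinations by vocabulary membership."""
    valid, invalid = [], []
    for combo in product(*letters):
        word = "".join(combo)
        (valid if word in vocabulary else invalid).append(word)
    return valid, invalid


letters_by_position = [["c", "b"], ["a", "u"], ["t", "g"]]
vocabulary = {"cat", "cut", "bat", "but", "bag", "bug"}
valid_words, invalid_words = split_combinations_sketch(letters_by_position, vocabulary)
print(valid_words)    # combinations found in the vocabulary (train)
print(invalid_words)  # remaining combinations (test)
```

With n letters in each of k positions there are n**k combinations, so 8 letters in 3 positions give 512 candidate words in total.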
-
get_data (words, data = 'train'/'test', per_word, resample_letters)
Construct dataset of images by specifying word list, whether to take images from EMNIST training or test data, and how many samples per_word (can be an integer or 1d array with indices matching word list).
- resample_letters = 'none': Same letter image for each instance of a letter in a word
- resample_letters = 'all' : Resample image for each instance of a letter in word
- resample_letters = 'words' : Use the same letter image within the same word sample, i.e., the same 'p' within the word 'pip'
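One way to read the three resample_letters modes is as different rules for reusing an image index. The sketch below is an interpretation of the descriptions above, not the library's code: pick_image_indices and pool_size are hypothetical, and each letter is assumed to have a pool of candidate images indexed from 0.

```python
import random


def pick_image_indices(words, pool_size, resample_letters="none", seed=0):
    """Hypothetical helper: choose an image index for every letter of every word."""
    rng = random.Random(seed)
    fixed = {}  # letter -> index, shared across the whole dataset ('none')
    chosen = []
    for word in words:
        per_word = {}  # letter -> index, shared within one word sample ('words')
        indices = []
        for letter in word:
            if resample_letters == "all":
                indices.append(rng.randrange(pool_size))
            elif resample_letters == "words":
                if letter not in per_word:
                    per_word[letter] = rng.randrange(pool_size)
                indices.append(per_word[letter])
            elif resample_letters == "none":
                if letter not in fixed:
                    fixed[letter] = rng.randrange(pool_size)
                indices.append(fixed[letter])
            else:
                raise ValueError("unknown resample_letters: %r" % resample_letters)
        chosen.append(indices)
    return chosen


none_idx = pick_image_indices(["pip", "pit"], pool_size=1000, resample_letters="none")
words_idx = pick_image_indices(["pip", "pit"], pool_size=1000, resample_letters="words")
all_idx = pick_image_indices(["pip", "pit"], pool_size=1000, resample_letters="all")
print(none_idx, words_idx, all_idx)
```

Under this reading, 'none' reuses one image per letter everywhere, 'words' reuses it only within a single word sample, and 'all' draws independently for every letter instance.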