Always missing one character in the output
manhcuogntin4 opened this issue · 3 comments
manhcuogntin4 commented
I used pyclstm to train on my dataset of dates, but the output is always missing the character 9. All the other characters are OK. Hoping for help!
kba commented
Can you provide sample images, sample output, what data you trained on, parameters used for training / recognition?
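One way to answer the question about the training data is to count how often each character occurs in the ground-truth text. The following is only a minimal sketch, assuming the ground truth is available as a list of strings; `char_frequencies` and `missing_chars` are hypothetical helper names and not part of pyclstm, and the sample strings are illustrative rather than taken from this issue.

```python
from collections import Counter


def char_frequencies(texts):
    """Count how often each character appears across all ground-truth strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts


def missing_chars(expected, texts):
    """Return the expected characters that never appear in the ground truth."""
    counts = char_frequencies(texts)
    return sorted(ch for ch in expected if counts[ch] == 0)


# Illustrative strings only, not data from this issue.
sample_texts = ["24/07/16", "01/02/15", "30/11/18"]
print(char_frequencies(sample_texts).most_common())
print(missing_chars("0123456789/", sample_texts))
```

If a character turns out to be rare or absent in the training portion of the data, that would be worth mentioning alongside the sample images and parameters.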
manhcuogntin4 commented
@kba thanks for your answer!
the output of test is 24/07/16
the output of test is 0/0/166
And here is my code:
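import glob
import random
import sys
from itertools import chain

from PIL import Image  # assumes Image comes from Pillow
import pyclstm

# Load line images and ground-truth transcriptions; both lists are sorted by path
all_imgs = [Image.open(p) for p in sorted(glob.glob("./book/*/*.png"))]
all_texts = [open(p).read().strip() for p in sorted(glob.glob("./book/*/*.gt.txt"))]
if sys.version_info <= (3,):
    all_texts = [t.decode('utf8', "replace") for t in all_texts]
all_data = list(zip(all_imgs, all_texts))
train_data = all_data[:5500]
test_data = all_data[5500:]
len(all_data)

ocr = pyclstm.ClstmOcr()
graphemes = set(chain.from_iterable(all_texts))
#graphemes=('0','1','2','3','4','5','6','7','8','9', '/','*')
ocr.prepare_training(graphemes)

# Set once, before the loop, so the best score is not reset on every iteration
best_error = 1.
for i in range(100000):
    img, txt = random.choice(train_data)
    out = ocr.train(img, txt)
    # Every 10 iterations, show the current training sample and its output
    if not i % 10:
        aligned = ocr.aligned()
        print("Truth: {}".format(txt))
        print("Aligned: {}".format(aligned))
        print("Output: {}".format(out))
    # Every 1000 iterations, measure the character error rate on the test set
    if not i % 1000:
        errors = 0
        chars = 0
        for img, txt in test_data:
            out = ocr.recognize(img)
            errors += pyclstm.levenshtein(txt, out)
            chars += len(txt)
        error = float(errors) / chars  # float() avoids integer division on Python 2
        print("=== Test set error after {} iterations: {:.2f}"
              .format(i, error))
        if error < best_error:
            best_error = error  # remember the new best so worse models do not overwrite it
            print("=== New best error, saving model to model_date_permis.clstm")
            ocr.save("./model_date_permis.clstm")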
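The overall test error in the loop above does not show which characters are being lost. As a rough diagnostic, the ground truth and the recognized text could be aligned and the dropped characters tallied. This is only a sketch using Python's standard `difflib`, not pyclstm's own alignment; `dropped_chars` is a hypothetical helper name and the example pairs are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def dropped_chars(pairs):
    """Tally ground-truth characters that are absent or replaced in the output."""
    dropped = Counter()
    for truth, output in pairs:
        matcher = SequenceMatcher(None, truth, output, autojunk=False)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            # "delete" and "replace" both mean truth[i1:i2] did not survive
            if tag in ("delete", "replace"):
                dropped.update(truth[i1:i2])
    return dropped


# Made-up (truth, output) pairs for illustration only.
example_pairs = [("24/07/19", "24/07/1"), ("12/03/15", "12/03/15")]
print(dropped_chars(example_pairs).most_common())
```

In the training script, the pairs could be collected as `(txt, ocr.recognize(img))` over `test_data`, which would show whether 9 is the only character affected.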
lomograb commented
Train more = Better result