ZhirAI Bootcamp

This repo serves as the starting point and documentation for training new models. Since Tesseract ships with a large number of pre-trained models (.traineddata), we don't need to train models from scratch; we can fine-tune an existing model to improve its accuracy. To learn more about the options for training models, read the official documentation.

Steps

Install Tesseract 4.1 with the training binaries: Linux/Windows (on Windows, also add it to PATH). Note: version 5-alpha seems to have a bug and can't be used for training yet.
```
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
```
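
Before moving on, it can help to confirm that the training binaries are actually on PATH. The sketch below is one way to check; the tool names passed to `check_tools` (a hypothetical helper, not part of this repo) are assumptions about what the training scripts call.

```shell
#!/bin/sh
# Report which of the given commands are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed tool list; adjust it to whatever the scripts in this repo invoke.
check_tools tesseract text2image lstmtraining combine_tessdata \
    || echo "Some tools are missing; install the training binaries first."
```
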
Prepare the raw data and put it in the langdata folder.
Place all fonts in the fonts folder.
Run `find . -type f -print0 | xargs -0 dos2unix` in a terminal to fix line endings for all files.
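
To see whether any files still have Windows line endings after the conversion, something like the following sketch could be used. `list_crlf_files` is a hypothetical helper, and it assumes a `grep` that supports `-r` (GNU and BSD grep do).

```shell
#!/bin/sh
# List files under a directory that contain lines ending in CR (CRLF files).
# Defaults to the current directory when no argument is given.
list_crlf_files() {
    cr=$(printf '\r')
    # -r: recurse into the directory, -l: print only names of matching files
    grep -rl "${cr}\$" "${1:-.}"
}

# Only scan langdata if it exists; grep exits non-zero when nothing matches.
if [ -d langdata ]; then
    list_crlf_files langdata || echo "no CRLF files found in langdata"
fi
```

An empty result for the text folders suggests the line endings are already in Unix form.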

Generate ground truth (18 hours for 13.5 million lines in total):
```
python3 -m pip install image
python3 -m pip install python-bidi
sudo chmod +x *.sh
nohup ./1-txt2lstmf.sh ckb > 1-ckb.log &
```

Run training (at least 24 hours). Note: make sure the number of characters in the unicharset matches the one specified by the training script (2-train-layer.sh). More information.
```
nice --20 nohup ./2-train-layer.sh ckb > training.log &
```
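
The unicharset check mentioned in the note can be scripted. A unicharset file stores its entry count on the first line, so a minimal sketch only needs to compare that number with the expected one. The helper names, the file path, and the expected value are all assumptions here; take the real value from 2-train-layer.sh.

```shell
#!/bin/sh
# Print the entry count stored on the first line of a unicharset file.
unicharset_size() {
    head -n 1 "$1" | tr -d '[:space:]'
}

# Succeed only if the unicharset has the expected number of entries.
# Usage: check_unicharset_size FILE EXPECTED
check_unicharset_size() {
    actual=$(unicharset_size "$1")
    if [ "$actual" = "$2" ]; then
        echo "ok: $1 has $actual entries"
    else
        echo "mismatch: $1 has $actual entries, expected $2" >&2
        return 1
    fi
}
```

For example, `check_unicharset_size path/to/unicharset 120` (both arguments hypothetical) would fail loudly before a 24-hour run starts with the wrong size.
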
Create best and fast .traineddata files from each .checkpoint file
```
make traineddata MODEL_NAME=ckb
```

Useful scripts:

```
# See available fonts in a folder:
text2image --fonts_dir path/to/fonts --list_available_fonts

# Open a log file and scroll to the end:
less +G ./1-ckb.log

# Run 1-txt2lstmf in background:
nohup ./1-txt2lstmf.sh ckb > 1-ckb.log &

# To kill a process and see system resource usage:
htop

# Run txt2lstmf scripts:
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-1 > logs/1.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-2 > logs/2.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-3 > logs/3.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-4 > logs/4.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-5 > logs/5.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-6 > logs/6.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-7 > logs/7.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-8 > logs/8.log &

# Get disk info:
df

# Get directory size:
du -sh bootcamp

# Get number of files in a directory:
find bootcamp/gt -name '01_Sarchia_Abdulkareem.200.4*' | wc -l
find bootcamp/gt -type f | wc -l
```
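
The eight near-identical txt2lstmf launches can also be generated with a loop. The sketch below only prints the command lines so they can be reviewed first; `print_txt2lstmf_jobs` is a hypothetical helper, and the `<fonts dir>/<lang>-<n>` and `logs/<n>.log` naming is taken from the commands above.

```shell
#!/bin/sh
# Print one txt2lstmf launch command per font subfolder.
# Usage: print_txt2lstmf_jobs LANG FONTS_DIR COUNT
print_txt2lstmf_jobs() {
    lang=$1
    fonts_dir=$2
    count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        echo "nice --20 nohup ./1-txt2lstmf.sh $lang $fonts_dir/$lang-$i > logs/$i.log &"
        i=$((i + 1))
    done
}

# Reproduces the eight commands listed above.
print_txt2lstmf_jobs ckb fonts-8 8
```

Piping the output to `sh` would start the jobs; the `logs` directory has to exist beforehand, since the redirections write into it.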

Notes:

Make sure `langdata/ckb/ckb.fontslist.txt` lists at least one font and has an empty line at the end!
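
This check is easy to automate before a long run. The sketch below assumes that "an empty line at the end" means the file's last byte is a newline, so that the final font name is a complete line; `check_fontslist` is a hypothetical helper, not part of this repo.

```shell
#!/bin/sh
# Succeed only if the fonts list names at least one font and its final
# byte is a newline (assumed meaning of "an empty line at the end").
check_fontslist() {
    file=$1
    if [ ! -f "$file" ]; then
        echo "error: $file not found" >&2
        return 1
    fi
    # At least one line must contain a non-whitespace character.
    if ! grep -q '[^[:space:]]' "$file"; then
        echo "error: $file lists no fonts" >&2
        return 1
    fi
    # Command substitution strips a trailing newline, so the result is
    # empty exactly when the last byte of the file is a newline.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo "error: $file does not end with a newline" >&2
        return 1
    fi
    echo "ok: $file"
}

f=langdata/ckb/ckb.fontslist.txt
if [ -f "$f" ]; then
    check_fontslist "$f" || echo "fix $f before generating ground truth"
fi
```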

Troubleshooting

If you have trouble pushing your changes to the repository, run `git config http.postBuffer 524288000`, which raises Git's HTTP post buffer to 500 MiB.

Read more

Training Tesseract 4
Tesseract Data Files
Tesstrain wiki, especially the article about Arabic Handwriting.
Kurdish Wikipedia Dumps
Create Custom Neural Net for handwritten digits
