ETLCDB_data_reader

A Python package for conveniently loading the ETLCDB. The complete documentation, including the API, can be found here.

Intro

The ETLCDB is a collection of roughly 1,600,000 handwritten characters. Notably, it includes Japanese Kanji, Hiragana and Katakana. The data set can be found on the ETLCDB website (registration is required to download it).
Because the data set is stored in a custom data structure, it can be hard to load. This Python package provides an easy way to load the data set and filter its entries.
An example of using this package can be found in my application DaKanji, where it was used to train a CNN to recognize handwritten Japanese characters, numbers and roman letters.
General information about the data set can be found in the table below.

name	type	content	res	Bit depth	code	samples per label	total samples
ETL1	M-Type	Numbers Roman Symbols Katakana	64x63	4	JIS X 0201	~1400	141319
ETL2	K-Type	Hiragana Katakana Kanji Roman Symbols	60x60	6	CO59	~24	52796
ETL3	C-Type	Numeric Capital Roman Symbols	72x76	4	JIS X 0201	200	9600
ETL4	C-Type	Hiragana	72x76	4	JIS X 0201	120	6120
ETL5	C-Type	Katakana	72x76	4	JIS X 0201	~200	10608
ETL6	M-Type	Katakana Symbols	64x63	4	JIS X 0201	1383	157662
ETL7	M-Type	Hiragana Symbols	64x63	4	JIS X 0201	160	16800
ETL8 (8B)	8B-Type	Hiragana Kanji	64x63	1	JIS X 0208	160	157662
ETL9 (8G)	8G-Type	Hiragana Kanji	128x127	4	JIS X 0208	200	607200
ETL10 (9B)	9B-Type	Hiragana Kanji	64x63	1	JIS X 0208	160	152960
ETL11 (9G)	9G-Type	Hiragana Kanji	128x127	4	JIS X 0208	200	607200
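
The bit depth column states how many bits encode one pixel, so a 4-bit part packs two pixels into each byte. The following is a minimal sketch of unpacking such data in plain Python. It is not the code this package uses: the function name unpack_4bit is hypothetical, and the high-nibble-first order is an assumption.

```python
from typing import List


def unpack_4bit(data: bytes) -> List[int]:
    """Split every byte into two 4-bit gray values (0..15).

    Assumption: the high nibble holds the first pixel of each pair.
    """
    pixels = []
    for byte in data:
        pixels.append(byte >> 4)    # upper four bits
        pixels.append(byte & 0x0F)  # lower four bits
    return pixels


# A 64x63 image at 4 bits per pixel occupies 64 * 63 // 2 bytes.
blank_image = unpack_4bit(bytes(64 * 63 // 2))
print(len(blank_image))  # one value per pixel
```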

Note:
The ETL6 and ETL7 parts include half-width katakana which are stored as roman letters. For example, "ｹ" is stored as "ke". These are automatically converted by this package. Full-width numbers and letters are also converted when using the package, for example ０ -> 0 and Ａ -> A.
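
The full-width conversion described above can be illustrated with Python's standard library. This is only a sketch using Unicode NFKC normalization, which is not necessarily how the package performs the conversion; the helper name to_half_width is hypothetical.

```python
import unicodedata


def to_half_width(text: str) -> str:
    """Map full-width digits and letters to their ASCII counterparts."""
    # NFKC replaces compatibility characters such as "０" or "Ａ"
    # with their canonical equivalents.
    return unicodedata.normalize("NFKC", text)


print(to_half_width("０"))  # 0
print(to_half_width("Ａ"))  # A
```

Note that NFKC is broader than the conversion listed above: it also changes other compatibility characters, for example half-width katakana become full-width katakana.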

Setup

First download the wheel from the releases page. Now install the wheel with:

pip install .\path\to\etl_data_reader_CaptainDario-2.0-py3-none-any.whl

Or install it directly via HTTPS:

pip install https://github.com/CaptainDario/ETLCDB_data_reader/releases/download/v2.1.4/etl_data_reader_CaptainDario-2.1.4-py3-none-any.whl

Assuming you have already downloaded the ETLCDB, you have to rename some of the data set folders and files. First rename the folders like this:

ETL8B -> ETL8
ETL8G -> ETL9
ETL9B -> ETL10
ETL9G -> ETL11

Finally rename all files in the folders to have a naming scheme like:

ETL_data_set\ETLX\ETLX_Y
(X and Y are numbers)

The ETLCDB website also provides a file called "euc_co59.dat". This file should be placed in the data set folder, on the same level as the data set part folders.

The folder structure should look like this now:

ETL_data_set_folder (main folder)
|   euc_co59.dat
|
|---ETL1
|       ETL1_1
|          |
|       ETL1_13
|---ETL2
|       ETL2_1
|          |
|       ETL2_5
|
|--- |
|
|---ETL10
|       ETL10_1
|          |
|       ETL10_5
|---ETL11
        ETL11_1
           |
        ETL11_50
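
Before creating a reader, it can help to verify that the renaming produced the layout shown above. The following is a minimal sketch using only the standard library; find_missing_entries is a hypothetical helper and not part of this package. It checks only the top-level entries, not the individual files inside each part.

```python
from pathlib import Path
from typing import List

# Folder names taken from the tree above: ETL1 ... ETL11.
EXPECTED_PARTS = [f"ETL{i}" for i in range(1, 12)]


def find_missing_entries(root: str) -> List[str]:
    """Return the expected top-level entries that are absent below root."""
    base = Path(root)
    missing = [name for name in EXPECTED_PARTS if not (base / name).is_dir()]
    if not (base / "euc_co59.dat").is_file():
        missing.append("euc_co59.dat")
    return missing
```

Calling find_missing_entries with the path to the main folder returns an empty list when all eleven part folders and "euc_co59.dat" are present.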

Usage

Now you can import the package with:

import etldr

To load the data set you need an ETLDataReader-instance.

# raw string, so the backslashes are not interpreted as escape sequences
path_to_data_set = r"the\path\to\the\data\set"

reader = etldr.DataReader(path_to_data_set)

where path_to_data_set should be the path to the main folder of your data set copy.
Example: "E:/data/ETL_data_set/"

There are three ways to load data: a single file, a whole data set part, or the complete data set (optionally using multiple processes).

Load one data set file

from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.katakana, ETLCharacterGroups.number]

imgs, labels = reader.read_dataset_file(2, ETLDataNames.ETL7, include)

This will load "...\ETL_data_set_folder\ETL7\ETL7_2".

The images and labels of all katakana and number entries are stored in the variables imgs and labels.

Load one data set part

from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.kanji, ETLCharacterGroups.hiragana]

imgs, labels = reader.read_dataset_part(ETLDataNames.ETL2, include)

This will load all files in the folder "...\ETL_data_set_folder\ETL2", namely ...\ETL2\ETL2_1, ...\ETL2\ETL2_2, ..., ...\ETL2\ETL2_5.

The images and labels of all kanji and hiragana entries are stored in the variables imgs and labels.

Load the whole data set

Warning: This will use a lot of memory.

from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]

imgs, labels = reader.read_dataset_whole(include)

This will load all roman and symbol characters from the whole ETLCDB.

Load the whole data set using multiple processes

Warning: This will use a lot of memory.

from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]

imgs, labels = reader.read_dataset_whole(include, 16)

This will load all roman and symbol characters from the whole ETLCDB using 16 processes.

Note: filtering data set entries

As the examples above show, loading can be restricted to certain character groups. The available groups are listed in etl_character_groups.py.
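
To illustrate what filtering by character group means, the sketch below assigns a single label character to a group based on its Unicode code point. It is an illustration only and not the package's implementation: the function name and the returned group names are hypothetical, and the kanji range covers only the CJK Unified Ideographs block.

```python
def guess_character_group(char: str) -> str:
    """Assign one character to a coarse group via its Unicode code point."""
    code = ord(char)
    if "0" <= char <= "9":
        return "number"
    if "a" <= char <= "z" or "A" <= char <= "Z":
        return "roman"
    if 0x3041 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs
        return "kanji"
    return "symbols"


labels = ["あ", "ケ", "漢", "7", "A", "+"]
wanted = {"kanji", "hiragana"}
kept = [label for label in labels if guess_character_group(label) in wanted]
print(kept)
```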

Note: processing the images while loading

All of the above methods have the optional parameters:
resize : Tuple[int, int] = (64, 64)
and
normalize : bool = True
The resize parameter resizes all images to the given size.
The normalize parameter normalizes the grayscale values of the images to the range $[0.0, 1.0]$.
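
As an illustration of what normalization means here, the following sketch scales integer gray values to the range [0.0, 1.0] by dividing by the largest value the bit depth allows. This shows the general idea under that assumption and is not the package's actual code; normalize_pixels is a hypothetical helper.

```python
from typing import List


def normalize_pixels(pixels: List[int], bit_depth: int) -> List[float]:
    """Scale integer gray values to floats in [0.0, 1.0]."""
    max_value = 2 ** bit_depth - 1  # e.g. 15 for a 4-bit image
    return [pixel / max_value for pixel in pixels]


print(normalize_pixels([0, 3, 15], bit_depth=4))  # [0.0, 0.2, 1.0]
```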

Warning: If those parameters are set to negative values, no resizing/normalization will be done.
This will lead to an error if the data set is read with read_dataset_whole()!

Limitations

This implementation does not give access to all the stored data. Currently one can load:

image
label of the image

of every ETLCDB entry.

However, this package should be easy to extend with support for accessing the other data.

Development notes

Python 3.9 was used for development.

documentation

The documentation was made with Sphinx and m2r. m2r is being used to automatically convert this README.md to .rst. This happens when the sphinx-build-command is invoked in the 'docs'-folder.
To build the docs, run the following in the docs folder:

sphinx-build source build

packages

A list of all packages needed for development can be found in 'requirements.txt'.

testing

Some simple test cases are defined in the tests folder. Testing was only performed on Windows 10.
All tests can be executed with:

python tests\test_etldr.py

Specific tests can be run with:

python tests\test_etldr.py etldr.test_read_dataset_part_parallel

These commands should be executed from the top level of this package.

building the wheel

The wheel can be built with:

python setup.py sdist bdist_wheel

Additional Notes

Pull requests and issues are welcome.

If you open a pull request, make sure to run the tests first.