# ETLCDB_data_reader

A Python package for conveniently loading the ETLCDB. The complete documentation, including the API reference, can be found here.
## Intro

The ETLCDB is a collection of roughly 1,600,000 handwritten characters.
Notably, it includes Japanese kanji, hiragana, and katakana.
The data set can be found on the ETLCDB website (registration is required to download it).
Because the data set is stored in a custom data structure, it can be hard to load.
This Python package provides an easy way to load the data set and filter its entries.
An example of using this package can be found in my application DaKanji, where it was used to train a CNN that recognizes handwritten Japanese characters, numbers, and roman letters.
General information about the data set can be found in the table below.

| name       | type    | content                                   | res     | bit depth | code       | samples per label | total samples |
|------------|---------|-------------------------------------------|---------|-----------|------------|-------------------|---------------|
| ETL1       | M-Type  | numbers, roman, symbols, katakana         | 64x63   | 4         | JIS X 0201 | ~1400             | 141319        |
| ETL2       | K-Type  | hiragana, katakana, kanji, roman, symbols | 60x60   | 6         | CO59       | ~24               | 52796         |
| ETL3       | C-Type  | numeric, capital roman, symbols           | 72x76   | 4         | JIS X 0201 | 200               | 9600          |
| ETL4       | C-Type  | hiragana                                  | 72x76   | 4         | JIS X 0201 | 120               | 6120          |
| ETL5       | C-Type  | katakana                                  | 72x76   | 4         | JIS X 0201 | ~200              | 10608         |
| ETL6       | M-Type  | katakana, symbols                         | 64x63   | 4         | JIS X 0201 | 1383              | 157662        |
| ETL7       | M-Type  | hiragana, symbols                         | 64x63   | 4         | JIS X 0201 | 160               | 16800         |
| ETL8 (8B)  | 8B-Type | hiragana, kanji                           | 64x63   | 1         | JIS X 0208 | 160               | 157662        |
| ETL9 (8G)  | 8G-Type | hiragana, kanji                           | 128x127 | 4         | JIS X 0208 | 200               | 607200        |
| ETL10 (9B) | 9B-Type | hiragana, kanji                           | 64x63   | 1         | JIS X 0208 | 160               | 152960        |
| ETL11 (9G) | 9G-Type | hiragana, kanji                           | 128x127 | 4         | JIS X 0208 | 200               | 607200        |
**Note:**
The ETL6 and ETL7 parts include half-width katakana, which are stored as roman letters.
As an example: "ケ" is stored as "ke".
Those are automatically converted by this package.
Full-width numbers and letters are also converted when using the package.
Example: ０ -> 0 and Ａ -> A
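The half-width/full-width conversion described above can be reproduced with Python's standard library: Unicode NFKC normalization maps half-width katakana to their full-width forms and full-width alphanumerics to ASCII. This is a sketch of the idea, not necessarily how the package implements it:

```python
import unicodedata

def to_canonical(text: str) -> str:
    """Map half-width katakana to full-width katakana and
    full-width alphanumerics to ASCII via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

print(to_canonical("ｹ"))    # half-width katakana -> "ケ"
print(to_canonical("０Ａ"))  # full-width digit/letter -> "0A"
```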
## Setup

First download the wheel from the releases page. Then install it with:

```
pip install .\path\to\etl_data_reader_CaptainDario-2.0-py3-none-any.whl
```

Or install it directly via the release URL:

```
pip install https://github.com/CaptainDario/ETLCDB_data_reader/releases/download/v2.1.4/etl_data_reader_CaptainDario-2.1.4-py3-none-any.whl
```
Assuming you have already downloaded the ETLCDB, you have to rename some of the data set folders and files. First rename the folders like this:

- ETL8B -> ETL8
- ETL8G -> ETL9
- ETL9B -> ETL10
- ETL9G -> ETL11

Finally, rename all files in the folders to follow this naming scheme:

```
ETL_data_set\ETLX\ETLX_Y
```

(X and Y are numbers)
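The folder renaming can be scripted with `pathlib`. This is a small helper of my own (the mapping is taken from the list above; the function name and structure are illustrative, not part of the package):

```python
from pathlib import Path

# Original ETLCDB folder names -> names this package expects.
RENAMES = {"ETL8B": "ETL8", "ETL8G": "ETL9",
           "ETL9B": "ETL10", "ETL9G": "ETL11"}

def rename_etl_folders(data_set_dir: str) -> None:
    """Rename the ETLCDB part folders in place, skipping
    any folder that is not present."""
    base = Path(data_set_dir)
    for old, new in RENAMES.items():
        src = base / old
        if src.is_dir():
            src.rename(base / new)
```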
The ETLCDB website also provides a file called "euc_co59.dat". This file should be placed in the data set folder on the same level as the data set part folders.
The folder structure should now look like this:

```
ETL_data_set_folder (main folder)
|   euc_co59.dat
|
|---ETL1
|       ETL1_1
|       ...
|       ETL1_13
|---ETL2
|       ETL2_1
|       ...
|       ETL2_5
|
|--- ...
|
|---ETL10
|       ETL10_1
|       ...
|       ETL10_5
|---ETL11
        ETL11_1
        ...
        ETL11_50
```
## Usage

Now you can import the package with:

```python
import etldr
```

To load the data set you need an `ETLDataReader` instance:

```python
path_to_data_set = r"the\path\to\the\data\set"
reader = etldr.ETLDataReader(path_to_data_set)
```

Here `path_to_data_set` should be the path to the main folder of your data set copy.
Example: "E:/data/ETL_data_set/"
Now there are basically three ways to load data.

### Load one data set file

```python
from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.katakana, ETLCharacterGroups.number]
imgs, labels = reader.read_dataset_file(2, ETLDataNames.ETL7, include)
```

This will load "...\ETL_data_set_folder\ETL7\ETL7_2" and store all images and labels that are either katakana or numbers in the variables `imgs` and `labels`.
### Load one data set part

```python
from etldr.etl_data_names import ETLDataNames
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.kanji, ETLCharacterGroups.hiragana]
imgs, labels = reader.read_dataset_part(ETLDataNames.ETL2, include)
```

This will load all files in the folder "...\ETL_data_set_folder\ETL2", namely ...\ETL2\ETL2_1, ...\ETL2\ETL2_2, ..., ...\ETL2\ETL2_5, and store all images and labels that are either kanji or hiragana in the variables `imgs` and `labels`.
### Load the whole data set

**Warning:** This will use a lot of memory.

```python
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]
imgs, labels = reader.read_dataset_whole(include)
```

This will load all roman and symbol characters from the whole ETLCDB.
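To see why the warning matters, here is a rough back-of-envelope estimate of my own, assuming the default 64x64 resize and one byte per grayscale pixel (the actual footprint depends on the dtype the package uses; float pixels would multiply it by four):

```python
# Rough memory estimate for holding the whole ETLCDB in RAM.
total_samples = 1_600_000      # approximate size of the ETLCDB
height, width = 64, 64         # default resize target
bytes_per_pixel = 1            # assuming 8-bit grayscale

total_bytes = total_samples * height * width * bytes_per_pixel
print(f"~{total_bytes / 1024**3:.1f} GiB")  # roughly 6.1 GiB
```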
### Load the whole data set using multiple processes

**Warning:** This will use a lot of memory.

```python
from etldr.etl_character_groups import ETLCharacterGroups

include = [ETLCharacterGroups.roman, ETLCharacterGroups.symbols]
imgs, labels = reader.read_dataset_whole(include, 16)
```

This will load all roman and symbol characters from the whole ETLCDB using 16 processes.
### Note: filtering data set entries

As the examples above show, loading of data set entries can be restricted to certain character groups. The available groups are defined in `etl_character_groups.py`.
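Such a group filter can be approximated with plain Unicode code-point ranges. The sketch below is my own illustration of the idea, not the package's implementation:

```python
def is_hiragana(ch: str) -> bool:
    # Unicode block "Hiragana": U+3041 .. U+3096
    return "\u3041" <= ch <= "\u3096"

def is_katakana(ch: str) -> bool:
    # Unicode block "Katakana": U+30A1 .. U+30FA
    return "\u30a1" <= ch <= "\u30fa"

print(is_hiragana("あ"), is_katakana("ケ"))  # True True
```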
### Note: processing the images while loading

All of the above methods have the optional parameters `resize : Tuple[int, int] = (64, 64)` and `normalize : bool = True`.
The `resize` parameter resizes all images to the given size.
The `normalize` parameter normalizes the grayscale values of the images to the range [0, 1].

**Warning:**
If those parameters are set to negative values, no resizing/normalization will be done.
This will lead to an error if the data set is read with `read_dataset_whole()`, because the data set parts have different image sizes!
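The normalization step amounts to dividing each pixel by the maximum value its bit depth allows, e.g. 15 for the 4-bit parts. A minimal sketch of the idea, using plain lists instead of the arrays the package returns:

```python
def normalize_pixels(pixels, bit_depth):
    """Scale raw grayscale values into the range [0, 1]."""
    max_val = 2 ** bit_depth - 1   # e.g. 15 for 4-bit images
    return [p / max_val for p in pixels]

print(normalize_pixels([0, 15], 4))  # [0.0, 1.0]
```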
## Limitations

This implementation does not allow access to all of the stored data. Currently, for every ETLCDB entry one can load:

- the image
- the label of the image

However, this package should be easy to extend to support accessing the other data.
## Development notes

For development, Python 3.9 was used.

### documentation

The documentation was made with Sphinx and m2r.
m2r is used to automatically convert this README.md to .rst.
This happens when the `sphinx-build` command is invoked in the 'docs' folder.

Build the docs (should be run in the docs folder):

```
sphinx-build source build
```

### packages

A list of all packages needed for development can be found in 'requirements.txt'.
### testing

Some simple test cases are defined in the tests folder.
Testing was only performed on Windows 10.

All tests can be executed with:

```
python tests\test_etldr.py
```

Specific tests can be run with:

```
python tests\test_etldr.py etldr.test_read_dataset_part_parallel
```

Those commands should be executed at the top level of this package.

### building the wheel

The wheel can be built with:

```
python setup.py sdist bdist_wheel
```
## Additional Notes

Pull requests and issues are welcome.
If you open a pull request, make sure to run the tests beforehand.