cjk-decomp

Decomposition data for 75,000 CJK ideographs; fork (with fixes) of

Apache License 2.0

Note: This repo is unmaintained. For better, actively maintained data in a standard format, I strongly recommend using cjkvi-ids instead (see ids.txt).

CJK Decomposition Data

The CJK Decomposition Data File is a graphical analysis of the approximately 75,000 Chinese/Japanese characters in Unicode.

The data file is in UTF-8 encoding and was originally compiled by Gavin Grover (original project). It is distributed under 6 licenses, of which you need to choose only one:

The data comprises the 36 strokes (U+31C0..U+31E3), the 115 radicals (U+2E80..U+2EF3, except U+2E9A), the 20,924 unified characters (U+4E00..U+9FBB), the 12 unique characters from the compatibility range (U+F900..U+FAD9), the 6,582 extension A characters (U+3400..U+4DB5), the 42,711 extension B characters (U+20000..U+2A6D6), the 4,149 extension C characters (U+2A700..U+2B734), and the 222 extension D characters (U+2B740..U+2B81D).
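As a quick sanity check, the counts above can be recomputed from the code point ranges. The following is a minimal sketch, not part of the data set: the names `BLOCKS`, `range_size`, and `total_characters` are hypothetical, and the 12 compatibility characters are added as a fixed count because the text does not say which code points in U+F900..U+FAD9 they are.

```python
# (label, first code point, last code point, count stated in the text)
BLOCKS = [
    ("strokes", 0x31C0, 0x31E3, 36),
    ("radicals", 0x2E80, 0x2EF3, 115),  # U+2E9A is excluded, see below
    ("unified", 0x4E00, 0x9FBB, 20924),
    ("extension A", 0x3400, 0x4DB5, 6582),
    ("extension B", 0x20000, 0x2A6D6, 42711),
    ("extension C", 0x2A700, 0x2B734, 4149),
    ("extension D", 0x2B740, 0x2B81D, 222),
]

# Code points the text explicitly leaves out of a range.
EXCLUDED = {0x2E9A}

# Only 12 characters of the compatibility range are used; the text does not
# list them, so they are counted as a constant rather than as a range.
COMPATIBILITY_COUNT = 12


def range_size(first, last):
    """Number of code points in first..last (inclusive), minus exclusions."""
    skipped = sum(1 for cp in EXCLUDED if first <= cp <= last)
    return last - first + 1 - skipped


def total_characters():
    """Total number of characters covered by the ranges listed above."""
    return sum(range_size(f, l) for _, f, l, _ in BLOCKS) + COMPATIBILITY_COUNT


if __name__ == "__main__":
    for label, first, last, stated in BLOCKS:
        print(f"{label}: computed {range_size(first, last)}, stated {stated}")
    print("total:", total_characters())
```

Every range size matches the count stated in the text, and the total comes to 74,751, which is the "approximately 75,000" figure quoted earlier.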

Each record has three fields: the character being defined, the type of decomposition, and a list of zero or more constituent components, like so:

的:a(白,勺)

The character being defined and each constituent component is either a Unihan token, in the basic or a supplemental plane, or a 5-digit number representing an intermediate decomposition not in Unicode. There are approximately 10,000 such intermediate decompositions.
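A record in this format can be split with a short regular expression. This is a minimal sketch rather than an official parser: `Record`, `parse_record`, and `is_intermediate` are hypothetical names, and it assumes that a record with zero components is written with empty parentheses and that the sample line `10001:c()` is a plausible (invented) intermediate entry.

```python
import re
from typing import List, NamedTuple

# character ":" type "(" comma-separated components ")"
RECORD_RE = re.compile(r"^([^:]+):([^()]+)\(([^()]*)\)$")


class Record(NamedTuple):
    character: str         # Unihan character or 5-digit intermediate number
    kind: str              # decomposition type, e.g. "a" or "str"
    components: List[str]  # zero or more constituents


def parse_record(line: str) -> Record:
    """Parse one line such as '的:a(白,勺)' into its three fields."""
    match = RECORD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a decomposition record: {line!r}")
    character, kind, parts = match.groups()
    # Assumption: an empty pair of parentheses means "no components".
    components = parts.split(",") if parts else []
    return Record(character, kind, components)


def is_intermediate(token: str) -> bool:
    """True if the token is a 5-digit number rather than a Unicode character."""
    return re.fullmatch(r"[0-9]{5}", token) is not None


if __name__ == "__main__":
    print(parse_record("的:a(白,勺)"))
    print(parse_record("10001:c()"))  # invented sample line
```

Using `[0-9]` instead of `\d` keeps the intermediate check from matching non-ASCII digit characters.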

If you need a font, you can use the Hanazono font.

Only pictorial configurations are used, not semantic ones. Where characters have typeface differences I've used the one provided by the Mainland Chinese contribution to Unicode. When there's more than one possible configuration, I've selected one only.

The possible configurations and their meanings are:

| Code regex | Meaning | Number of possible constituents |
|---|---|---|
| `c` | component | 0 |
| `m.*` | modified in some way, e.g. me=equivalent, msp=special, mo=outline, ml=left radical version | 1 |
| `w.*` | second constituent contained within first in some way, e.g. w=within at the center, wbl=within at bottom left | 2 |
| `ba\|d` | second between first moving across or downwards | 2 |
| `lock` | components locked together | 2 |
| `s.*` | first component surrounds second, e.g. s=surrounds fully, str=surrounds around the top-right | 2 |
| `a` | flows across | >=2 |
| `d` | flows downwards | >=2 |
| `r.*` | repeats and/or reflects in some way, e.g. refh=reflect horizontally, rot=rotate 180 degrees, rrefr=repeat with a reflection rightwards, ra=repeat across, r3d=repeat 3 times downwards, r3tr=repeat in a triangle, rst=repeat surrounding around the top | 1 |

The s, a, d, and r codes may be followed by /t, /m, /s, or /o, to show whether the join touches, molds, snaps together, or overlaps, respectively.
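The join suffix can be separated from the base code with a simple split. This is a sketch under the assumption that the suffix is appended directly to the code, as in `a/t`; the names `JOINS` and `split_kind` are hypothetical and not part of the data set.

```python
from typing import Optional, Tuple

# Join suffixes and their meanings, as listed in the text above.
JOINS = {
    "t": "touches",
    "m": "molds",
    "s": "snaps together",
    "o": "overlaps",
}


def split_kind(kind: str) -> Tuple[str, Optional[str]]:
    """Split a type code such as 'str/o' into ('str', 'overlaps').

    Codes without a suffix return None as the join description.
    """
    base, sep, suffix = kind.partition("/")
    if not sep:
        return base, None
    if suffix not in JOINS:
        raise ValueError(f"unknown join suffix in {kind!r}")
    return base, JOINS[suffix]


if __name__ == "__main__":
    for code in ("a", "a/t", "str/o"):
        print(code, "->", split_kind(code))
```

The sketch does not check that the base code is one of the s, a, d, or r families; a stricter version could reject a suffix on any other code.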

Some more work needs to be done, including reducing the quantity of intermediate components by removing duplicates, lowering the number of components in many sequences, reanalyzing decomposition configurations, and of course quality checking and corrections.

See also