open-i18n/data

Add Unicode UCDXML data source

Source

License

UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE
https://www.unicode.org/license.html

Open Questions

  1. Should we set up a single repo containing only the all (complete UCD) set, or also set up one or two additional repos for the nounihan and/or unihan sets?
  2. Do we need to include both the grouped and flat files in the repo, or is one enough? If both, do they belong in two separate repos?

Other Notes

From https://www.unicode.org/Public/12.0.0/ucdxml/ucdxml.readme.txt:

While every effort has been made to ensure consistency of the 
XML representation with the UCD files, there may be some errors;
the UCD files are authoritative.


There are six files, available in zip/jar format; the size is that of
the archive:

                       flat       grouped

no Unihan data       897 KB        556 KB
Unihan data only   5,855 KB      5,862 KB
complete UCD       7,657 KB      6,420 KB

The flat versions do not use the group mechanism. The grouped versions
use the group mechanism, with groups corresponding approximately to
the blocks (a few blocks have been subdivided).

The "no Unihan data" files do not contain the properties expressed only
in the Unihan database. The "Unihan data only" files contain only
the properties and code points expressed in the Unihan database.
The "complete UCD" files reflect the complete UCD data.
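
For the grouped-versus-flat question, it may help to see what a consumer has to do with a grouped file. The following is a minimal sketch, assuming the UCDXML convention (described in UAX #42) that attributes on a `group` element act as defaults for the elements nested inside it. The sample XML is a hand-written fragment for illustration, not an excerpt from the real files, and `flatten_repertoire` is a hypothetical helper name. Child elements such as name aliases are ignored here.

```python
import xml.etree.ElementTree as ET

# Namespace used by UCDXML documents.
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Hand-written fragment for illustration only; not copied from the real data.
GROUPED_SAMPLE = f"""\
<ucd xmlns="{UCD_NS}">
  <repertoire>
    <group blk="ASCII" gc="Lu" age="1.1">
      <char cp="0041" na="LATIN CAPITAL LETTER A"/>
      <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
    </group>
  </repertoire>
</ucd>
"""


def flatten_repertoire(xml_text):
    """Return one attribute dict per code point entry.

    Entries inside a <group> start from the group's attributes and then
    apply their own, so an entry's value wins over the group default.
    Entries outside any group (as in the flat files) are copied as-is.
    """
    root = ET.fromstring(xml_text)
    repertoire = root.find(f"{{{UCD_NS}}}repertoire")
    records = []
    for node in repertoire:
        if node.tag == f"{{{UCD_NS}}}group":
            for child in node:
                merged = dict(node.attrib)    # group-level defaults
                merged.update(child.attrib)   # per-entry overrides
                records.append(merged)
        else:
            records.append(dict(node.attrib))
    return records


if __name__ == "__main__":
    for record in flatten_repertoire(GROUPED_SAMPLE):
        print(record["cp"], record["gc"], record["na"])
```

If this reading of the group mechanism is right, the grouped files can be expanded into the flat form by a small script like this one, which may be relevant to deciding whether the repo needs to carry both variants.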