Python scripts that generate JavaScript-compatible Unicode data

Primary language: JavaScript. License: MIT.

Unicode test data for JavaScript

If you ever need JavaScript arrays of all Unicode symbols per category per Unicode version (for testing purposes, perhaps), or JavaScript-compatible regular expressions to match those symbols, this repository has you covered. Generating this data is trickier than it sounds: JavaScript strings expose UTF-16 code units rather than code points, so every symbol outside the Basic Multilingual Plane has to be written as a surrogate pair.
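To make the surrogate pair issue concrete, here is a minimal sketch of the arithmetic involved. It is written for Python 3 (an assumption; the scripts in this repository were written against Python 2.7.1 and may handle this differently), and the function names `to_surrogate_pair` and `js_string_escape` are hypothetical, not taken from the repository.

```python
def to_surrogate_pair(code_point):
    """Split an astral code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def js_string_escape(code_point):
    """Return the escape sequence a JavaScript string literal would need."""
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point
    return "\\u%04X\\u%04X" % to_surrogate_pair(code_point)


print(js_string_escape(0x61))     # \u0061
print(js_string_escape(0x1F4A9))  # \uD83D\uDCA9
```

A symbol such as U+1F4A9 therefore has a `length` of 2 in JavaScript, which is why a naive character class cannot match it as a single unit.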

For example, I’ve used a variation of this data in the following test case: http://mathias.html5.org/tests/javascript/identifiers/. It dynamically creates and runs more than 90,000 tests, based on the appropriate Unicode categories and symbols.

Generated data

Per Unicode category, a number of separate files will be created:

  • ${version}/categories/${category}-code-points.js: a JavaScript-compatible array containing all numerical Unicode code points in that category.
  • ${version}/categories/${category}-symbols.js: a JavaScript-compatible array containing all Unicode symbols in that category as strings.
  • ${version}/categories/${category}-regex.js: a JavaScript-compatible regular expression that matches all Unicode symbols in that category.

The same thing is done for scripts, blocks, and properties.
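As a rough illustration of what the three kinds of output could look like, the sketch below renders a code point array, a symbol array, and a regular expression from a list of code points. It assumes Python 3, and every function name is hypothetical. The output format is a simplification and is not necessarily the exact format the generated files use: astral symbols are emitted as explicit surrogate pair alternatives rather than as optimized ranges.

```python
def js_escape(cp):
    """JavaScript \\uXXXX escape(s) for one code point."""
    if cp <= 0xFFFF:
        return "\\u%04X" % cp
    offset = cp - 0x10000
    return "\\u%04X\\u%04X" % (0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF))


def to_ranges(code_points):
    """Collapse code points into sorted, inclusive (start, end) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges


def render_code_points(code_points):
    return "[" + ",".join(str(cp) for cp in sorted(code_points)) + "]"


def render_symbols(code_points):
    return "[" + ",".join("'%s'" % js_escape(cp) for cp in sorted(code_points)) + "]"


def render_regex(code_points):
    bmp = [cp for cp in code_points if cp <= 0xFFFF]
    astral = sorted(cp for cp in code_points if cp > 0xFFFF)
    # Surrogate pairs go first so they are tried before any lone surrogate
    # that might appear in the BMP character class.
    parts = [js_escape(cp) for cp in astral]
    if bmp:
        body = "".join(
            js_escape(a) if a == b else js_escape(a) + "-" + js_escape(b)
            for a, b in to_ranges(bmp)
        )
        parts.append("[" + body + "]")
    if not parts:
        return "/[]/"  # an empty class never matches in JavaScript
    return "/" + "|".join(parts) + "/"


sample = [0x61, 0x62, 0x63, 0x1F4A9]
print(render_code_points(sample))  # [97,98,99,128169]
print(render_symbols(sample))      # ['\u0061','\u0062','\u0063','\uD83D\uDCA9']
print(render_regex(sample))        # /\uD83D\uDCA9|[\u0061-\u0063]/
```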

The data is currently being generated for the following Unicode versions:

  • 2.0.14
  • 2.1.9
  • 3.0.1
  • 3.2.0
  • 4.0.1
  • 4.1.0
  • 5.0.0
  • 5.1.0
  • 5.2.0
  • 6.0.0
  • 6.1.0
  • 6.2.0
  • 6.3.0
  • 7.0.0
  • 8.0.0
  • 9.0.0
  • 10.0.0

I’ll update this repository (and this list) as soon as new Unicode versions are released.

How to generate the data

I’ve included the Python (v2.7.1) and Bash (v3.2.48) scripts I wrote to generate these files. I’m new to Python, so suggestions on how to improve these scripts are more than welcome!

To (re-)generate all data in this repository, run:

./generate.sh

Tests for the generated data

The generated data is fully tested by a script that verifies that, for every code point from 0x000000 to 0x10FFFF, the regular expression in ${version}/categories/${category}-regex.js matches exactly the symbols in ${version}/categories/${category}-symbols.js and nothing else. This rather heavy test case (which runs over 33 million assertions) is available online.
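The idea behind that exhaustive check can be sketched in a few lines. This is not the actual test script (which runs against the generated JavaScript files); it is a Python 3 approximation using Python's own `re` module, and the names `build_pattern` and `find_mismatches` are hypothetical. Because Python 3 strings are sequences of code points, no surrogate pair handling is needed here.

```python
import re


def build_pattern(code_points):
    """Compile a Python regex matching exactly the given code points."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    if not ranges:
        return re.compile("(?!)")  # never matches
    body = "".join("\\U%08X-\\U%08X" % (a, b) for a, b in ranges)
    return re.compile("[" + body + "]")


def find_mismatches(code_points, pattern):
    """Return every code point where the pattern and the symbol list disagree."""
    expected = set(code_points)
    return [
        cp
        for cp in range(0x110000)  # 0x000000 through 0x10FFFF inclusive
        if bool(pattern.fullmatch(chr(cp))) != (cp in expected)
    ]


symbols = list(range(0x61, 0x7B)) + [0x1F4A9]  # a-z plus one astral symbol
print(find_mismatches(symbols, build_pattern(symbols)))  # []
print(find_mismatches(symbols, re.compile("[a-y]")))     # [122, 128169]
```

Each category costs one assertion per code point (1,114,112 in total), so repeating the check across a few dozen categories quickly reaches tens of millions of assertions.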

HTTP API

I’ve set up an HTTP API of sorts, which allows you to customize the output a little bit. This saves you from downloading, editing, and re-hosting the generated files if you just want to write some quick tests. Here’s an example:

http://mathias.html5.org/data/unicode/format?version=6.2.0&category=Ll&type=symbols&prepend=window.symbols%20%3D%20&append=%3B

Available query string parameters

  • category: can be any Unicode category
  • script: can be any Unicode script
  • property: can be any Unicode property
  • block: can be any Unicode block
  • type: can be code-points, symbols or regex; defaults to symbols
  • version: can be any Unicode version for which data is available; defaults to the latest available version
  • prepend: a string to prepend to the output; defaults to the empty string
  • append: a string to append to the output; defaults to the empty string
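For scripted use, a request URL can be assembled from these parameters with the standard library. The sketch below assumes Python 3; the helper name `build_url` is hypothetical, and the base URL is simply taken from the example above. It only builds the URL and performs no network request.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://mathias.html5.org/data/unicode/format"
ALLOWED_PARAMS = {
    "category", "script", "property", "block",
    "type", "version", "prepend", "append",
}
ALLOWED_TYPES = {"code-points", "symbols", "regex"}


def build_url(params):
    """Build a query URL from a dict of the documented parameters."""
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError("unknown parameters: %s" % ", ".join(sorted(unknown)))
    if "type" in params and params["type"] not in ALLOWED_TYPES:
        raise ValueError("invalid type: %r" % params["type"])
    # quote_via=quote encodes spaces as %20 rather than "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_url({
    "version": "6.2.0",
    "category": "Ll",
    "type": "symbols",
    "prepend": "window.symbols = ",
    "append": ";",
}))
# http://mathias.html5.org/data/unicode/format?version=6.2.0&category=Ll&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
```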

Credits

Thanks to:

Author

Mathias Bynens