If you ever need JavaScript arrays of all Unicode symbols per category per Unicode version (for testing purposes, perhaps), or JavaScript-compatible regular expressions to match those symbols, this directory has got you covered. Because of the way JavaScript exposes “characters”, generating this data is trickier than it sounds, as you have to account for surrogate pairs.
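To make the surrogate pair issue concrete, here is a minimal sketch of turning a code point into a JavaScript string by hand. The function name `codePointToSymbol` is made up for this illustration and is not part of the repository's scripts; modern engines also offer `String.fromCodePoint`, which does the same job.

```javascript
// Convert a Unicode code point (0x0 to 0x10FFFF) to a JavaScript string.
// `codePointToSymbol` is a hypothetical helper name used for illustration.
function codePointToSymbol(codePoint) {
  if (codePoint <= 0xFFFF) {
    // BMP code points fit in a single UTF-16 code unit.
    return String.fromCharCode(codePoint);
  }
  // Astral code points are split into a high and a low surrogate.
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

const symbol = codePointToSymbol(0x1D306);
console.log(symbol.length); // 2, because JavaScript counts UTF-16 code units
console.log(symbol === '\uD834\uDF06'); // true
```

A single astral symbol therefore looks like two “characters” to JavaScript, which is what any array or regular expression built from this data has to account for.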
For example, I’ve used a variation of this data in the following test case, which dynamically creates and runs over 90k tests based on the appropriate Unicode categories and symbols: http://mathias.html5.org/tests/javascript/identifiers/
Per Unicode category, a number of separate files will be created:
- ${version}/categories/${category}-code-points.js: a JavaScript-compatible array containing all numerical Unicode code points in that category.
- ${version}/categories/${category}-symbols.js: a JavaScript-compatible array containing all Unicode symbols in that category as strings.
- ${version}/categories/${category}-regex.js: a JavaScript-compatible regular expression that matches all Unicode symbols in that category.
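As a rough illustration of how the three kinds of files relate to each other, the sketch below uses a made-up miniature “category” of two symbols. The values and variable names are assumptions chosen for the example; the actual generated files may be laid out differently.

```javascript
// Hypothetical miniature "category": one BMP code point and one astral one.
// These values are made up for illustration; they are not a real category.

// What a *-code-points.js file conceptually holds: numeric code points.
const codePoints = [0x61, 0x1D306];

// What a *-symbols.js file conceptually holds: the same symbols as strings.
const symbols = codePoints.map((codePoint) => String.fromCodePoint(codePoint));

// What a *-regex.js file conceptually holds: a pattern matching those symbols.
// The astral symbol is spelled out as its surrogate pair.
const regex = /a|\uD834\uDF06/;

console.log(symbols.map((symbol) => symbol.length)); // [1, 2]
console.log(symbols.every((symbol) => regex.test(symbol))); // true
```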
The same thing is done for scripts, blocks, and properties.
The data is currently being generated for the following Unicode versions:
- 2.0.14
- 2.1.9
- 3.0.1
- 3.2.0
- 4.0.1
- 4.1.0
- 5.0.0
- 5.1.0
- 5.2.0
- 6.0.0
- 6.1.0
- 6.2.0
- 6.3.0
- 7.0.0
- 8.0.0
- 9.0.0
- 10.0.0
I’ll update this repository (and this list) as soon as new Unicode versions are released.
I’ve included the Python (v2.7.1) and Bash (v3.2.48) scripts I wrote to generate these files. I’m new to Python, so suggestions on how to improve these scripts are more than welcome!
To (re-)generate all data in this repository, run:
./generate.sh
The generated data is fully tested by a script that verifies that, within the range of code points from 0x000000 to 0x10FFFF, only the symbols in ${version}/${category}-symbols.js are matched by the regular expression in ${version}/${category}-regex.js. This rather heavy test case (which runs over 33 million assertions) is available online.
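The kind of check described above could be sketched along these lines. This is not the actual test script: `verifyCategory` is a hypothetical name, and anchoring the pattern so that it must match a whole symbol is an assumption made for this example.

```javascript
// Return every code point for which the regex and the symbol list disagree.
// `verifyCategory` is a hypothetical helper, not the repository's test script.
function verifyCategory(symbols, regex, maxCodePoint = 0x10FFFF) {
  const expected = new Set(symbols);
  // Assumption: anchor the pattern so it must match the entire symbol,
  // not just one half of a surrogate pair.
  const anchored = new RegExp('^(?:' + regex.source + ')$');
  const failures = [];
  for (let codePoint = 0; codePoint <= maxCodePoint; codePoint++) {
    const symbol = String.fromCodePoint(codePoint);
    if (anchored.test(symbol) !== expected.has(symbol)) {
      failures.push(codePoint);
    }
  }
  return failures;
}

// Made-up two-symbol data set, for illustration only.
const sampleSymbols = ['a', '\uD834\uDF06'];
console.log(verifyCategory(sampleSymbols, /a|\uD834\uDF06/).length); // 0
console.log(verifyCategory(sampleSymbols, /[ab]/)); // [ 98, 119558 ]
```

In the second call, the deliberately wrong pattern matches `b` (code point 98) although it is not in the list, and misses U+1D306 (code point 119558) although it is, so both are reported.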
I’ve set up an HTTP API of sorts, which allows you to customize the output a little bit. This saves you from downloading, editing, and re-hosting the generated files if you just want to write some quick tests. Here’s an example:
http://mathias.html5.org/data/unicode/format?version=6.2.0&category=Ll&type=symbols&prepend=window.symbols%20%3D%20&append=%3B
- category: can be any Unicode category
- script: can be any Unicode script
- property: can be any Unicode property
- block: can be any Unicode block
- type: can be code-points, symbols, or regex; defaults to symbols
- version: can be any Unicode version for which data is available; defaults to the latest available version
- prepend: a string to prepend to the output; defaults to the empty string
- append: a string to append to the output; defaults to the empty string
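If you would rather build such a URL programmatically than escape it by hand, a small helper along these lines should do. The function name `buildUnicodeDataUrl` is made up for this sketch; the endpoint and parameter names are the ones listed above.

```javascript
// Endpoint taken from the example URL above.
const BASE_URL = 'http://mathias.html5.org/data/unicode/format';

// Build a request URL from a plain object of parameters.
// `buildUnicodeDataUrl` is a hypothetical helper name used for illustration.
function buildUnicodeDataUrl(params) {
  const query = Object.keys(params)
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(params[key]))
    .join('&');
  return BASE_URL + '?' + query;
}

const url = buildUnicodeDataUrl({
  version: '6.2.0',
  category: 'Ll',
  type: 'symbols',
  prepend: 'window.symbols = ',
  append: ';'
});
console.log(url);
```

`encodeURIComponent` is used here because it encodes spaces as `%20`, which reproduces the example URL exactly.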
Thanks to:
- Yusuke Suzuki for writing his Unicode database parser in Python.
- Michaeljohn “inimino” Clement for detailing the problems with regular expression ranges in JavaScript.
- Jan Moesen for teaching me about Bash’s Insanely Fantastic Splitting™.
- Steven Levithan for the valuable feedback and suggestions.