jecolon/ziglyph

Question: More size-optimal grapheme cluster sorting?

Closed this issue · 7 comments

Hi @jecolon ! I'm using ziglyph in something like a regexp engine, but instead of traditional Unicode codepoint ranges like [\x{1F600}-\x{1F64F}] (which require abominations like this to match all valid emoji), I plan to implement range matching in terms of grapheme clusters.

I can do this today with the Unicode collation algorithm, but because it depends on normalization and the Unicode sort keys (UCA/latest/allkeys.txt), it ends up producing quite a large binary. Even with gzip --best, the two data files clock in at around half a MB:

hexops-mac:zorex slimsag$ du -sh asset/
568K	asset/
hexops-mac:zorex slimsag$ du -sh asset/*
308K	asset/uca-allkeys.txt.gz
260K	asset/ucd-UnicodeData.txt.gz

I'm aiming to have WebAssembly support be quite nice, so I'm looking for ways to reduce the file size needed for sorting grapheme clusters.

I have a few ideas I'm planning to explore - namely, I suspect a binary encoding of the sort keys and UnicodeData.txt could go a long way toward reducing the file size (it also seems these files contain comments that could perhaps be omitted?)

But I figured I'd reach out first and ask: has anyone else thought about this? Are there perhaps better ways to do grapheme cluster sorting that I have missed?

Hello @slimsag ! While working on the normalization and collation parts of Ziglyph, I also realized that the data size requirements for these two algorithms were huge compared to the rest of the components. For example, a binary that incorporates the Ziglyph struct or Zigstr never goes above 250K in ReleaseFast mode. Just adding normalization jumps to 585K, and collation jumps to 905K! So you're definitely right that this is not optimal.

In the other components of the library, I was able to process the data and generate Zig code that doesn't need the source data files to do its work. But in the case of normalization and collation, this approach resulted in huge Zig source files that also produced large binaries and sometimes even failed to compile! That's why, for these two components only, I still depend on the data files, loading them when the structs are initialized. I'm going to investigate optimization options for these components in the Unicode spec, and I'll keep you updated here. If you know or find something, don't hesitate to let me know! I appreciate the help and feedback.

Ok, some progress. Analyzing how Normalizer uses the UnicodeData.txt file, it turns out it only needs 2 fields, and only the records in which the second of those fields isn't empty, so I modified the generation code to derive just that data from UnicodeData.txt into a new Decompositions.txt file. The difference is big: from 1.8M for UnicodeData.txt to 102K for Decompositions.txt. All tests are passing, so I'm pretty sure this change in the data has maintained the same functionality. Now I'll see what can be done with the allkeys.txt file for collation.
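
The gist of the change, sketched in Go just for illustration (the actual generator is Zig code inside Ziglyph, and the real Decompositions.txt is simplified a bit further, as described later in this thread; field positions follow the published UnicodeData.txt layout, where field 0 is the code point and field 5 is the decomposition mapping):

```go
// Sketch: keep only the code point and decomposition fields of
// UnicodeData.txt, skipping records with an empty decomposition.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	in, err := os.Open("UnicodeData.txt")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		// UnicodeData.txt is semicolon-delimited; field 0 is the code
		// point, field 5 is the decomposition type + mapping.
		fields := strings.Split(scanner.Text(), ";")
		if len(fields) < 6 || fields[5] == "" {
			continue // no decomposition: Normalizer doesn't need this record
		}
		fmt.Fprintf(out, "%s;%s\n", fields[0], fields[5])
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```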

Ok, got the Decompositions.txt file further down from 102K to 70K, and derived an allkeys-minimal.txt file that's 498K versus the 1.9M original allkeys.txt file. All tests passing. Even so, binary sizes when including Normalizer or Collator are still pretty big. Going to keep drilling down to see what I can find.

Very cool progress @jecolon ! Especially interesting to see the UnicodeData.txt reduction.

I'm curious if you could shed some more light on how you performed those reductions? For allkeys.txt it seems to be mostly the same data, but with maybe ~2k lines removed plus a new format?

I am wondering how much of your reductions came from making the format more brief vs. removing entries, and if the latter, how hard it would be to get a copy of these files in the same format but with the entries you found to be unneeded removed.

This weekend I did some quick experiments with these files in Go (just as a starting point, since I can move more quickly in it; a proper Zig version will come later).

I pulled in some help from my brother @Andoryuuta, and he experimented with a binary encoding for these files. He was able to get the original 1.9M allkeys.txt (not your reduced one) down to ~250K with that approach, and ~125K gzipped, while retaining all data excluding comments.
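
I don't have his exact format to share, but to give a feel for why a binary encoding helps so much, here's a hypothetical fixed-width layout for a single allkeys.txt record (illustrative only - this is not his actual encoding, and the example weights are made up):

```go
// Hypothetical layout: a length-prefixed list of code points (3 bytes
// each, since code points fit in 21 bits) followed by a length-prefixed
// list of collation elements (u16 primary, u16 secondary, u8 tertiary).
// That's ~10 bytes vs. ~24 bytes of text for "0041 ; [.1FD0.0020.0008]".
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

type element struct {
	primary, secondary uint16
	tertiary           uint8
}

func encodeEntry(buf *bytes.Buffer, codepoints []uint32, elems []element) {
	buf.WriteByte(byte(len(codepoints)))
	for _, cp := range codepoints {
		// 3 little-endian bytes per code point.
		buf.Write([]byte{byte(cp), byte(cp >> 8), byte(cp >> 16)})
	}
	buf.WriteByte(byte(len(elems)))
	for _, e := range elems {
		binary.Write(buf, binary.LittleEndian, e.primary)
		binary.Write(buf, binary.LittleEndian, e.secondary)
		buf.WriteByte(e.tertiary)
	}
}

func main() {
	var buf bytes.Buffer
	// Illustrative record; real weights come from allkeys.txt.
	encodeEntry(&buf, []uint32{0x0041}, []element{{0x1FD0, 0x0020, 0x08}})
	fmt.Printf("record encoded in %d bytes: % x\n", buf.Len(), buf.Bytes())
}
```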

On my side, I took your already-reduced allkeys.txt and Decompositions.txt and started writing a small compression algorithm specifically for these files. I came up with a small state machine + differential encoding that was able to reduce these quite substantially:

| File | Original | Original + gzip -9 | My compression | My compression + gzip -9 |
| --- | --- | --- | --- | --- |
| Decompositions.txt | 72K | 28K | 48K | 12K |
| allkeys-minimal.txt | 500K | 148K | 204K | 36K |

Because I did these experiments in Go, I'm lacking a few efficiencies: there are a lot of places where I encode things as uint8 instead of a more optimal u4, or uint32 instead of a more optimal u21, etc. I think it should be possible to get even smaller sizes.
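
To make the differential-encoding idea concrete: the entries are sorted by code point, so storing each code point as a delta from the previous one keeps most values tiny, and a variable-length integer then fits them in a single byte instead of three. Here's a toy sketch of just that part (not my actual code; it ignores the weights and the state machine entirely):

```go
// Toy sketch: delta + varint encoding of a sorted code point column.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

func deltaEncode(codepoints []uint32) []byte {
	var buf bytes.Buffer
	var tmp [binary.MaxVarintLen64]byte
	prev := uint32(0)
	for _, cp := range codepoints {
		// Consecutive code points are close together, so the delta
		// usually fits in a single varint byte.
		n := binary.PutUvarint(tmp[:], uint64(cp-prev))
		buf.Write(tmp[:n])
		prev = cp
	}
	return buf.Bytes()
}

func deltaDecode(data []byte) []uint32 {
	var out []uint32
	prev := uint32(0)
	for len(data) > 0 {
		d, n := binary.Uvarint(data)
		if n <= 0 {
			break // malformed input
		}
		data = data[n:]
		prev += uint32(d)
		out = append(out, prev)
	}
	return out
}

func main() {
	cps := []uint32{0x0041, 0x0042, 0x0043, 0x00E9, 0x1F600}
	enc := deltaEncode(cps)
	fmt.Printf("%d code points -> %d bytes; decoded: %#x\n", len(cps), len(enc), deltaDecode(enc))
}
```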

Also, I tested with gzip above, but I really should figure out what the most common compression is for .wasm files (brotli? not sure). I think we can get compression levels competitive with or better than gzip -9 without as big a dependency as gzip.

Okay, here are some updated numbers with brotli compression (which AFAIK is the best for compressing .wasm files generally, and you basically get it for free when shipping WebAssembly since it's built into browsers rather than being an extra dependency):

| File | Original | gzip -9 | brotli | mine | mine + gzip -9 | mine + brotli |
| --- | --- | --- | --- | --- | --- | --- |
| Decompositions.txt | 72K | 28K | 16K | 48K | 12K | 12K |
| allkeys-minimal.txt | 500K | 148K | 64K | 204K | 36K | 28K |

I'll start working on a Zig implementation soon. Some napkin math makes me think it might be possible to shave off around 41K of allkeys-minimal.txt (with just my compression method; no clue how effective gzip/brotli will be after that), bringing it down to ~7K.

Wow, that's some impressive compression! For both original files, I removed all comments, empty lines, and unnecessary spaces, and removed all leading zeroes from numbers.

To produce Decompositions.txt from UnicodeData.txt, I also removed all fields except the code point field (#1) and the decomposition field (#6). From this latter field, I don't use the full decomposition type name enclosed in angle brackets; I only need to know whether one is present, so I kept only the initial < when present.

For the reduction from allkeys.txt to allkeys-minimal.txt, I removed all records for code points that are not in canonical decomposed form, since the collation algorithm's first step requires you to decompose the string's code points to canonical form (NFD), so non-NFD records in the file are never used by the algorithm. These are the ~2,000 fewer lines you noticed. I also removed the square brackets, extra asterisks, and dots, producing a more uniform, compact, semicolon-delimited file.
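
Expressed in Go rather than the shell scripts I actually used (see below), the reduction is roughly this; isNFD here is just a stand-in for the real check against the decomposition data, and the exact compact output format differs a bit:

```go
// Sketch of the allkeys.txt -> allkeys-minimal.txt reduction: drop
// comments and directives, drop records whose code points are not
// already NFD, and flatten the "[.XXXX.YYYY.ZZZZ]" element syntax
// into a plainer semicolon-delimited form.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isNFD is a placeholder: a real version would consult the UCD
// decomposition data to check that these code points don't decompose.
func isNFD(codepoints string) bool { return true }

func main() {
	in, err := os.Open("allkeys.txt")
	if err != nil {
		panic(err)
	}
	defer in.Close()
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, "@") {
			continue // comments, blank lines, @version/@implicitweights directives
		}
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip the trailing "# CHARACTER NAME" comment
		}
		parts := strings.SplitN(line, ";", 2)
		if len(parts) < 2 {
			continue
		}
		cps := strings.TrimSpace(parts[0])
		if !isNFD(cps) {
			continue // unreachable after NFD normalization, so never used
		}
		// Flatten "[.1FD0.0020.0008][.0000.0035.0002]" into
		// "1FD0.0020.0008;0000.0035.0002" (the real format differs a bit).
		elems := strings.TrimSpace(parts[1])
		elems = strings.NewReplacer("[.", "", "[*", "", "]", ";").Replace(elems)
		elems = strings.TrimSuffix(elems, ";")
		fmt.Fprintf(out, "%s;%s\n", cps, elems)
	}
}
```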

All this I did with sed and a little awk in shell scripts. As in your case with faster development in Go, I figure all this can be done in Zig later on, but for rapid results, the scripts are hard to beat. 😸

@jecolon With my last PR here, I'm quite happy with the results: #9

We will have brought the original two required files from 568K (with gzip) down to just 61K (with UDDC+gzip), which feels within the realm of "can be included by default in most applications (including WebAssembly ones)".

(For comparison: I created a fairly popular React-like framework in Go, and even with a Go compiler designed to produce tiny binaries despite serious limitations (TinyGo) and extremely minimal dependencies, the smallest bundle size I could get there was 97K gzipped - and that didn't include any Unicode support at all. The standard Go compiler produces WASM binaries that, when gzipped, are over half a megabyte. So 61K for full Unicode support is actually quite incredible, I think.)