base32768

Base32768 is a binary encoding optimised for UTF-16-encoded text. This JavaScript module, base32768, is the first implementation of this encoding.

The table below compares Base32768 with other binary encodings. Efficiency is the ratio of input bits to encoded output bits, averaged over long inputs. Higher is better.

Character set	Encoding	Efficiency (UTF‑8)	Efficiency (UTF‑16)	Efficiency (UTF‑32)	Bytes per Tweet *
ASCII‑constrained	Unary / Base1	0%	0%	0%	1
	Binary	13%	6%	3%	35
	Hexadecimal	50%	25%	13%	140
	Base64	75%	38%	19%	210
	Base85 †	80%	40%	20%	224
BMP‑constrained	HexagramEncode	25%	38%	19%	105
	BrailleEncode	33%	50%	25%	140
	Base2048	56%	69%	34%	385
	Base32768	63%	94%	47%	263
Full Unicode	Ecoji	31%	31%	31%	175
	Base65536	56%	64%	50%	280
	Base131072 ‡	53%+	53%+	53%	297

* New-style "long" Tweets, up to 280 Unicode characters, give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters that are considered hazardous for general use in text: escape characters, brackets, punctuation, etc.
‡ Base131072 is a work in progress, not yet ready for general use.

Base32768 uses only "safe" Unicode code points: no unassigned code points, no whitespace, no control characters, etc.

Installation

npm install base32768

Usage

import { encode, decode } from 'base32768'

const uint8Array = new Uint8Array([104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100])
const str = encode(uint8Array)
console.log(str)
// 6 code points, '媒腻㐤┖ꈳ埳'

const uint8Array2 = decode(str)
console.log(uint8Array2)
// [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

API

base32768.encode(uint8Array)

Encodes a Uint8Array and returns a Base32768 String. Note that every Node.js Buffer is a Uint8Array.
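As a small illustration of the note above, the following sketch shows two common ways to obtain a Uint8Array that could be passed to encode. It relies only on Node.js globals (Buffer, TextEncoder); the helper name toBytes is hypothetical and not part of base32768.

```javascript
// Hypothetical helper (not part of base32768): turn a string into UTF-8 bytes.
function toBytes (text) {
  return new TextEncoder().encode(text)
}

const fromEncoder = toBytes('hello world')
const fromBuffer = Buffer.from('hello world', 'utf8')

// A Node.js Buffer is a subclass of Uint8Array, so no conversion is needed.
console.log(fromBuffer instanceof Uint8Array) // true
console.log(fromEncoder instanceof Uint8Array) // true

// Both hold the same bytes as the array in the Usage section.
console.log(Array.from(fromEncoder).join(',') === Array.from(fromBuffer).join(',')) // true
```

Either value can be handed to encode directly, since both are Uint8Array instances.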

The string is suitable for passing safely through almost any "Unicode-clean" text-handling API. This string contains no special characters and is immune to Unicode normalization. Give or take some padding characters, the output string has 1 character per 15 bits of input.
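The claims above can be spot-checked on the sample string from the Usage section using only built-in String methods. The sketch below assumes the output length is roughly ceil(bits / 15); that is an approximation inferred from the "give or take some padding" wording, not a documented formula, and estimateLength is a hypothetical helper.

```javascript
// Sample output copied from the Usage section (encodes the 11 bytes of "hello world").
const sample = '媒腻㐤┖ꈳ埳'

// Hypothetical helper: approximate output length, assuming 1 character per
// 15 bits of input with the final partial group rounded up to one character.
function estimateLength (byteLength) {
  return Math.ceil((byteLength * 8) / 15)
}

// 11 bytes = 88 bits, which rounds up to 6 characters.
console.log(estimateLength(11), sample.length)

// Check that the string is unchanged under all four Unicode normalization forms.
const forms = ['NFC', 'NFD', 'NFKC', 'NFKD']
const stable = forms.every(form => sample.normalize(form) === sample)
console.log(stable)
```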

All characters are chosen from the Basic Multilingual Plane. This means that when encoded as UTF-16, all characters occupy 16 bits. Thus, there are 16 bits of output UTF-16 text per 15 bits of input, an efficiency of 93.75%.
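The arithmetic behind that figure, and behind the other Base32768 entries in the efficiency table, can be sketched as follows. The efficiency function is illustrative only, and the UTF-8 line assumes a typical output character occupies 3 bytes, which is an assumption rather than something stated above.

```javascript
// Efficiency = input bits carried per character / bits used to store that character.
function efficiency (payloadBits, encodedBits) {
  return (payloadBits / encodedBits) * 100
}

const payloadBits = 15 // bits of input per Base32768 character

console.log(efficiency(payloadBits, 16)) // UTF-16: 93.75
console.log(efficiency(payloadBits, 32)) // UTF-32: 46.875
// Assumption: a typical character occupies 3 bytes (24 bits) in UTF-8.
console.log(efficiency(payloadBits, 24)) // UTF-8: 62.5
```

Rounded to whole percentages, these give the 94%, 47% and 63% shown in the table.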

base32768.decode(str)

Decodes a Base32768 String and returns a Uint8Array containing the original binary data. Note that a Uint8Array can be converted to a Node.js Buffer like so:

const buffer = Buffer.from(uint8Array.buffer, uint8Array.byteOffset, uint8Array.byteLength)
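The byteOffset and byteLength arguments matter when the Uint8Array is a view onto part of a larger ArrayBuffer. The following sketch uses only standard Node.js APIs with illustrative variable names; it shows that the resulting Buffer covers just the view's bytes and shares memory with it rather than copying.

```javascript
// A 4-byte backing store, with a view covering only the middle two bytes.
const backing = new Uint8Array([0, 104, 105, 0])
const view = backing.subarray(1, 3)

// Without the offset and length, the Buffer would span all four backing bytes.
const whole = Buffer.from(view.buffer)
const exact = Buffer.from(view.buffer, view.byteOffset, view.byteLength)

console.log(whole.length) // 4
console.log(exact.length) // 2
console.log(exact.toString('latin1')) // 'hi'

// The Buffer shares memory with the view, so writes are visible in both.
view[0] = 72
console.log(exact[0]) // 72
```

If an independent copy is needed instead, Buffer.from(uint8Array) copies the bytes.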

License

MIT