Binary-to-text encoding highly optimised for UTF-16

base32768

Base32768 is a binary encoding optimised for UTF-16-encoded text. This JavaScript module, base32768, is the first implementation of this encoding.

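The table below compares Base32768 with other binary-to-text encodings, measured as the share of the output's bits that carry payload when the text is stored as UTF-8, UTF-16 or UTF-32. Efficiency ratings are averaged over long inputs. Higher is better.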

                                        Efficiency
Repertoire          Encoding            UTF-8    UTF-16   UTF-32   Bytes per Tweet *
ASCII-constrained   Unary / Base1       0%       0%       0%       1
                    Binary              13%      6%       3%       35
                    Hexadecimal         50%      25%      13%      140
                    Base64              75%      38%      19%      210
                    Base85 †            80%      40%      20%      224
BMP-constrained     HexagramEncode      25%      38%      19%      105
                    BrailleEncode       33%      50%      25%      140
                    Base2048            56%      69%      34%      385
                    Base32768           63%      94%      47%      263
Full Unicode        Ecoji               31%      31%      31%      175
                    Base65536           56%      64%      50%      280
                    Base131072 ‡        53%+     53%+     53%      297

* New-style "long" Tweets, up to 280 Unicode characters give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation, etc.
‡ Base131072 is a work in progress, not yet ready for general use.
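
The percentages in the table can be approximated with simple arithmetic: payload bits per character divided by the bits that character occupies in a given Unicode encoding form. The sketch below is only illustrative. The helper names (`efficiencyPercent`, `row`) are hypothetical, and the figure of 3 UTF-8 bytes per Base32768 character is an assumption about a typical character rather than a statement about every character in the repertoire.

```javascript
// Share of the encoded text's bits that carry payload, as a rounded percentage.
function efficiencyPercent(payloadBitsPerChar, encodedBytesPerChar) {
  return Math.round((payloadBitsPerChar / (encodedBytesPerChar * 8)) * 100)
}

// Assumed per-character cost, in bytes, for each Unicode encoding form.
const base64 = { bits: 6, utf8: 1, utf16: 2, utf32: 4 }
// utf8: 3 is an assumption about a typical BMP character, not an exact figure.
const base32768 = { bits: 15, utf8: 3, utf16: 2, utf32: 4 }

// Returns [UTF-8, UTF-16, UTF-32] efficiency for one encoding.
function row({ bits, utf8, utf16, utf32 }) {
  return [utf8, utf16, utf32].map(bytes => efficiencyPercent(bits, bytes))
}

console.log(row(base64))    // [ 75, 38, 19 ]
console.log(row(base32768)) // [ 63, 94, 47 ]
```

Under these assumptions the results agree with the Base64 and Base32768 rows above.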

Base32768 uses only "safe" Unicode code points: no unassigned code points, no whitespace, no control characters, etc.

Installation

npm install base32768

Usage

import { encode, decode } from 'base32768'

const uint8Array = new Uint8Array([104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100])
const str = encode(uint8Array)
console.log(str)
// 6 code points, '媒腻㐤┖ꈳ埳'

const uint8Array2 = decode(str)
console.log(uint8Array2)
// [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]

API

base32768.encode(uint8Array)

Encodes a Uint8Array and returns a Base32768 String. Note that every Node.js Buffer is a Uint8Array.
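
As a quick illustration of the note about Buffers, the following sketch uses only Node.js built-ins to show that a Buffer passes an `instanceof Uint8Array` check and holds the same bytes as the array in the Usage section. It does not call this module; the implication that such a Buffer can be handed straight to `encode` follows from the note above.

```javascript
// Build a Buffer from text using Node.js built-ins only.
const buffer = Buffer.from('hello world', 'utf8')

// Buffer is a subclass of Uint8Array in Node.js.
console.log(buffer instanceof Uint8Array) // true

// Same bytes as the Uint8Array literal in the Usage section.
console.log(Array.from(buffer))
// [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```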

The string is suitable for passing safely through almost any "Unicode-clean" text-handling API. This string contains no special characters and is immune to Unicode normalization. Give or take some padding characters, the output string has 1 character per 15 bits of input.

All characters are chosen from the Basic Multilingual Plane. This means that when encoded as UTF-16, all characters occupy 16 bits. Thus, there are 16 bits of output UTF-16 text per 15 bits of input, an efficiency of 93.75%.
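
These properties can be spot-checked on the sample output from the Usage section without calling the module at all. The sketch below is a check on that one string, not a proof for every possible output. `estimateLength` is a hypothetical helper that approximates the output length as one character per 15 input bits, rounded up to allow for padding.

```javascript
// Sample output copied from the Usage section (11 input bytes).
const str = '媒腻㐤┖ꈳ埳'

// None of the four Unicode normalization forms should alter the string.
const stable = ['NFC', 'NFD', 'NFKC', 'NFKD'].every(form => str.normalize(form) === str)

// BMP-only text uses exactly one UTF-16 code unit per code point.
const codePoints = [...str].length
const codeUnits = str.length

// Rough estimate: one character per 15 input bits, rounded up for padding.
const estimateLength = byteCount => Math.ceil((byteCount * 8) / 15)

console.log(stable)             // true
console.log(codePoints)         // 6
console.log(codeUnits)          // 6
console.log(estimateLength(11)) // 6
```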

base32768.decode(str)

Decodes a Base32768 String and returns a Uint8Array containing the original binary data. Note that a Uint8Array can be converted to a Node.js Buffer like so:

const buffer = Buffer.from(uint8Array.buffer, uint8Array.byteOffset, uint8Array.byteLength)
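
The offset and length arguments matter whenever a Uint8Array is a view onto part of a larger ArrayBuffer. The sketch below uses a hand-made view (the names `backing`, `view`, `whole` and `exact` are illustrative) to show the difference; whether `decode` ever returns such a view is not assumed here.

```javascript
// A Uint8Array that views only part of its underlying ArrayBuffer.
const backing = new Uint8Array([0, 1, 2, 3, 4, 5])
const view = backing.subarray(2, 5) // bytes 2, 3, 4; byteOffset is 2

// Without offset and length, the Buffer covers the whole ArrayBuffer.
const whole = Buffer.from(view.buffer)
// With them, the Buffer covers exactly the bytes of the view.
const exact = Buffer.from(view.buffer, view.byteOffset, view.byteLength)

console.log(whole.length, exact.length) // 6 3
console.log(Array.from(exact))          // [ 2, 3, 4 ]

// The Buffer shares memory with the Uint8Array; no copy is made.
exact[0] = 99
console.log(view[0]) // 99
```

Because the memory is shared, use `Buffer.from(uint8Array)` instead if an independent copy is wanted.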

License

MIT