Primary language: Elixir. License: MIT.

Update

As of Elixir 1.16, String.replace_invalid/2 is available for UTF-8 substitution.
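As a quick illustration of the built-in alternative, the snippet below runs String.replace_invalid/2 on the same kind of input used in the Usage section. It assumes Elixir 1.16 or later; the variable names are only for illustration.

```elixir
# Requires Elixir 1.16 or later.
# 0xFF can never appear in well-formed UTF-8.
input = <<"foo", 0b11111111, "bar">>

# With no second argument, invalid bytes become U+FFFD.
default = String.replace_invalid(input)
IO.inspect(default)
# "foo�bar"

# A custom replacement string can be passed as the second argument.
custom = String.replace_invalid(input, "?")
IO.inspect(custom)
# "foo?bar"
```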

UTF-16 and UTF-32 substitution is available in elixir-unicode/unicode via Unicode.replace_invalid/3.

UniRecover

A library for substituting illegal bytes in Unicode-encoded data, following the W3C spec as suggested by the Unicode Standard.

This library leverages Erlang sub binaries to scale well with large inputs. This should suffice for most use cases, short of those that require a NIF-based solution.
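To give a rough feel for the sub-binary idea, here is a minimal sketch. It is not UniRecover's implementation: the module name SubBinarySketch and its helper are hypothetical, it handles UTF-8 only, and it replaces every invalid byte individually, which is simpler than the substitution rules the library follows. The point is only that valid runs are referenced with binary_part/3 rather than rebuilt character by character.

```elixir
defmodule SubBinarySketch do
  # Hypothetical illustration only; not part of UniRecover.
  @replacement "\uFFFD"

  # Scan the input once, tracking the start offset and byte length
  # of the current run of valid UTF-8.
  def sub(bin) when is_binary(bin), do: scan(bin, bin, 0, 0, [])

  # Valid code point: extend the current run by its encoded size.
  defp scan(<<cp::utf8, rest::binary>>, orig, start, len, acc) do
    scan(rest, orig, start, len + byte_size(<<cp::utf8>>), acc)
  end

  # Invalid byte: emit the run so far as a sub binary of the original,
  # then the replacement character, and start a new run after the byte.
  defp scan(<<_byte, rest::binary>>, orig, start, len, acc) do
    run = binary_part(orig, start, len)
    scan(rest, orig, start + len + 1, 0, [acc, run, @replacement])
  end

  # End of input: emit the final run and join the pieces once.
  defp scan(<<>>, orig, start, len, acc) do
    IO.iodata_to_binary([acc, binary_part(orig, start, len)])
  end
end

IO.inspect(SubBinarySketch.sub(<<"foo", 0b11111111, "bar">>))
# "foo�bar"
```

The accumulator holds references into the original binary plus the replacement strings, so intermediate work stays small; the final IO.iodata_to_binary/1 call still copies the output once.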

Installation

Add :uni_recover to your list of dependencies in mix.exs:

def deps do
  [
    {:uni_recover, "~> 0.1.2"}
  ]
end

Documentation is available on HexDocs and may also be generated with ExDoc.

Usage

# 0b11111111 (0xFF) = an illegal UTF-8 code sequence
UniRecover.sub(<<"foo", 0b11111111, "bar">>)
# "foo�bar"

# 216, 0 (0xD800, an unpaired high surrogate) = an illegal UTF-16 code sequence
(UniRecover.sub(<<"foo"::utf16, 216, 0, "bar"::utf16>>, :utf16)
|> :unicode.characters_to_binary(:utf16))
# "foo�bar"

Benchmarking

The following benchmark demonstrates how UniRecover leverages sub binaries, only allocating the indexes of illegal bytes. See the benchmarking folder in the repo for details.

Name                                  ips        average  deviation         median         99th %
UniRecover, 207KB Input           1842.84      542.64 μs     ±1.44%      539.67 μs      574.71 μs
Simple Rebuild, 207KB Input        172.02     5813.34 μs    ±13.88%     5534.29 μs     8223.92 μs
Naive 3-liner, 207KB Input          56.59    17670.58 μs     ±6.44%    17377.60 μs    19210.26 μs

Comparison: 
UniRecover, 207KB Input           1842.84
Simple Rebuild, 207KB Input        172.02 - 10.71x slower +5270.70 μs
Naive 3-liner, 207KB Input          56.59 - 32.56x slower +17127.94 μs

Memory usage statistics:

Name                           Memory usage
UniRecover, 207KB Input               296 B
Simple Rebuild, 207KB Input       8215208 B - 27754.08x memory usage +8214912 B
Naive 3-liner, 207KB Input       39556040 B - 133635.27x memory usage +39555744 B

For reference, the Simple Rebuild implementation allocated 39.66x the size of the original JSON input, and the Naive 3-liner fared even worse at roughly 191x.