Version: 1.2.2.
Some UTF-8 utility functions for Lua.
Tested with Lua 5.1.5, Lua 5.2.4, Lua 5.3.6, Lua 5.4.4 and LuaJIT 2.1.0-beta3 on Fedora 38.
Gets a UTF-8 sequence from a string.
local u8_seq = utf8Tools.getUCString(str, pos)
- str: The string to read.
- pos: The start index of the UTF-8 sequence.
Returns: The UTF-8 sequence as a string, or nil plus an error string if unsuccessful.
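To illustrate the idea, here is a minimal pure-Lua sketch of how a sequence can be cut out of a string based on its first byte. The names seqLength and getSequence are hypothetical and are not part of utf8Tools. The sketch only inspects the start byte and the string length, so it does not validate continuation bytes the way the library may.

```lua
-- Hypothetical helper, not part of utf8Tools: returns the expected length
-- (1 to 4) of a UTF-8 sequence from its first byte, or nil for a byte that
-- cannot start a sequence.
local function seqLength(byte)
    if byte < 0x80 then return 1        -- 0xxxxxxx: ASCII
    elseif byte < 0xC0 then return nil  -- 10xxxxxx: continuation byte
    elseif byte < 0xE0 then return 2    -- 110xxxxx
    elseif byte < 0xF0 then return 3    -- 1110xxxx
    elseif byte < 0xF8 then return 4    -- 11110xxx
    end
    return nil                          -- 0xF8..0xFF never appear in UTF-8
end

-- Hypothetical stand-in for the idea behind getUCString().
local function getSequence(str, pos)
    local byte = str:byte(pos)
    if not byte then
        return nil, "position is out of bounds"
    end
    local len = seqLength(byte)
    if not len then
        return nil, "byte at position cannot start a UTF-8 sequence"
    end
    if pos + len - 1 > #str then
        return nil, "sequence is cut off by the end of the string"
    end
    return str:sub(pos, pos + len - 1)
end

-- "a", U+00E9 (two bytes), "b"
local sample = "a\195\169b"
local seq = getSequence(sample, 2)       -- the two bytes of U+00E9
local bad, msg = getSequence(sample, 3)  -- nil: byte 3 is a continuation byte
```

Decimal escapes such as \195 are used so the sketch also runs on Lua 5.1, which lacks \x escapes.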
Searches for a UTF-8 start octet in a string. A properly encoded UTF-8 string is expected, and the function does not perform any validation.
local index = utf8Tools.stepNext(str, pos)
- str: The string to search.
- pos: The first byte index to check. Can be from 1 to #str + 1.
Returns: Index of the next starting octet, or nil if the end of the string is reached.
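The search described above can be sketched in a few lines of pure Lua. This is an illustrative sketch under the assumption that a "start octet" is any byte that is not a continuation byte (10xxxxxx). The name nextStart is hypothetical and this is not the library's code.

```lua
-- Hypothetical sketch: find the index of the first byte at or after 'pos'
-- that is not a continuation byte. No validation is performed.
local function nextStart(str, pos)
    for i = pos, #str do
        local byte = str:byte(i)
        if byte < 0x80 or byte >= 0xC0 then
            return i
        end
    end
    return nil -- reached the end of the string
end

-- Collect the start index of every sequence in a well-formed string.
local str = "a\195\169\226\130\172" -- "a", U+00E9, U+20AC
local starts = {}
local i = nextStart(str, 1)
while i do
    starts[#starts + 1] = i
    i = nextStart(str, i + 1)
end
-- starts now holds 1, 2, 4
```

Passing #str + 1 as the position simply returns nil, which is why a loop of this shape terminates without a separate bounds check.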
Checks a string for UTF-8 encoding problems and bad code point values.
local ok, pos, err = utf8Tools.check(str, [i], [j])
- str: The string to check.
- i: (1) The first byte index.
- j: (#str) The last byte index.
Returns: true if no problems found. Otherwise, false, position, and error string.
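For readers who want to see what such a check involves, the following is a rough pure-Lua sketch of a validator with the same return shape (true, or false plus a position and a message). The name checkUTF8 and the error messages are hypothetical, the sketch assumes 1 <= i and j <= #str, and the library's actual rules and messages may differ.

```lua
-- Hypothetical sketch, not the library's implementation.
local function checkUTF8(str, i, j)
    i = i or 1
    j = j or #str
    local pos = i
    while pos <= j do
        local b1 = str:byte(pos)
        -- len: sequence length, cp: payload bits of the start byte,
        -- min: smallest code point that legitimately needs 'len' bytes.
        local len, cp, min
        if b1 < 0x80 then len, cp, min = 1, b1, 0
        elseif b1 >= 0xC2 and b1 < 0xE0 then len, cp, min = 2, b1 - 0xC0, 0x80
        elseif b1 >= 0xE0 and b1 < 0xF0 then len, cp, min = 3, b1 - 0xE0, 0x800
        elseif b1 >= 0xF0 and b1 < 0xF5 then len, cp, min = 4, b1 - 0xF0, 0x10000
        else
            return false, pos, "invalid start byte"
        end
        if pos + len - 1 > j then
            return false, pos, "sequence is cut off"
        end
        for k = pos + 1, pos + len - 1 do
            local b = str:byte(k)
            if b < 0x80 or b > 0xBF then
                return false, k, "expected a continuation byte"
            end
            cp = cp * 64 + (b - 0x80)
        end
        if cp < min then
            return false, pos, "overlong encoding"
        elseif cp >= 0xD800 and cp <= 0xDFFF then
            return false, pos, "surrogate code point"
        elseif cp > 0x10FFFF then
            return false, pos, "code point is too large"
        end
        pos = pos + len
    end
    return true
end

local ok = checkUTF8("a\195\169\226\130\172")   -- well-formed
local ok2, where, why = checkUTF8("a\255")      -- 0xFF is never valid
```

In the second call, ok2 is false and where is 2, the index of the offending byte.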
Tries to convert a UTF-8 sequence within a string to a numeric code point.
local code_point, err = utf8Tools.ucStringToCodePoint(str, pos)
- str: String containing the UTF-8 sequence to convert.
- pos: Starting position in the string to check.
Returns: The code point in number form and its size as a UTF-8 sequence, or nil and an error string if a problem was detected.
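The arithmetic behind this conversion can be sketched as follows. This is an illustrative decoder, not the library's code. The name decodeAt is hypothetical, and the sketch trusts the continuation bytes instead of validating them.

```lua
-- Hypothetical sketch: decode the sequence that starts at 'pos' and return
-- the code point plus the sequence length in bytes.
local function decodeAt(str, pos)
    local b1 = str:byte(pos)
    if not b1 then
        return nil, "position is out of bounds"
    end
    local len, cp
    if b1 < 0x80 then len, cp = 1, b1
    elseif b1 >= 0xC0 and b1 < 0xE0 then len, cp = 2, b1 - 0xC0
    elseif b1 >= 0xE0 and b1 < 0xF0 then len, cp = 3, b1 - 0xE0
    elseif b1 >= 0xF0 and b1 < 0xF8 then len, cp = 4, b1 - 0xF0
    else
        return nil, "byte cannot start a UTF-8 sequence"
    end
    for k = pos + 1, pos + len - 1 do
        local b = str:byte(k)
        if not b then
            return nil, "sequence is cut off"
        end
        -- Each continuation byte contributes its low 6 bits.
        cp = cp * 64 + (b % 64)
    end
    return cp, len
end

local cp, size = decodeAt("\226\130\172", 1) -- cp is 0x20AC, size is 3
```

Multiplying by 64 and taking the remainder modulo 64 stand in for bit shifts and masks, so the same code runs on Lua 5.1 and LuaJIT, which lack the bitwise operators of Lua 5.3 and later.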
Tries to convert a code point in numeric form to a UTF-8 sequence string.
local u8_seq, err = utf8Tools.codePointToUCString(code)
- code: The code point to convert. Must be an integer.
Returns: The UTF-8 sequence in string form, or nil and an error string if there was a problem validating the UTF-8 sequence.
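The reverse direction can be sketched the same way. The function below is a hypothetical stand-in (the name encode is not from utf8Tools) that assumes surrogates and values above 0x10FFFF should be rejected, which corresponds to leaving the library's checks enabled.

```lua
-- Hypothetical sketch: encode an integer code point as a UTF-8 string.
local function encode(code)
    if type(code) ~= "number" or code % 1 ~= 0 then
        return nil, "code point must be an integer"
    elseif code < 0 or code > 0x10FFFF then
        return nil, "code point is out of range"
    elseif code >= 0xD800 and code <= 0xDFFF then
        return nil, "code point is a surrogate"
    end
    local floor = math.floor
    if code < 0x80 then
        return string.char(code)
    elseif code < 0x800 then
        return string.char(
            0xC0 + floor(code / 64),
            0x80 + code % 64)
    elseif code < 0x10000 then
        return string.char(
            0xE0 + floor(code / 4096),
            0x80 + floor(code / 64) % 64,
            0x80 + code % 64)
    end
    return string.char(
        0xF0 + floor(code / 262144),
        0x80 + floor(code / 4096) % 64,
        0x80 + floor(code / 64) % 64,
        0x80 + code % 64)
end

local euro = encode(0x20AC)          -- "\226\130\172"
local none, reason = encode(0xD800)  -- nil plus a message
```

The divisors 64, 4096 and 262144 are 2^6, 2^12 and 2^18, matching the 6 payload bits carried by each continuation byte.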
These options should be set to true unless you have special requirements.
- utf8Tools.options.check_surrogates: (true) Functions will check the Unicode surrogate range. Code points in this range are forbidden by the UTF-8 spec, but some decoders allow them through.
- utf8Tools.options.exclude_invalid_octets: (true) Functions will exclude UTF-8 sequences with bytes that are forbidden by the spec.