A TypeScript library for handling UTF-8 encoded strings. It safely processes byte arrays that contain invalid UTF-8 sequences.
- Splits byte arrays into valid UTF-8 strings and invalid byte sequences
- Compliant with UTF-8 encoding specification (RFC 3629)
- Provides iterator interface for processing large data streams
- Fully type-safe
npm install utf8_chunks

import { Utf8Chunks, Utf8Chunk } from 'utf8_chunks';
// Create a byte array containing invalid UTF-8 sequences
const bytes = new Uint8Array([0x66, 0x6F, 0x6F, 0xF1, 0x80, 0x62, 0x61, 0x72]);
// Process the byte array using Utf8Chunks iterator
const chunks = new Utf8Chunks(bytes);
// Get the first chunk from the iterator
const chunk = chunks[Symbol.iterator]().next().value;
// Get the valid UTF-8 string portion
console.log(chunk.valid); // Output: "foo"
// Get the invalid byte sequence
console.log(chunk.invalid); // Output: Uint8Array(2) [241, 128]

interface Utf8Chunk {
  valid: string; // Valid UTF-8 string
  invalid: Uint8Array; // Invalid byte sequence
}

The `Utf8Chunks` class implements the `Iterable<Utf8Chunk>` interface for iterating over byte arrays.
class Utf8Chunks implements Iterable<Utf8Chunk> {
  constructor(source: Uint8Array);
  [Symbol.iterator](): Iterator<Utf8Chunk>;
}

- Invalid byte sequences have a maximum length of 3 bytes
- If `invalid` is empty, this is the last chunk in the string
- If `invalid` is non-empty, an unexpected byte was encountered or the input ended unexpectedly
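
The notes above on `invalid` suggest one common way to consume the chunks: keep every `valid` string and substitute the replacement character U+FFFD wherever `invalid` is non-empty. The following is a minimal sketch of that idea. `decodeLossy` is a hypothetical helper, not part of this library, and the `sample` array is hand-written to mirror what the usage example implies (the trailing `"bar"` chunk with an empty `invalid` is an assumption based on the notes).

```typescript
// Chunk shape as documented above, redeclared so the sketch is self-contained.
interface Utf8Chunk {
  valid: string;
  invalid: Uint8Array;
}

// Hypothetical helper (not part of utf8_chunks): concatenates the valid text
// of every chunk and emits one U+FFFD per invalid byte sequence.
function decodeLossy(chunks: Iterable<Utf8Chunk>): string {
  let out = "";
  for (const { valid, invalid } of chunks) {
    out += valid;
    if (invalid.length > 0) {
      out += "\uFFFD";
    }
  }
  return out;
}

// Hand-written chunks standing in for the iterator's output.
const sample: Utf8Chunk[] = [
  { valid: "foo", invalid: new Uint8Array([0xf1, 0x80]) },
  { valid: "bar", invalid: new Uint8Array(0) },
];

console.log(decodeLossy(sample)); // "foo" + U+FFFD + "bar"
```

Because the helper only depends on `Iterable<Utf8Chunk>`, it should also accept `new Utf8Chunks(bytes)` directly, given the class signature shown above.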
This library is inspired by Rust's std::str::Utf8Chunk implementation.
For more details about the original Rust implementation, see Rust's Utf8Chunk documentation.
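
For readers curious how this kind of splitting can work, here is a rough sketch modelled on the scanning approach used by Rust's standard library. It is not this library's source: `splitUtf8` and `leadInfo` are hypothetical names, and the code assumes a runtime with a global `TextDecoder` (Node.js or a browser). It also shows why an invalid sequence never exceeds 3 bytes: a 4-byte sequence can be rejected at the latest on its fourth byte.

```typescript
interface Utf8Chunk {
  valid: string;
  invalid: Uint8Array;
}

// For a lead byte, returns the total sequence width and the allowed range of
// the byte that follows it (RFC 3629 well-formed sequences). Returns null for
// bytes that can never start a sequence (0x80..0xC1 and 0xF5..0xFF).
function leadInfo(lead: number): { width: number; lo: number; hi: number } | null {
  if (lead >= 0xc2 && lead <= 0xdf) return { width: 2, lo: 0x80, hi: 0xbf };
  if (lead === 0xe0) return { width: 3, lo: 0xa0, hi: 0xbf }; // no overlong forms
  if (lead === 0xed) return { width: 3, lo: 0x80, hi: 0x9f }; // no surrogates
  if (lead >= 0xe1 && lead <= 0xef) return { width: 3, lo: 0x80, hi: 0xbf };
  if (lead === 0xf0) return { width: 4, lo: 0x90, hi: 0xbf }; // no overlong forms
  if (lead === 0xf4) return { width: 4, lo: 0x80, hi: 0x8f }; // max U+10FFFF
  if (lead >= 0xf1 && lead <= 0xf3) return { width: 4, lo: 0x80, hi: 0xbf };
  return null;
}

// Hypothetical generator: yields the longest valid prefix as a string together
// with the invalid byte sequence (if any) that ended it.
function* splitUtf8(source: Uint8Array): Generator<Utf8Chunk> {
  const decoder = new TextDecoder("utf-8", { fatal: true, ignoreBOM: true });
  let start = 0;
  while (start < source.length) {
    let validEnd = start; // end of the valid prefix found so far
    let i = start; // scan cursor
    scan: while (i < source.length) {
      const lead = source[i++]!;
      if (lead >= 0x80) {
        const info = leadInfo(lead);
        if (info === null) break scan;
        for (let k = 1; k < info.width; k++) {
          const next = i < source.length ? source[i]! : -1; // -1 marks end of input
          const lo = k === 1 ? info.lo : 0x80;
          const hi = k === 1 ? info.hi : 0xbf;
          if (next < lo || next > hi) break scan;
          i++;
        }
      }
      validEnd = i; // the whole sequence was accepted
    }
    yield {
      valid: decoder.decode(source.subarray(start, validEnd)),
      invalid: source.slice(validEnd, i),
    };
    start = i;
  }
}

const bytes = new Uint8Array([0x66, 0x6f, 0x6f, 0xf1, 0x80, 0x62, 0x61, 0x72]);
for (const chunk of splitUtf8(bytes)) {
  // First chunk: valid "foo", invalid bytes 0xF1 0x80.
  // Second chunk: valid "bar", invalid empty.
  console.log(chunk.valid, chunk.invalid);
}
```

The sketch validates bytes by hand and only uses `TextDecoder` to turn an already-validated span into a string, so the decoder's `fatal` mode acts as a safety net rather than as the error detector.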
MIT