Zero-copy reading of Arrow data from WebAssembly

Primary language: TypeScript. License: MIT.

arrow-js-ffi

Interpret Arrow memory across the WebAssembly boundary without serialization.

Why?

Arrow is a high-performance memory layout for analytical programs. Since Arrow's memory layout is defined to be the same in every implementation, programs that use Arrow in WebAssembly are using the same exact layout that Arrow JS implements! This means we can use plain ArrayBuffers to move highly structured data back and forth to WebAssembly memory, entirely avoiding serialization.
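To make the "plain ArrayBuffers" point concrete, here is a minimal sketch using only standard JavaScript APIs. It does not use this library; the memory instance, offsets, and variable names are made up for illustration. It shows that a typed array created on `memory.buffer` is a view on the same bytes, so nothing is copied or serialized.

```typescript
// A WebAssembly.Memory with one 64 KiB page, standing in for a real module's memory.
const demoMemory = new WebAssembly.Memory({ initial: 1 });

// Pretend the Wasm side wrote four Int32 values at byte offset 16.
const wasmSide = new Int32Array(demoMemory.buffer, 16, 4);
wasmSide.set([1, 2, 3, 4]);

// The JS side creates a typed-array view on the same bytes; nothing is copied.
const jsSide = new Int32Array(demoMemory.buffer, 16, 4);

// A later write through the first view is visible through the second one.
wasmSide[0] = 42;
const firstValue = jsSide[0];

// Both views are backed by the very same ArrayBuffer object.
const sharesMemory = jsSide.buffer === demoMemory.buffer;
```

Because the Arrow layout is the same on both sides, the values read through such a view can be used directly as Arrow buffers.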

I wrote an interactive blog post that goes into more detail on why this is useful and how this library implements Arrow's C Data Interface in JavaScript.

Usage

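This package exports three functions: parseField for parsing the ArrowSchema struct into an arrow.Field, parseVector for parsing the ArrowArray struct into an arrow.Vector, and parseRecordBatch for parsing an ArrowArray struct plus an ArrowSchema struct into an arrow.RecordBatch.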

parseField

Parse an ArrowSchema C FFI struct into an arrow.Field instance. The resulting Field is needed later when calling parseVector below.

  • buffer (ArrayBuffer): The ArrayBuffer backing the WebAssembly.Memory instance to read from (that is, its .buffer property).
  • ptr (number): The numeric pointer in buffer where the C struct is located.

import { parseField } from "arrow-js-ffi";
const WASM_MEMORY: WebAssembly.Memory = ...
const field = parseField(WASM_MEMORY.buffer, fieldPtr);
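As a rough idea of what reading an ArrowSchema struct out of an ArrayBuffer involves, here is a sketch that is not the library's implementation. The field order comes from the Arrow C Data Interface; the byte offsets assume a wasm32 target (4-byte pointers), and `readCString`, `peekSchema`, and the fake buffer contents are hypothetical names and data used only for this example.

```typescript
// Byte offsets inside ArrowSchema, assuming wasm32 (4-byte pointers).
const SCHEMA_FORMAT_OFFSET = 0; // const char* format
const SCHEMA_NAME_OFFSET = 4; // const char* name

/** Read a NUL-terminated UTF-8 string that starts at `ptr`. */
function readCString(buffer: ArrayBuffer, ptr: number): string {
  const bytes = new Uint8Array(buffer);
  let end = ptr;
  while (end < bytes.length && bytes[end] !== 0) end++;
  return new TextDecoder().decode(bytes.subarray(ptr, end));
}

/** Hypothetical helper: read only the format and name of the struct at `ptr`. */
function peekSchema(
  buffer: ArrayBuffer,
  ptr: number
): { format: string; fieldName: string } {
  const view = new DataView(buffer);
  // Wasm memory is little-endian, hence the `true` flag.
  const formatPtr = view.getUint32(ptr + SCHEMA_FORMAT_OFFSET, true);
  const namePtr = view.getUint32(ptr + SCHEMA_NAME_OFFSET, true);
  return {
    format: readCString(buffer, formatPtr),
    fieldName: readCString(buffer, namePtr),
  };
}

// Build a fake struct in a plain ArrayBuffer to exercise the reader.
const fakeSchemaMemory = new ArrayBuffer(256);
const fakeSchemaBytes = new Uint8Array(fakeSchemaMemory);
fakeSchemaBytes.set(new TextEncoder().encode("i\0"), 100); // "i" is the int32 format string
fakeSchemaBytes.set(new TextEncoder().encode("price\0"), 110);
const fakeSchemaView = new DataView(fakeSchemaMemory);
fakeSchemaView.setUint32(64 + SCHEMA_FORMAT_OFFSET, 100, true);
fakeSchemaView.setUint32(64 + SCHEMA_NAME_OFFSET, 110, true);

const peeked = peekSchema(fakeSchemaMemory, 64);
```

A full parser would also follow the metadata, flags, children, and dictionary members and map the format string to an arrow.DataType, which this sketch leaves out.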

parseVector

Parse an ArrowArray C FFI struct into an arrow.Vector instance. Multiple Vector instances can be joined to make an arrow.Table.

  • buffer (ArrayBuffer): The ArrayBuffer backing the WebAssembly.Memory instance to read from (that is, its .buffer property).
  • ptr (number): The numeric pointer in buffer where the C struct is located.
  • dataType (arrow.DataType): The type of the vector to parse. This is retrieved from field.type on the result of parseField.
  • copy (boolean): If true, data is copied across the Wasm boundary, which lets you free the original on the Wasm side. If false, the resulting arrow.Vector objects are views on Wasm memory. This requires careful usage because the views become invalid if the underlying Wasm memory changes, for example when the memory grows or the data is freed.

import { parseVector } from "arrow-js-ffi";
const WASM_MEMORY: WebAssembly.Memory = ...
// Without `copy`, the result is a view on Wasm memory
const wasmVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type);
// Copy arrays into JS instead of creating views
const copiedVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type, true);
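The difference between a view and a copy can be shown with plain typed arrays. The sketch below does not call this library; it only uses standard JavaScript APIs, and the variable names are made up. It relies on one documented behavior: growing a WebAssembly.Memory detaches its previous ArrayBuffer, so any view created on the old buffer becomes empty.

```typescript
// One 64 KiB page of Wasm memory holding three Int32 values at offset 0.
const growableMemory = new WebAssembly.Memory({ initial: 1 });
new Int32Array(growableMemory.buffer, 0, 3).set([10, 20, 30]);

// A view on Wasm memory, analogous to copy = false.
const viewOnWasm = new Int32Array(growableMemory.buffer, 0, 3);

// An independent copy in JS memory, analogous to copy = true.
const copiedToJs = new Int32Array(growableMemory.buffer, 0, 3).slice();

// Growing the memory detaches the old ArrayBuffer.
growableMemory.grow(1);

// The old view is now empty, while the copy is unaffected.
const viewLengthAfterGrow = viewOnWasm.length;
const copiedValues = Array.from(copiedToJs);

// The bytes themselves survive the grow; a fresh view on the new buffer sees them.
const freshView = new Int32Array(growableMemory.buffer, 0, 3);
```

This is one reason to prefer copying when the Wasm side may allocate between parsing the data and using it.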

parseRecordBatch

Parse an ArrowArray C FFI struct plus an ArrowSchema C FFI struct into an arrow.RecordBatch instance. Note that the underlying array and field must be of Struct type. In essence, a single Struct array stands in for a RecordBatch: each child of the struct is one column of the batch.

  • buffer (ArrayBuffer): The ArrayBuffer backing the WebAssembly.Memory instance to read from (that is, its .buffer property).
  • arrayPtr (number): The numeric pointer in buffer where the array C struct is located.
  • schemaPtr (number): The numeric pointer in buffer where the field C struct is located.
  • copy (boolean): If true, data is copied across the Wasm boundary, which lets you free the original on the Wasm side. If false, the resulting arrow.Vector objects are views on Wasm memory. This requires careful usage because the views become invalid if the underlying Wasm memory changes, for example when the memory grows or the data is freed.

import { parseRecordBatch } from "arrow-js-ffi";
const WASM_MEMORY: WebAssembly.Memory = ...
// Pass `true` to copy arrays across the boundary instead of creating views.
const recordBatch = parseRecordBatch(WASM_MEMORY.buffer, arrayPtr, fieldPtr, true);
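To illustrate how a Struct array can stand in for a record batch, the sketch below reads a few header members of an ArrowArray struct: for a Struct array, length is the row count and n_children is the column count. This is not the library's implementation. The member order comes from the Arrow C Data Interface, the byte offsets assume a wasm32 target, and `readInt64AsNumber`, `peekStructArray`, and the fake buffer are hypothetical names and data for this example.

```typescript
// Byte offsets inside ArrowArray, assuming wasm32 (the int64 members come first).
const ARRAY_LENGTH_OFFSET = 0; // int64_t length
const ARRAY_NULL_COUNT_OFFSET = 8; // int64_t null_count (-1 means "not computed")
const ARRAY_N_CHILDREN_OFFSET = 32; // int64_t n_children

/** Read a little-endian int64 as a JS number (exact only up to 2^53). */
function readInt64AsNumber(view: DataView, ptr: number): number {
  const low = view.getUint32(ptr, true);
  const high = view.getInt32(ptr + 4, true);
  return high * 4294967296 + low;
}

/** Hypothetical helper: summarize the Struct array found at `arrayPtr`. */
function peekStructArray(
  buffer: ArrayBuffer,
  arrayPtr: number
): { numRows: number; nullCount: number; numColumns: number } {
  const view = new DataView(buffer);
  return {
    numRows: readInt64AsNumber(view, arrayPtr + ARRAY_LENGTH_OFFSET),
    nullCount: readInt64AsNumber(view, arrayPtr + ARRAY_NULL_COUNT_OFFSET),
    numColumns: readInt64AsNumber(view, arrayPtr + ARRAY_N_CHILDREN_OFFSET),
  };
}

// Build a fake header in a plain ArrayBuffer: 1000 rows, unknown null count, 3 columns.
const fakeArrayMemory = new ArrayBuffer(256);
const fakeArrayView = new DataView(fakeArrayMemory);
const fakeArrayPtr = 64;
fakeArrayView.setUint32(fakeArrayPtr + ARRAY_LENGTH_OFFSET, 1000, true);
fakeArrayView.setInt32(fakeArrayPtr + ARRAY_NULL_COUNT_OFFSET, -1, true); // low half of -1
fakeArrayView.setInt32(fakeArrayPtr + ARRAY_NULL_COUNT_OFFSET + 4, -1, true); // high half of -1
fakeArrayView.setUint32(fakeArrayPtr + ARRAY_N_CHILDREN_OFFSET, 3, true);

const batchSummary = peekStructArray(fakeArrayMemory, fakeArrayPtr);
```

A real parser would go on to follow the children pointer and parse each child array as one column, using the matching child of the ArrowSchema for its type.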

Type Support

Most of the unsupported types should be pretty straightforward to implement; they just need some testing.

Primitive Types

  • Null
  • Boolean
  • Int8
  • Uint8
  • Int16
  • Uint16
  • Int32
  • Uint32
  • Int64
  • Uint64
  • Float16
  • Float32
  • Float64

Binary & String

  • Binary
  • Large Binary (Not implemented by Arrow JS but supported by downcasting to Binary.)
  • String
  • Large String (Not implemented by Arrow JS but supported by downcasting to String.)
  • Fixed-width Binary

Decimal

  • Decimal128 (failing a test)
  • Decimal256 (failing a test)

Temporal Types

  • Date32
  • Date64
  • Time32
  • Time64
  • Timestamp (with timezone)
  • Duration
  • Interval

Nested Types

  • List
  • Large List (Not implemented by Arrow JS but supported by downcasting to List.)
  • Fixed-size List
  • Struct
  • Map
  • Dense Union
  • Sparse Union
  • Dictionary-encoded arrays

Extension Types

  • Field metadata is preserved.

TODO:

  • Call the release callback on the C structs. This requires figuring out how to call C function pointers from JS.