apache/datafusion

WASM UDFs

crepererum opened this issue · 12 comments

Is your feature request related to a problem or challenge?

Databases / ETL solutions built on top of DataFusion can use UDFs (in their various forms) to extend the functionality of DataFusion, e.g. to add new scalar or aggregation functions. This extensibility however does NOT automatically extend to their users, since they cannot and (for security reasons) should not add code to the running system. So the U in UDF currently stands for "DataFusion API User", not for "End-User".

WASM provides a way to run user code in a secure sandbox under an "unknown" host (i.e. the user does not need to know about the operating system or CPU architecture). A DataFusion-based solution can use that to implement UDFs. However, since the calling convention from the WASM payload to the UDF are solution-defined, the end user is likely to have a hard time with it, and there is likely only a non-existing/small ecosystem for tooling to develop UDFs.

Defining the UDF WASM interface in DataFusion -- potentially in collaboration with the Arrow (since we need to get Arrow data across the WASM memory boundary) -- would likely facilitate a wider ecosystem and a more streamlined solution. Prior art to this is Arrow Flight, which is now being integrated into more and more server and client implementations.

Describe the solution you'd like

  1. Define/find a way to pass Arrow data in/out a WASM payload.
  2. Define the WASM calling convention for the different types of UDFs (scalar, aggregation, window functions, ...). Make sure to version that interface so we can advance it later (e.g. by using new WASM features).
  3. Implement UDFs using wasmtime in DataFusion.
  4. Offer some easy blueprint / framework to develop UDFs in at least two languages.

Describe alternatives you've considered

  • Doing this as part of the DataFusion-based solution (i.e. downstream). See drawbacks illustrated within the intro.
  • Use other UDF interface types like Arrow IPC & a Python payload. That clearly has security issues and is harder to deploy/manage.

Additional context

Projects that might be helpful:

alamb commented

BTW #9332 might be related (API to allow functions to be registered externally)

Example exposing JVM as UDF with JNI https://github.com/milenkovicm/adhesive using #9332

From my experience WASM is going to be a bit more complicated as there are many things missing.

From my experience WASM is going to be a bit more complicated as there are many things missing.

Could you elaborate this? What things / building blocks / features / ... are missing? Is it WASM stuff, or Arrow/DataFusion stuff?

Ive done few pocs some time ago, they were focused on rust calling rust generated wasm. I found that documentation and/or examples missing for anything more complex than simple usage.
Also, I found that implementations claiming having complex type passing between rust and wasm have it disabled due to bugs/performance issues.

Due to all this I personally found it very complex to grasp, and not fit for my use case. Also, from this perspective, I believe I failed to understand importance of wasm bindgen as well. I gave up before reaching stage to add arrow.

Now, don't get discouraged, looks like there is progress in wasm community. Personally it looks wasm edge may be the way to go and I'll try to replace JVM implementation with wasm in near future, from JVM experience it should not take more than few days work to get basic example running.

I hope this helps

Due to all this I personally found it very complex to grasp, and not fit for my use case. Also, from this perspective, I believe I failed to understand importance of wasm bindgen as well. I gave up before reaching stage to add arrow.

I totally agree that mapping Arrow through the WASM boundary isn't a trivial task, which is also why I've opened this ticket and I wanna have one impl. and SDKs / blueprints for different UDF "guest" languages, so people don't have to suffer through this pain on their own.

My understanding is that any interaction with wasm would involve data copy across boundary. (schema, data) : (Vec<u8>,Vec<u8>) to and from wasm function, if that's the case interaction with wasm can be done as in https://github.com/second-state/wasmedge-rustsdk-examples/tree/main/wasmedge-bindgen

Do I miss something?

On that note, what would be the simplest way to convert [ArrayRef] and [DataType] to Vec<u8> and vice verse 😀?

I totally agree that mapping Arrow through the WASM boundary isn't a trivial task

It's very tied to JS at the moment, but you can use https://github.com/kylebarron/arrow-js-ffi to view Arrow data across the Wasm boundary in the cases where you want to persist the data in Wasm. I'm not sure how this would work with server-side wasm though. It seems like you could use the Arrow C Data interface across the Wasm boundary directly, though? The main work in Arrow JS FFI is just implementing the C Data Interface in JS, but it already exists in Rust.

This (and a related/successor project I've been working on https://github.com/kylebarron/arrow-wasm) are both quite tied to wasm-bindgen and browser use cases. It's not clear to me how they'd work in WASI.

I just read your article @kylebarron https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly 👍🏻 few minutes ago

This (and a related/successor project I've been working on https://github.com/kylebarron/arrow-wasm) are both quite tied to wasm-bindgen and browser use cases. It's not clear to me how they'd work in WASI.

https://lib.rs/crates/wasmedge-bindgen does not look tided up to browser/js

It might be relevant #9834

Fresh off the press, a quick WASM UDF poc built on top of WasmEdge library

https://github.com/milenkovicm/wasaffi

It is working, but far from being production ready.

Data needs to be copied to wasm VM ( arrow ipc-ed to local buffer then buffer copied across using wasmedge bindgen, and same for function result ). Most of the complexity can be hidden with a simple macro.

One note, ipc buffer contains schema definition as well, can this be avoided if both parties know the schema?

Most of the complexity can be hidden with a simple macro.

That was my hope. I think we should make sure though that the FFI WASM API is well defined and that other guest languages (like C++ or Python) could also implement that boilerplate w/o needing to go through a Rust layer within the guest (otherwise the WASM payloads get rather large).

One note, ipc buffer contains schema definition as well, can this be avoided if both parties know the schema?

Yes. This is for example what Arrow Flight (SQL) also uses. The client can query a server w/o knowing the schema a priori.

IMHO, I dont think there is too many problems on datafusion/arrow side, WASM side FFI interface is still left to be desired, and in constant state of flux.