jvail/spl.js

Implementing range-request lookups

alukach opened this issue · 24 comments

Reading through the seminal Hosting SQLite databases on Github Pages blogpost, I see that the author did the following:

implemented a virtual file system that fetches chunks of the database with HTTP Range requests when SQLite tries to read from the filesystem

The output of this is captured in sql.js-httpvfs.

What is the feasibility of adding such a feature to this codebase? I'd like to be able to query a large (300MB+) Spatialite database without requiring the end-user to load the file in its entirety.

I am open to the idea of contributing to this feature but am not well versed in WASM, so this would be a bit of a learning exercise on my part.

jvail commented

I have implemented such a feature to be able to query proj.db remotely. It works like this: https://github.com/jvail/spl.js/blob/main/examples/openlayers.html#L50

The implementation is here: https://github.com/jvail/spl.js/blob/main/src/pre.js#L52 It is just a simple extension of workerfs. I am sure it can be improved. I thought I'd leave caching to SQLite's page caching mechanism. If you have suggestions to improve it, then of course any idea/code is welcome.

jvail commented

I played a bit with the widgets on the blog page - interesting and much more efficient. As far as I can see from the logs, the implementation just doubles the bytes fetched on each sequential page requested. It may even cache the entire database... I am not sure if I like that idea.

But looking at the hundreds of requests spl.js needs just to fetch data from proj.db for one simple query, I am sure this needs improvement...
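For reference, the doubling strategy described above could be sketched like this (an illustration only, not the actual sql.js-httpvfs code; all names are made up): grow the request size while reads stay sequential, and reset to one page on random access.

```javascript
// Sketch of an exponential read-ahead planner: consecutive sequential
// reads double the fetched chunk (up to maxLength); a non-sequential
// read resets the chunk back to a single page.
function makePrefetchPlanner(pageSize, maxLength) {
  let lastEnd = -1;       // where the previous planned read ended
  let chunk = pageSize;   // current read-ahead size in bytes

  return function plan(offset) {
    if (offset === lastEnd) {
      chunk = Math.min(chunk * 2, maxLength); // sequential: double
    } else {
      chunk = pageSize;                       // random access: reset
    }
    lastEnd = offset + chunk;
    return { offset, length: chunk };
  };
}
```

Sequential scans would then converge toward few large requests, while point lookups keep fetching single pages.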

Thanks, this is interesting, I was not aware that such features already existed.

Looking through the docs & code samples, I am not understanding how someone would reference a hosted Spatialite DB without loading it in its entirety via fetch().

Here's more or less what I've been trying; however, the records object is always undefined.

import SPL from "spl.js";

SPL().then((spl) =>
  spl
    .mount("db", [
      {
        name: "data",
        data: "https://foo.com/spatialite.db",
      },
    ])
    .db("file:db/data?immutable=1")
    .exec("SELECT name FROM sqlite_schema WHERE type='table' ORDER BY name")
    .get.objs.then((records) => {
      console.log({ records });
      return records;
    })
);
jvail commented

Not sure what the 'foo.com' database you tested was. Here is a simple example. If you place it in the examples folder of this repo and run serve . at the top level, it will give the desired result:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <script type="module">

        import SPL from '../dist/index.js';

        (async () => {
            try {

                SPL().then((spl) =>
                    spl.mount("db", [{
                            name: "data",
                            data: new URL('../dist/proj/proj.db', window.location.href).toString(),
                        }])
                        .db("file:db/data?immutable=1")
                        .exec("SELECT name FROM sqlite_schema WHERE type='table' ORDER BY name")
                        .get.objs.then((records) => {
                            console.log({ records });
                            return records;
                        })
                    );

            } catch (err) {
                console.log(err);
            }
        })();

    </script>
  </body>
</html>

Looking through the docs & code samples, I am not understanding how someone would reference a hosted Spatialite DB without loading it in its entirety via fetch().

When the data attribute's value is a string/a valid URL, then inside WORKERFS it will fetch the requested data/pages with XHR (instead of reading them from a blob/arraybuffer).
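To illustrate the mechanics (a sketch only, not the actual pre.js implementation; `rangeHeader` and `readRangeSync` are hypothetical names): inside a worker, a synchronous XHR with a Range header can serve a file-system read callback, since workers still permit sync XHR.

```javascript
// Builds the HTTP Range header value for a byte window.
// HTTP ranges are inclusive, hence the -1 on the end offset.
function rangeHeader(offset, length) {
  return `bytes=${offset}-${offset + length - 1}`;
}

// Sketch of a synchronous ranged read, usable from inside a worker's
// read callback (setting responseType on a sync XHR is allowed in workers).
function readRangeSync(url, offset, length) {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, false); // false = synchronous
  xhr.responseType = "arraybuffer";
  xhr.setRequestHeader("Range", rangeHeader(offset, length));
  xhr.send(null);
  if (xhr.status !== 206 && xhr.status !== 200) {
    throw new Error(`range request failed: ${xhr.status}`);
  }
  return new Uint8Array(xhr.response);
}
```

The server must answer with 206 Partial Content and a correct Content-Length/Content-Range for this to work, which is exactly what tripped up the dev-server case mentioned below in the thread.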

Thanks for all of this information! It turns out that my issue of an empty table was due to my development server not returning Content-Length details on HEAD requests (parcel-bundler/parcel#8405) (something I thought I had rectified locally).

All is working as expected now 🙌

How possible would it be to modify the simple API:

        import SPL from '../dist/index.js';

        (async () => {
            try {

                SPL().then((spl) =>
                    spl.mount("db", [{
                            name: "data",
                            data: new URL('../dist/proj/proj.db', window.location.href).toString(),
                        }])
                        .db("file:db/data?immutable=1")
                        .exec("SELECT name FROM sqlite_schema WHERE type='table' ORDER BY name")
                        .get.objs.then((records) => {
                            console.log({ records });
                            return records;
                        })
                    );

            } catch (err) {
                console.log(err);
            }
        })();

to take a function data: fn that gets passed the range request variables? That'd allow us to pipe in any ole mechanism independent of fetch or buffers and return the necessary info.

I'm looking at some webrtc function where clients have their own potentially large databases and it'd be nice to be able to pass in just the range of necessary details.

I could probably write it if it's in plain JavaScript and you point me to the entry point for this function.

jvail commented

Hello @disarticulate,

I think the data option could for sure be extended to represent any other object or even a function (... though passing the function to the worker would require some additional work), since there is already a range of things that data may represent, like Buffer, File, URL etc.

to take a function data: fn that gets passed the range request variables? That'd allow us to pipe in any ole mechanism independent of fetch or buffers and return the necessary info.

It is not entirely clear to me what you are trying to achieve; what would "range request variables" be other than the URL?

Do you think you could sketch a minimal example (JS and/or pseudo code is just fine) to better describe the mechanics? I thought those remote DBs via range requests would be more a gimmick (also work around for proj.db) rather than a real use case. But if you have a serious use case then let's give it a try!

so i'd like to pass the getter of: https://github.com/mafintosh/abstract-chunk-store (or it's getter) to the data as:

const chunkStore = new ChunkStore(chunkLength) // user specific
const dataFn = ({ offset, length }, cb) => chunkStore.get(0, { offset, length }, cb)
// I think it's more standard to do (range, cb) when doing callbacks
// One can test if they have a promise and implement it without a cb
/*
const isFunction = (fn) => {
  return fn && (typeof fn === "function")
};
const isPromise = (p) => {
  return (
    p !== null &&
    typeof p === 'object' &&
    typeof p.then === 'function' &&
    typeof p.catch === 'function'
  )
};
*/
SPL().then((spl) =>
    spl.mount("db", [{
        name: "data",
        data: dataFn
    }])
)

// pre-fetch calling request with offset and length inside the library,
// alongside whatever initiates fetch/xhr via range request (pseudocode):
if (isPromise(data)) {
  try {
    const bytes = await dataFn({ offset, length })
    worker.send('rangerequested', bytes)
  } catch (e) {
    console.warn(e) // I don't let bad user fns crash my code here
  }
} else if (isFunction(data)) {
  dataFn({ offset, length }, (bytes) => worker.send('rangerequested', bytes))
}

I'm not sure how the range request is implemented, but all we're looking for is a byteLength and offset (Range) to be passed to either a promise function or a function with a callback.

The uses I'm specifically looking at revolve around webrtc and peers querying each other's dbs in the browser. If I can pass in a getter for a chunk store (or just any function with my preferred storage location of a buffer) and the library has to wait for me (the same as a fetch in a range request) to give it the range (rather than the fetch request), then I can let people query one another's database sitting in their browser. Which just by itself is neat.

I think the web worker setup would just mean passing the results of the function into the worker, which I guess does have some overhead if we're trying to offload everything into a web worker, but in the apps I'm working with I already have a bunch of other web workers and would need to pipe data into and out of them anyway via a function in the main thread.

jvail commented

A few questions and comments:

I don't understand how the worker could communicate with the chunk store. First of all, async won't work because of SQLite. It needs to be synchronous (luckily XHR is, and it is still available in workers). Could it synchronously fetch data via a WebRTC connection? I don't have any experience with WebRTC, but it is not available in workers as far as I can see ...

Another idea would be to allow a SharedArrayBuffer as data and figure out some kind of API around it working with Atomics. So the worker would effectively pause while waiting for the data to be written into the buffer (whatever the source may be ...) - needs to be read-only as well, I guess. The worker could write offset and length to the buffer to notify the main thread that data needs to be fetched. Sounds a bit weird but I can not think of another way to work around the sync limitation - if XHR is not used. This is a bit similar to what absurd-sql does - if I remember correctly.
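The handshake sketched above could look roughly like this (all slot names and states are invented for illustration; nothing here is existing spl.js code): the worker publishes offset/length in a shared Int32Array header, then blocks on Atomics.wait until the filling side flips the state slot and notifies.

```javascript
// Hypothetical slot layout in a shared Int32Array header.
const STATE = 0, OFFSET = 1, LENGTH = 2;
const IDLE = 0, REQUEST = 1, READY = 2;

// Worker side: publish the byte range, then block until the filler
// side sets STATE to something other than REQUEST. The optional
// timeout keeps this sketch testable on a single thread.
function requestRange(header, offset, length, timeoutMs) {
  header[OFFSET] = offset;
  header[LENGTH] = length;
  Atomics.store(header, STATE, REQUEST);
  Atomics.notify(header, STATE); // wake a waiting filler, if any
  return Atomics.wait(header, STATE, REQUEST, timeoutMs);
}

// Filler side (main thread or another worker): copy the fetched
// bytes into the shared data region and signal the waiting worker.
function fulfillRange(header, data, bytes) {
  data.set(bytes, 0);
  Atomics.store(header, STATE, READY);
  Atomics.notify(header, STATE);
}
```

In a real setup the worker would pass no timeout and stay blocked until fulfillRange runs on another thread; since Atomics.wait is forbidden on the browser main thread, the fetching side would need waitAsync or message passing, which is exactly the complication discussed later in this thread.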

so that looks documented here in rough?:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer
https://exploringjs.com/es2016-es2017/ch_shared-array-buffer.html

This is a very simple example that always works, thanks to Atomics (I’ve omitted setting up sharedArray):

// main.js
console.log('notifying...');
Atomics.store(sharedArray, 0, 123);

// worker.js
while (Atomics.load(sharedArray, 0) !== 123) ;
console.log('notified');

//Using a while loop to wait for a notification is not very efficient, which is why Atomics has operations that help:
Atomics.wait(ta: Int32Array, index, value, timeout)
// waits for a notification at ta[index], but only if ta[index] is value.
Atomics.wake(ta : Int32Array, index, count) // since renamed to Atomics.notify
// wakes up count workers that are waiting at ta[index].

The while loop would make it sync the same way a sync XHR does (this stuff is over my head at this point).

So our data function then would be something like?

const sab = new SharedArrayBuffer(1024); // I assume this has to be the size of the range page request (absurd-sql), right?
const sabNotice = new SharedArrayBuffer(4); // I assume we need to notify when operations are done; it sounds like you could do it with just a single slot; in my mind, you write the sabNotice slot when everything's complete.
SPL().then((spl) =>
    spl.mount("db", [{
        name: "data",
        data: { sab, sabNotice }
    }])
)

// worker.js
Atomics.wait(new Int32Array(sabNotice), index, value, timeout)

// userland.js
const sabArr = new Int8Array(sab)
// get data and set to sabArr, however you want; it'll definitely be async.
const arr = new Int32Array(sabNotice)
arr[0] = 1

This is over my head (and the two-way communication is confusing unless I can code at it until it's right); I don't fully follow what's going on, other than that it seems to satisfy the sync requirement between the worker and SQLite, as you indicate.

Sadly, this is the best I can do for assistance, unless you want me to test and write up documentation.

jvail commented

Yes, something like this. But maybe one buffer would be sufficient (defining some sort of header and data section indices). So it needs to have a length of header + database page size.
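Under that assumption, a single buffer with a header and a data section might be laid out like this (a sketch; the slot names and counts are made up):

```javascript
// Hypothetical layout: a few Int32 control slots up front,
// followed by page-size bytes of payload data.
const HEADER_SLOTS = 4; // e.g. state, offset, length, bytesWritten
const HEADER_BYTES = HEADER_SLOTS * Int32Array.BYTES_PER_ELEMENT;

function makeSharedPageBuffer(pageSize) {
  const sab = new SharedArrayBuffer(HEADER_BYTES + pageSize);
  return {
    sab,
    header: new Int32Array(sab, 0, HEADER_SLOTS),       // handshake slots
    data: new Uint8Array(sab, HEADER_BYTES, pageSize),  // one page of payload
  };
}
```

Both sides would hold views onto the same SharedArrayBuffer, so the worker can read the page bytes the moment the header's state slot signals READY.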

This is over my head (and the two-way communication is confusing unless I can code it until its right) as far as what's going on other than it seems like it satisfies the sync requirement with the worker and sqlite as you indicate.

So far I have no experience with SharedArrayBuffer & Atomics either. But maybe there are some examples out there solving a similar problem. If you don't push this forward, I'll certainly keep it on my todo list; but I am not sure when I'll find the time to look into it :\

The relevant code to implement new options for the worker file system is here https://github.com/jvail/spl.js/blob/main/src/pre.js. I have just copied part of the source from emscripten's WorkerFS and modified it with the XHR... just in case you'd like to give it a try.

ok, so I had a fork started so I updated it and will start blueprinting what I understand here: https://github.com/disarticulate/spl.js/blob/main/src/pre.js

It looks like the easiest way to start is assume the user passes in the request SharedArrayBuffer, get that working and then abstract that out so that they could submit a function and supply the range size and the offset chunks.

I'll come back when I have something that seems to make sense.

jvail commented

It looks like the easiest way to start is assume the user passes in the request SharedArrayBuffer, get that working and then abstract that out so that they could submit a function and supply the range size and the offset chunks.

Yes, but do we really need the function? I was hoping the offset/length request could be sent via the buffer itself.

I'll come back when I have something that seems to make sense.

Let me know if you need a hint/comment on my code in case it is not clear what's going on.

I was hoping the offset/length request could be sent via the buffer itself.

That'll be the base API. But it doesn't seem very user friendly. I was thinking a helper function could wrap a user function and produce the necessary SharedArrayBuffer, so the user just supplies the appropriate functions for slices and the size.

Atomics.wait appears to only operate on Int32Array (http://udn.realityripple.com/docs/Web/JavaScript/Reference/Global_Objects/Atomics/wait), and since neither of us really understands what it's doing, it'd be better to abstract it away for a user who just has a regular Uint8Array or whatever.

Anyway, that's a secondary step that can be discussed if we get it working with an API parallel to XHR.

OK. so I've laid out a blueprint here: https://github.com/disarticulate/spl.js/blob/main/src/pre.js

With a bunch of notes. It seems like we should do TDD: just testing and ensuring the code functions appropriately requires some setup, so it'd be best to document that in the test section.

On that, it doesn't look like the XHR is tested, which I guess would require a simple HTTP server that can handle range requests.

Should we create a separate test file for these two? We just want to ensure they're communicating remotely, and the code should be executable in any environment I suppose, though I didn't look at whether node does shared buffers.

jvail commented

OK. so I've laid out a blueprint here: https://github.com/disarticulate/spl.js/blob/main/src/pre.js

Thank you! Give me a bit of time to try to understand what's going on.

With a bunch of notes. It seems like we should do TDD, as setting up just testing out and ensuring the code functions appropriately requires some setup, so it'd be best to document that in the test section.

Maybe we can start with EDD (example driven development :) ) first. I need to have a better understanding of how it could practically look on the "main" thread - wrapping the sab in some kind of class/API to make it more convenient to use, as you suggested.

On that, it doesn't look like the XHR is tested, which I guess would require a simple HTTP server that can handle range requests.

Yes, tests for XHR are indeed missing.

Should we create a separate test file for these two? We just want to ensure they're communicating remotely, and the code should be executable in any environment I suppose, though I didn't look at whether node does shared buffers.

Maybe we can start by creating a little example reproducing the use case with a remote proj.db, using fetch on the main thread to populate the sab as requested. I am not sure how we can re-use the idea of requesting multiple pages at once as done in the XHR version, since we need to make sure the sab is always large enough. Maybe set an upper limit (multiple of page size?) for the length of the request range?

P.S.
I read a bit through the docs on Atomics because I was wondering how the main thread could be notified about the requested range. Luckily there is waitAsync ... but it is not supported in FF :/
So, I guess we would then need another worker to do the fetching ... seems quite complicated. What do you think @disarticulate, any idea?

Sadly, I thought I had scanned that page fully. It looks like Safari is also unsupported. Also, it looks like the code is missing the Atomics.notify(typedArray, index, count) operation, and we could simplify everything by just using BigInt64Array positions, which makes a lot of the code logic cleaner.

I've got plenty of experience wiring up workers, so that's not a big deal and I think a world filled with worker threads for the browser is the future. I've typically built universal workers to load whatever libraries I need. This ends up something like:

  1. Task runner api to initiate and own a worker in the main thread
  2. Separate "plugin" class that adds methods to the task runner that call functions to the worker libraries
  3. A worker is started similar to https://github.com/jvail/spl.js/blob/main/src/spl-worker.ts but much more generic: the worker code imports the libraries, classes, functions, etc. into an object; generic messages & args are passed into the worker, which looks up the corresponding function call, awaits as needed, then returns.

The magic in this system is a uuid where each function call into a worker is given a uuid, so you can load up as many workers as you have threads and locate the caller in main with the worker responding. One of the reasons the SAB seems interesting to me as it solves one of the issues I've had with parallel processing these calls and related caches.

With that said, I'd still start out using Atomics.waitAsync, as this thread https://bugzilla.mozilla.org/show_bug.cgi?id=1467846 suggests it's at least in the pipeline, and throw in an error/warning to start; we're still going to need a lot of the same code for the SAB, and it'd just be neater if everything could be waitAsync. According to https://caniuse.com/mdn-javascript_builtins_atomics_waitasync that satisfies 70% of browsers/users.

Obviously that's a lot easier said than done. The fallback then would be a worker. I assume we would add a call site into https://github.com/jvail/spl.js/blob/main/src/spl-worker.ts

What would a function call into spl-worker that uses Atomics.wait look like? Duplicate the necessary functionality there? It'd basically just be a blocking wait like:

const int32 = new Int32Array(sab);
const result = Atomics.wait(int32, SAB.READY_STATE_I32, SAB.READY_STATE_REQUEST_SIZE);
if (result === "ok") { /* ... */ } else if (result === "not-equal") { /* ... */ } else if (result === "timed-out") { /* ... */ }

It sounds like we'd need to load that worker separately as you indicated.

But I suppose I'm not deterred by adding workers. Writing all that up, it seems like a purpose-built worker that accepts a SAB and just runs wait loops would be cleaner; I just don't know how that fits in with the organization of the workers you already run.

jvail commented

With that said, I'd still start out using Atomics.waitAsync as this thread https://bugzilla.mozilla.org/show_bug.cgi?id=1467846 suggests it's atleast in the pipeline, and throw in a error/warning to start, as we're still going to need a lot of the same code for the SAB, it'd just be neater if everything could be waitAsync. According to https://caniuse.com/mdn-javascript_builtins_atomics_waitasync that's satisfy 70% of browsers/users.

Yes, why not keep it as an experimental feature (like the spl extensions) that would initially only work in Chrome? I'd also prefer not to start with additional workers, since that adds even more complexity. Still, someone could have a worker running spl in a sub-worker, right? Also, the extra worker would not help with your initial use case using WebRTC.

I'd guess that for the API on the main thread (some object with settings and the sab that we can send over to the worker) we need to expose some kind of "long-running Promise" (something like an observable https://rxjs.dev/guide/observable - this is not to suggest using the library) that then notifies the custom user code about the requested range. This API could check if waitAsync is available and, if not, use wait as a fallback, thereby forcing the user to move the code into a worker.

Could you move your code to some experimental-sab-fs (or whatever name you prefer) branch? Then I could play a bit with the code and support you with ideas etc.

OK, I had some other things to look at. Here's the fork with the current changes; I added the notify() to the Atomics code since it looks like it's required. A final version should clean up the unnecessary uint32 conversions and just use u64 for everything: https://github.com/disarticulate/spl.js/tree/3985d4c0a12c23a9253e7f1682add7f8db75d356

I implement the library typically in a worker, so as you suggest, a blocking fallback isn't a game killer.

I'll take a look at observables and probably polling as alternative implementations to missing the async version of waiting.

I've started a sab-helper.js in the same branch, which should implement a class providing a user-friendly interface, and can accept a function.

Alright, so how do I know what size to make the SAB on the client side? I can't figure out where the page size variable is set. It looks like one could resize the SAB, but what I understood is that there's a fixed size for the SAB, since we're just requesting pages in the range.

jvail commented

alright, so how do i know what size to make the SAB on the client side? I can't figure out where the page size variable is set.

We could have the pageSize as an optional parameter for the helper thing, and if not set we could fetch the SQLite header (the first 100 bytes, or a part of it), look up the page size, and then set the sab size to the page size, or to some multiple of the page size if we want to do some pre-fetching of pages.
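For the header lookup itself: the SQLite database header is 100 bytes, and the page size is stored as a big-endian 16-bit value at byte offset 16, where the special value 1 means 65536. A sketch (the function name is hypothetical):

```javascript
// Reads the page size from the first bytes of a SQLite database file.
// headerBytes must cover at least the first 18 bytes of the file.
function sqlitePageSize(headerBytes) {
  const view = new DataView(
    headerBytes.buffer, headerBytes.byteOffset, headerBytes.byteLength
  );
  const raw = view.getUint16(16, false); // big-endian per the file format
  return raw === 1 ? 65536 : raw;        // 1 encodes a 65536-byte page
}
```

So the helper could issue one small Range request for the header, call something like this, and size the sab accordingly before the worker starts querying.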

jvail commented

Hi @disarticulate,

not sure if you are still working on your backend/persistence solution. Apparently SQLite now provides its official wasm build. There is also an interesting section about persistence in the browser. Maybe helpful...

Yeah, it looks like even if we got SharedArrayBuffers working, there'd still be the browser issue of servers having to send the required cross-origin isolation headers.

Anyway, i got sidetracked, and not sure when I'll return to this portion of my project interest.

jvail commented

@disarticulate Feel free to re-open in case you pick up your project again.