usnistgov/h5wasm

Enable ROS3 Driver

garrettmflynn opened this issue · 17 comments

Is there a way to point to a ROS3 driver in the current implementation? There has been some interest in integrating my wrapper into DANDI instead of using the current cloud-based visualizer.

The ROS3 driver makes HTTP requests directly from the C code (using CURL), which is not really possible in WebAssembly. You can call back into JavaScript to make requests, or you can use the Emscripten Fetch API, but directly using the CURL libraries won't work as far as I can tell. A patched version of H5FDs3comms.c could be written that uses e.g. the Fetch API, but that would be a bit of work.

My hdf5-io library already has a basic wrapper for the JavaScript Fetch API implemented, if this is what you meant by the first option. Though it seemed like the functionality we're looking for could only be implemented on the HDF5 reader itself.

Since I'm still quite new to HDF5 and C / WASM, I'll see if I can get @satra to comment more about the requirements for integrating with DANDI.

satra commented

indeed it looks like there is a thread here to talk about curl and web assembly: WebAssembly/WASI#107

the main intent is to leverage the streamability of these HDF5 files on DANDI as opposed to downloading them into the browser (likely to be impossible except for the tiniest of files).

but the streamability relies on treating s3 as a range-getable filesystem, which is what the curl layer in hdf5 does. in the python world s3fs works similarly, and i know there have been some attempts in various contexts in javascript, but i don't know where they stand.

It is possible to build this directly into the Emscripten filesystem without ROS3... I have some POC code that combines lazyFile.ts from https://github.com/phiresky/sql.js with typescript-lru-cache and can lazy-load files with h5wasm in a WebWorker.

Having to call your HDF5 access code in a worker adds some complexity, of course. I'll try to clean up my code and publish to github soon.

It does require that the https server being contacted support range requests, though!
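
To make the range-request requirement concrete, here is a minimal sketch of the header handling involved. The helper names (rangeHeader, parseContentRange, isRangeHonoured) are hypothetical and not part of h5wasm or lazyFile.ts; the assumption is that a server honouring a range answers 206 with a matching Content-Range header.

```javascript
// Build the value of an HTTP Range header for bytes [start, endInclusive].
function rangeHeader(start, endInclusive) {
  if (!Number.isInteger(start) || !Number.isInteger(endInclusive) ||
      start < 0 || endInclusive < start) {
    throw new RangeError(`invalid byte range ${start}-${endInclusive}`);
  }
  return `bytes=${start}-${endInclusive}`;
}

// Parse a Content-Range response header such as "bytes 0-1023/476850135".
// Returns null when the header is missing or not in that form.
function parseContentRange(value) {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(value ?? "");
  if (!match) return null;
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    total: match[3] === "*" ? null : Number(match[3]),
  };
}

// Hypothetical check: the range was honoured if the status is 206 and the
// returned range starts where we asked and does not run past what we asked for
// (a server may return fewer bytes near the end of the file).
function isRangeHonoured(status, contentRange, start, endInclusive) {
  const parsed = parseContentRange(contentRange);
  return status === 206 && parsed !== null &&
    parsed.start === start && parsed.end <= endInclusive;
}
```

A server that ignores the Range header typically answers 200 with the whole body, which is what isRangeHonoured is meant to catch.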

@satra would know more about this than me. At first glance, though, it seems like it might work!

Thank you for the response, @bmaranville!

satra commented

i think this is less about the ROS3 driver per se, and more about exposing a remote http or s3 object as an in memory object or a streamable object. in python the s3fs library uses direct calls to expose an s3 object in memory handling the translation of in-memory access calls to requests behind the scenes.

the challenge here is that the hdf5 files can be really large: 10s to 100s of GB. thus the main requirement is that any reading happens in streaming mode.

There's a working demo at https://bmaranville.github.io/lazyFileLRU/ with source at https://github.com/bmaranville/lazyFileLRU

  • It is mounting a remote file in the Emscripten filesystem, where it fetches blocks on demand as disk "reads" occur.
  • You can choose the size of your LRU buffer, as well as the block size.
  • It's set to blocks of 1024 bytes by default just so you can watch the network activity and see it retrieving a bunch of blocks.
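
The block-on-demand idea in the list above can be sketched roughly as follows. This is not the lazyFileLRU source: BlockCache and its fetchBlock callback are hypothetical names, the callback is assumed to be synchronous (as a synchronous XHR in a worker would be), and a plain Map stands in for typescript-lru-cache.

```javascript
// Minimal LRU cache keyed by block index. A Map keeps insertion order,
// so its first key is always the least recently used block.
class BlockCache {
  constructor(blockSize, maxBlocks, fetchBlock) {
    this.blockSize = blockSize;
    this.maxBlocks = maxBlocks;   // assumed to be >= 1
    this.fetchBlock = fetchBlock; // (start, endInclusive) => Uint8Array
    this.blocks = new Map();
  }

  getBlock(index) {
    let block = this.blocks.get(index);
    if (block !== undefined) {
      this.blocks.delete(index); // re-inserted below as most recently used
    } else {
      const start = index * this.blockSize;
      block = this.fetchBlock(start, start + this.blockSize - 1);
      if (this.blocks.size >= this.maxBlocks) {
        // Evict the least recently used block.
        this.blocks.delete(this.blocks.keys().next().value);
      }
    }
    this.blocks.set(index, block);
    return block;
  }

  // Copy up to `length` bytes starting at `position`, one block at a time.
  read(position, length) {
    const out = new Uint8Array(length);
    let copied = 0;
    while (copied < length) {
      const pos = position + copied;
      const block = this.getBlock(Math.floor(pos / this.blockSize));
      const offset = pos % this.blockSize;
      const n = Math.min(length - copied, block.length - offset);
      if (n <= 0) break; // read ran past the end of the file
      out.set(block.subarray(offset, offset + n), copied);
      copied += n;
    }
    return out.subarray(0, copied);
  }
}
```

In this sketch each block is copied out before the next one is requested, so a single read may span more blocks than the cache holds. That is a design choice of the example, not a description of how the demo behaves.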

It's a very simplistic example - you can explore the contents of the file by clicking "load" to do the initial mount, then enter a path and click "get". There are still issues with it - the lazyFile algorithm ramps up the number of chunks fetched for large reads, and if the number of chunks being fetched exceeds the LRUsize then it stops working. (try getting the dataset at /60.0/DAS_logs/pointDetector/counts with the default settings, for instance)

I've spent some time playing around with the demo code, but I'm not able to get data from other URLs (e.g. "https://s3.us-east-2.amazonaws.com/hdf5ros3/GMODO-SVM01.h5", "https://dandiarchive.s3.amazonaws.com/blobs/43b/f3a/43bf3a81-4a0b-433f-b471-1f10303f9d35") because of CORS errors.

The latter URL works with pynwb / h5py on ROS3 mode.

Do either of you have intuitions about solving this issue?

@bmaranville Also, can you include the dependencies for building the source in https://github.com/bmaranville/lazyFileLRU?

Apologies - I forgot to include the package.json - it is there now. You should be able to do npm install and npm run build now.

As for the CORS errors, that is something that mostly has to be worked out on the server side. I had to add the following directives (for Apache):

Header always set Access-Control-Allow-Origin "*"
Header always set Access-Control-Allow-Headers "origin, x-requested-with, content-type, range"
Header always set Access-Control-Allow-Methods "GET, OPTIONS"
Header always set Access-Control-Expose-Headers "Accept-Ranges, Content-Encoding, Content-Length, Content-Range"

You also have to disable compression on the server, if you want to allow range requests. I added a flag "?gzip=false" and a corresponding rewrite-rule on the server to disable gzip, but you would do something else for S3 undoubtedly.
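
One quick way to see whether a server exposes the response headers a range-reading client wants is to compare its Access-Control-Expose-Headers value against a list of needed names. The helper below is a hypothetical sketch; the list of needed headers is an assumption based on the Apache directives above.

```javascript
// Return the header names from `needed` that are not listed in the
// Access-Control-Expose-Headers value (comparison is case-insensitive).
function missingExposedHeaders(exposeHeaderValue, needed) {
  const exposed = new Set(
    (exposeHeaderValue ?? "")
      .split(",")
      .map((name) => name.trim().toLowerCase())
      .filter(Boolean)
  );
  if (exposed.has("*")) return []; // wildcard exposes everything (no credentials)
  return needed.filter((name) => !exposed.has(name.toLowerCase()));
}

// Example: headers a range-reading client would like to see from script.
const wanted = ["Accept-Ranges", "Content-Encoding", "Content-Range"];
console.log(missingExposedHeaders(
  "Accept-Ranges, Content-Encoding, Content-Length, Content-Range",
  wanted
));
```

With the Expose-Headers value from the Apache configuration above, nothing is reported missing.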

satra commented

i'm not sure why this is showing a cors problem. here is an example cors test on a file on dandi:

$ curl -IXGET -H 'Origin: http://example.com' https://dandiarchive.s3.amazonaws.com/blobs/a4f/71d/a4f71d55-15e1-416b-b718-275a2fa470a7
HTTP/1.1 200 OK
x-amz-id-2: FJXd6TBWzopDmce+gl1QjyeR4pJQxSJyGaRZOHy/wZj+KCvqmp0g8/0fUketGUw6eVF4LXdui00=
x-amz-request-id: JH7NBFXNJXPVY1X8
Date: Tue, 03 May 2022 21:56:26 GMT
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: PUT, POST, GET, DELETE
Access-Control-Expose-Headers: ETag
Access-Control-Max-Age: 3000
Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
Last-Modified: Mon, 18 Apr 2022 14:39:52 GMT
ETag: "a5017194dcc664aeb1dcb9866199e142-8"
x-amz-version-id: phrH8qPN78ho5DvP0uZtmLtLi7YqK6tk
Accept-Ranges: bytes
Content-Type: binary/octet-stream
Server: AmazonS3
Content-Length: 476850135

The current implementation is making a HEAD request before any of the GET requests... Do you know if that is where it is failing?

Also I noticed that OPTIONS is not in the list of approved methods, and that might be needed for CORS

Edit: ah, I see you're making a HEAD request in your example so that's probably not it.

satra commented

it only supports GET at the moment. HEAD is missing. i'll add that to the setup on our side.

It looks like simple range requests (with only a single, well-defined range) will be supported without an OPTIONS pre-flight on most browsers - that would speed things up. whatwg/fetch#1312

@satra @bmaranville It's a quick and dirty solution, but I've forked the lazyFileLRU to add a fallback to an asynchronous GET request (using Fetch) that aborts after reading the file headers. The source code is at https://github.com/garrettmflynn/lazyFileLRU and a demo with a 5GB file from DANDI at https://garrettflynn.com/lazyFileLRU/.

clever!

@garrettmflynn I was trying your implementation and it was not falling back to "GET" because the xhr request for "HEAD" in the try {} block doesn't throw an error if the request fails at the server or is blocked by CORS restrictions in the browser (just returns status with error code). It could be converted into a simple if/else block instead of try/catch - this then worked for me:

    // can't set Accept-Encoding header :( https://stackoverflow.com/questions/41701849/cannot-modify-accept-encoding-with-fetch
    xhr.open("HEAD", url, false);
    // // maybe this will help it not use compression?
    // xhr.setRequestHeader("Range", "bytes=" + 0 + "-" + 1e12);
    xhr.send(null);
    if (xhr.status >= 200 && xhr.status < 400) {
      datalength = Number(
        xhr.getResponseHeader("Content-length")
      );

      hasByteServing = xhr.getResponseHeader("Accept-Ranges") === "bytes";
      encoding = xhr.getResponseHeader("Content-Encoding");
    }
    else {
      console.log("HEAD request failed... falling back to aborted GET");
      const controller = new AbortController();
      const signal = controller.signal;

      await fetch(url, { signal }).then(response => {
        datalength = Number(response.headers.get("Content-length"));
        hasByteServing = response.headers.get("Accept-Ranges") === "bytes";
        encoding = response.headers.get("Content-Encoding");
        controller.abort();
      }).catch(this.#ready.reject)
    }
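
Both branches above read the same three headers, so one possible tidy-up is to share that logic. The following is only a sketch under assumptions: fileInfoFromHeaders and probeWithGet are hypothetical names, and fetchImpl is an injectable stand-in for the global fetch so the fallback can be exercised without a network.

```javascript
// getHeader: (name) => string | null, for example
//   (name) => xhr.getResponseHeader(name)   or
//   (name) => response.headers.get(name)
function fileInfoFromHeaders(getHeader) {
  return {
    datalength: Number(getHeader("Content-Length")),
    hasByteServing: getHeader("Accept-Ranges") === "bytes",
    encoding: getHeader("Content-Encoding"),
  };
}

// Hypothetical fallback: start a GET, read only the headers, then abort
// so that the body is not downloaded.
async function probeWithGet(url, fetchImpl = fetch) {
  const controller = new AbortController();
  const response = await fetchImpl(url, { signal: controller.signal });
  try {
    if (!response.ok) {
      throw new Error(`GET ${url} failed with status ${response.status}`);
    }
    return fileInfoFromHeaders((name) => response.headers.get(name));
  } finally {
    controller.abort(); // runs whether or not the headers were usable
  }
}
```

The HEAD branch could then call fileInfoFromHeaders((name) => xhr.getResponseHeader(name)) when the status check passes, and fall back to probeWithGet(url) otherwise.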