jjjake/internetarchive

Running in a browser with Pyodide

rth opened this issue · 5 comments

rth commented

Following a discussion on Twitter, there was interest in seeing what it takes to run this package in the browser with Pyodide (which among other things would allow calling the Python API from JavaScript).

So for instance, if you try to install it from PyPI in the Pyodide REPL,

await micropip.install('internetarchive')

you would get an error about a missing wheel for docopt, since Pyodide currently only supports installation from wheels. This can be worked around by installing the dependencies explicitly,

await micropip.install(['internetarchive', 'six', 'requests', 'urllib3', 'charset_normalizer', 'idna', 'certifi', 'tqdm', 'jsonpatch', 'jsonpointer'], deps=False)
import internetarchive

which is sufficient to import the package.

If you actually try to use it you would get an error when trying to make an HTTP request,

from internetarchive import get_item
item = get_item('nasa')

The error is SSLError("Can\'t connect to HTTPS URL because the SSL module is not available.")

```
Traceback (most recent call last):
  File "/lib/python3.10/site-packages/urllib3/connectionpool.py", line 692, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/lib/python3.10/site-packages/urllib3/connectionpool.py", line 281, in _get_conn
    return conn or self._new_conn()
  File "/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1011, in _new_conn
    raise SSLError(
urllib3.exceptions.SSLError: Can't connect to HTTPS URL because the SSL module is not available.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lib/python3.10/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/lib/python3.10/site-packages/urllib3/connectionpool.py", line 815, in urlopen
    return self.urlopen(
  File "/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='archive.org', port=443): Max retries exceeded with url: /metadata/nasa (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/lib/python3.10/site-packages/internetarchive/session.py", line 553, in send
    r = super(ArchiveSession, self).send(request, **kwargs)
  File "/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/lib/python3.10/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: (MaxRetryError('HTTPSConnectionPool(host=\'archive.org\', port=443): Max retries exceeded with url: /metadata/nasa (Caused by SSLError("Can\'t connect to HTTPS URL because the SSL module is not available."))'), 'https://archive.org/metadata/nasa')
```
However, the reported cause is misleading. The actual issue is that the browser sandbox doesn't allow opening sockets, which is what the request is trying to do. Generally one has to use the JavaScript API to make network calls instead. These are wrapped in Pyodide via [`pyodide.http.open_url`](https://pyodide.org/en/stable/usage/api/python-api/http.html#pyodide.http.open_url) (sync) and [`pyodide.http.pyfetch`](https://pyodide.org/en/stable/usage/api/python-api/http.html#pyodide.http.pyfetch) (async).
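As a rough illustration of the two wrappers, here is a minimal sketch that fetches the same metadata URL seen in the traceback without going through requests. The helper names (`metadata_url`, `fetch_metadata_sync`, `fetch_metadata_async`) are made up for this example, and it assumes the endpoint returns JSON and permits cross-origin requests from the page. The `pyodide` imports sit inside the functions because that module only exists when running under Pyodide.

```python
import json


def metadata_url(identifier: str) -> str:
    """Build the metadata URL that appears in the traceback above."""
    return f"https://archive.org/metadata/{identifier}"


def fetch_metadata_sync(identifier: str) -> dict:
    """Synchronous fetch; open_url returns a text buffer (io.StringIO)."""
    from pyodide.http import open_url  # only importable inside Pyodide

    return json.load(open_url(metadata_url(identifier)))


async def fetch_metadata_async(identifier: str) -> dict:
    """Asynchronous fetch through the browser's fetch API."""
    from pyodide.http import pyfetch  # only importable inside Pyodide

    response = await pyfetch(metadata_url(identifier))
    return await response.json()
```

In the Pyodide REPL, which supports top-level `await`, the async variant could be called as `await fetch_metadata_async('nasa')`.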

So the solution is either to,

I'm not familiar with the internetarchive Python package. From a cursory glance, I would say that since you use the requests API quite extensively, it would take some work to make it run in Pyodide with these alternative versions of requests (and probably improving one of those libraries), but it's not impossible.

As to whether this makes sense as a replacement for a JS library, it's hard to say as I don't know your use case well.

If you have any questions let me know.

rth commented

Another constraint I forgot to mention is that JavaScript APIs only allow fetching text files synchronously, while binary files need to be fetched asynchronously in the main thread (or in a web worker, where the request can be sync). I'm not sure if you have a lot of binary files in the API or if it's mostly text/JSON based.
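To make the binary case concrete, here is a small sketch of an async download on the main thread using `pyodide.http.pyfetch`. The function name `fetch_bytes` and its `fetch` parameter are hypothetical; the parameter exists only so the function can be exercised outside the browser with a stand-in.

```python
async def fetch_bytes(url: str, fetch=None) -> bytes:
    """Fetch a binary resource asynchronously and return its raw bytes.

    `fetch` is a hypothetical injection point for this sketch; when it is
    omitted, pyodide.http.pyfetch is used (available only under Pyodide).
    """
    if fetch is None:
        from pyodide.http import pyfetch  # only importable inside Pyodide

        fetch = pyfetch
    response = await fetch(url)
    # FetchResponse.bytes() is itself a coroutine, so it must be awaited.
    return await response.bytes()
```

In the Pyodide REPL this would be used as `data = await fetch_bytes(url)`, with `url` pointing at whatever binary file is needed.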

Thanks for the mention. The goal of the pyodide-http package is to patch requests in such a way that packages like internetarchive work without changes (except for the `patch_all` call).
Of course there are some limitations when doing requests in the browser. Things like certificate checking are impossible from Python and are handled at the browser level. Also, some headers are not available without an Access-Control-Expose-Headers header. I haven't tried it, but I can imagine this causes issues with cross-origin cookies.
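For reference, the patching step could look roughly like the sketch below. The wrapper `enable_browser_http` is a made-up name for this example; the only pyodide-http call used is `pyodide_http.patch_all()`. The `sys.platform` check is there so the snippet is a no-op on regular CPython, where sockets work and no patching is needed.

```python
import sys


def enable_browser_http() -> bool:
    """Patch requests/urllib via pyodide-http when running under Pyodide.

    Returns True if the patch was applied, False on a regular interpreter.
    """
    if sys.platform != "emscripten":
        return False  # not running in Pyodide, nothing to patch
    import pyodide_http  # assumed to be installed already, e.g. via micropip

    pyodide_http.patch_all()
    return True


# Apply the patch before making any requests-based calls.
patched = enable_browser_http()
```

Whether internetarchive then works unmodified is not something this sketch demonstrates; it only shows where the patch call would go.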

> Another constraint I forgot to mention is that JavaScript APIs only allow fetching text files synchronously, while binary files need to be fetched asynchronously in the main thread (or in a web worker, where the request can be sync). I'm not sure if you have a lot of binary files in the API or if it's mostly text/JSON based.

This issue is solved in the latest version of pyodide-http. I added an example of fetching binary data in the main thread here: https://github.com/koenvo/pyodide-http/blob/main/tests/pyscript.html. The fix itself is here: https://github.com/koenvo/pyodide-http/blob/main/pyodide_http/_core.py#L47

A proper way to solve fetching binary data in the main thread is by using Atomics.wait (I think). More info about this approach can be found here: koenvo/pyodide-http#5

Thank you for helping! There has been a bunch of internal discussion at the Internet Archive about how to work with Pyodide and JavaScript in general (the async issue and requests).

Thanks @rth and @koenvo! This is helpful, I'll take a closer look and let you know if we have any questions!

Hi!

I tried to modify an example @jjjake made a couple weeks back - now it works with networking.

I haven't tested everything, but here is a demo: https://archive.org/~merlijn/pyia/pyodide-demo.html

The main problem seems to be that the internetarchive library tries to install/mount its own HTTP adapter; just stubbing it out makes things work.

I haven't tried to perform any write actions, but it seems like this can work.

I wrote this to stub out the call that sets the http adapter:

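import js  # Pyodide's proxy for the JavaScript global scope
import internetarchive.session

class CustomSession(internetarchive.session.ArchiveSession):
    # Override the method that mounts the library's own HTTP adapter
    # so that it does nothing.
    def mount_http_adapter(self, *args, **kwargs):
        print('no mount http adapter')

sess = CustomSession(None, "", False, {})

# js.code and js.output are presumably elements defined on the demo page.
i = sess.get_item(js.code.value)
js.output.value += str(i.exists) + chr(10)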