Could not retrieve a JSON-LD document from the URL.

Question

sharpaper opened this issue 4 years ago · 13 comments

For some reason, pyld is unable to fetch the context URL https://www.w3.org/ns/activitystreams. I reduced the issue to the following minimal example, which fails:

import pyld
doc = {
    "@context": [ 'https://www.w3.org/ns/activitystreams' ],
    "type": "Follow" }
doc = pyld.jsonld.expand(doc)

[...]
raise JSONDecodeError("Expecting value", s, err.value) from None
[...]
Dereferencing a URL did not result in a valid JSON-LD object.
[...]
Type: jsonld.InvalidUrl
Code: loading remote context failed
Details: {'url': 'https://www.w3.org/ns/activitystreams', 'cause': JsonLdError('Could not retrieve a JSON-LD document from the URL.')}

I could not debug the issue. It looks like pyld might be retrieving the context without the correct headers, but looking at the code I do see headers = { 'Accept': ...} defined in several places.
Can you guys please help me understand if this is a bug or if I'm not using the library correctly? Thanks!

Answer 1 · 2020-08-01T08:42:02.000Z

In case this is useful: when the "document loader" sends the .get() request to retrieve the remote context (see L63), the parameters are:

url = https://www.w3.org/ns/activitystreams
headers = {'Accept': 'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'}
kwargs = {}

Because the request also accepts HTML, the remote server selected HTML, and pyld therefore raises the error. curl-ing the URI with 'Accept': 'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5' (that is, with the HTML options removed from Accept) does work.
So I think the website might be at fault here, because application/ld+json should take precedence over text/html;q=0.8. On the other hand, why is pyld requesting HTML? Can HTML be removed from the Accept header somehow?
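For anyone who wants to try the same experiment from Python, here is a minimal sketch of stripping the HTML media types from an Accept value. The helper name drop_media_types is hypothetical (it is not part of pyld), and the naive comma split assumes no quoted parameter contains a comma, which holds for the header above.

```python
def drop_media_types(accept, unwanted=("text/html", "application/xhtml+xml")):
    """Return the Accept header value without the unwanted media types."""
    kept = []
    for item in accept.split(","):
        item = item.strip()
        if not item:
            continue
        # The media type is everything before the first ';' parameter.
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type not in unwanted:
            kept.append(item)
    return ", ".join(kept)


# The Accept value that pyld was observed to send in this thread.
PYLD_ACCEPT = (
    "application/ld+json;profile=http://www.w3.org/ns/json-ld#context, "
    "application/ld+json, application/json;q=0.5, "
    "text/html;q=0.8, application/xhtml+xml;q=0.8"
)

print(drop_media_types(PYLD_ACCEPT))
```

The printed value is the same header that worked with curl.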

Answer 2 · 2020-08-01T15:40:05.000Z

After changing the q value of text/html here to 0.5 (that is, text/html;q=0.5), the document is fetched correctly. With anything above 0.5 it doesn't work.

What the heck is going on? I don't understand.

Answer 3 · 2020-08-01T16:06:53.000Z

Can text/html and application/xhtml+xml be removed entirely from the Accept header? Why are they needed? Shouldn't objects always be retrieved with application/ld+json?

Answer 4 · 2020-08-01T16:56:14.000Z

W3C have had issues with their content negotiation setup before. Certainly, changing the priority for HTML could be a workaround, but the Accept header is fine.

text/html is included because a processor can extract JSON-LD from HTML, which is arguably where the bulk of JSON-LD on the web lives.
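To illustrate what "extract JSON-LD from HTML" means, here is a rough standard-library sketch that collects the contents of script elements typed application/ld+json. It is not pyld's own implementation; the class and function names are made up for this example, and the sample page is invented.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptExtractor(HTMLParser):
    """Collect parsed JSON from <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = dict(attrs).get("type") or ""
            # Compare only the media type, ignoring any parameters.
            if type_attr.split(";", 1)[0].strip().lower() == "application/ld+json":
                self._in_jsonld = True
                self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    parser = JsonLdScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Invented sample page with one embedded JSON-LD document.
SAMPLE = """<html><head>
<script type="application/ld+json">
{"@context": "https://www.w3.org/ns/activitystreams", "type": "Note"}
</script>
</head><body><p>Hello</p></body></html>"""

print(extract_jsonld(SAMPLE))
```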

@iherman might have a look at the server configuration for activitystreams.

Answer 5 · 2020-08-01T17:37:16.000Z

Unfortunately the text/html option is hard-coded into the load_document function, but this issue could be circumvented if issue #125 were fixed, as that would make it possible to configure custom headers when creating the document loader.

Answer 6 · 2020-08-01T18:08:40.000Z

OK, I was able to figure out a workaround with a custom loader, but issue #125 should really be fixed, because it would make this a lot simpler by letting users specify headers directly in jsonld.set_document_loader(jsonld.requests_document_loader(timeout=..., headers=...)).

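import pyld.jsonld
import pyld.documentloader.requests

def myloader(*args, **kwargs):
    # Build the stock requests-based loader, then wrap it.
    requests_loader = pyld.documentloader.requests.requests_document_loader(*args, **kwargs)

    def loader(url, options=None):
        # Avoid a shared mutable default and a KeyError when 'headers' is absent.
        options = options if options is not None else {}
        # Ask only for JSON-LD so the server cannot pick the HTML variant.
        options.setdefault('headers', {})['Accept'] = 'application/ld+json'
        return requests_loader(url, options)

    return loader

pyld.jsonld.set_document_loader(myloader())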

Answer 7 · 2020-08-03T04:43:52.000Z

The activitystreams.var file on the W3C site is as follows:

URI: activitystreams

URI: activitystreams.html
Content-Type: text/html

URI: activitystreams.jsonld
Content-Type: application/ld+json; qs=0.5

URI: activitystreams.jsonld
Content-Type: application/json; qs=0.4

this looks o.k. to me...

cc @gkellogg
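To make the qs values in that file easier to compare, here is a small sketch that reads the type map into (URI, media type, qs) tuples. The parser is a hypothetical helper written for this thread, not an Apache tool; it assumes that a variant without a qs parameter counts as 1.0 and that variants are separated by blank lines.

```python
import re

# The type map as posted above.
TYPE_MAP = """\
URI: activitystreams

URI: activitystreams.html
Content-Type: text/html

URI: activitystreams.jsonld
Content-Type: application/ld+json; qs=0.5

URI: activitystreams.jsonld
Content-Type: application/json; qs=0.4
"""


def parse_type_map(text):
    """Return a list of (uri, media_type, qs) tuples, one per variant."""
    variants = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            name, _, value = line.partition(":")
            fields[name.strip().lower()] = value.strip()
        if "content-type" not in fields:
            continue  # the leading "URI: activitystreams" block is not a variant
        media_type, *params = [p.strip() for p in fields["content-type"].split(";")]
        qs = 1.0  # assumed default when no qs parameter is given
        for param in params:
            key, _, val = param.partition("=")
            if key.strip().lower() == "qs":
                qs = float(val)
        variants.append((fields["uri"], media_type, qs))
    return variants


for variant in parse_type_map(TYPE_MAP):
    print(variant)
```

Read this way, the HTML variant carries the highest source quality (1.0), ahead of application/ld+json (0.5) and application/json (0.4).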

Answer 8 · 2020-08-03T05:38:45.000Z

Is it possible that some kind of weights are being evaluated before the response content type is chosen? I have no idea how these q and qs properties are used in practice, but maybe Apache is computing a "score" for each type? Something like this: if the request is

'Accept': 'application/ld+json;profile=http://www.w3.org/ns/json-ld#context, application/ld+json, application/json;q=0.5, text/html;q=0.8, application/xhtml+xml;q=0.8'

and the score is q x qs, then:

text/html = 0.8 x 1.0 = 0.8
application/ld+json = 1.0 x 0.5 = 0.5
application/json = 0.5 x 0.4 = 0.2

This would explain why any value in the Accept header for text/html greater than 0.5 would fail to retrieve the jsonld document, since N x 1.0 = N is always greater (and thus higher priority) than the application/ld+json "score" of 0.5.

If this hypothesis is true, then I must say that it's a very messy situation because the client cannot adjust its weights for every website.
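The arithmetic of this hypothesis can be checked with a short sketch. It assumes the score really is q x qs, takes the qs values from the .var file posted earlier, and ignores wildcards such as */*; parse_accept, scores and QS are names invented for this example.

```python
def parse_accept(accept):
    """Map each media type to the highest q value it is given in the header."""
    weights = {}
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1.0 when not stated
        for param in parts[1:]:
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                q = float(val)
        weights[media_type] = max(q, weights.get(media_type, 0.0))
    return weights


# Source quality (qs) per variant, taken from the .var file in this thread.
QS = {"text/html": 1.0, "application/ld+json": 0.5, "application/json": 0.4}


def scores(accept, qs=QS):
    """Hypothesised score per variant: q from the request times qs from the server."""
    q = parse_accept(accept)
    return {media_type: q.get(media_type, 0.0) * s for media_type, s in qs.items()}


ACCEPT = (
    "application/ld+json;profile=http://www.w3.org/ns/json-ld#context, "
    "application/ld+json, application/json;q=0.5, "
    "text/html;q=0.8, application/xhtml+xml;q=0.8"
)

result = scores(ACCEPT)
print(result)
print(max(result, key=result.get))
```

Under these assumptions text/html gets the top score for pyld's header, while a header without the HTML types leaves application/ld+json on top.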

Answer 9 · 2020-08-04T04:45:03.000Z

To be honest: I do not know either. Maybe somebody with a better knowledge of how Apache works can advise.

Answer 10 · 2020-08-04T08:58:49.000Z

I think my hypothesis is indeed true. After taking a look at the httpd source I found two RFCs (2295 and 2296) that say "The overall quality Q of a variant is the value Q = round5( qs * qt * qc * ql * qf )", where the q factors are various quality values. Note that the comments in the httpd source say that all the quality values are taken from the request headers except for qs. Then there is also the Apache Negotiation Algorithm, which I think may be a slightly modified version of the one described in the RFC; in any case, step 2.1 of the algorithm is literally "Multiply the quality factor from the Accept header with the quality-of-source factor for this variants media type, and select the variants with the highest value.".

So the bottom line is that this probably has to be fixed in pyld. In particular, I think it's an issue with the requests loader. If the requests loader cannot handle text/html responses, then it should either replace the header with its own "application/ld+json" or, as I said, #125 should be fixed so that the headers can be defined by the user when creating a new loader.
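As a rough check of the quoted formula against the 0.5 threshold reported earlier in the thread, the sketch below sweeps the q value of text/html. It assumes round5 means rounding to five decimal places, that every factor not involved here is 1.0, and that the qs values are the ones from the .var file; the function names are made up, and this is not how httpd itself is written.

```python
def round5(value):
    # Assumption: round5 rounds to five decimal places.
    return round(value, 5)


def overall_quality(qs, qt=1.0, qc=1.0, ql=1.0, qf=1.0):
    """Q = round5(qs * qt * qc * ql * qf), with unused factors left at 1.0."""
    return round5(qs * qt * qc * ql * qf)


def winner(html_q):
    """Compare the HTML variant (qs=1.0) with the JSON-LD variant (qs=0.5)."""
    html = overall_quality(qs=1.0, qt=html_q)
    jsonld = overall_quality(qs=0.5, qt=1.0)
    if html > jsonld:
        return "text/html"
    if html < jsonld:
        return "application/ld+json"
    return "tie"


for tenths in range(1, 11):
    q = tenths / 10
    print(q, winner(q))
```

In this model HTML wins for every q above 0.5 and the two variants tie at exactly 0.5. The formula alone does not say how a tie is broken; the observation earlier in the thread is that the JSON-LD document was returned at q=0.5.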

Answer 11 · 2024-02-05T20:15:26.000Z

@iherman I believe you fixed this at the W3C recently, iirc (or was it for a different context file)?

Answer 12 · 2024-02-06T08:05:37.000Z

> @iherman I believe you fixed this at the W3C recently, iirc (or was it for a different context file)?

Almost 😀. The settings have been changed, but not by me; the culprit is @pchampin

Answer 13 · 2024-02-07T08:21:59.000Z

> @iherman I believe you fixed this at the W3C recently, iirc (or was it for a different context file)?

I confirm that this has been fixed.