digitalbazaar/pyld

IRI expansion with missing `@base` does not conform to RFC 3986

RinkeHoekstra opened this issue · 2 comments

RFC 3986 section 5.1 specifies that relative URIs should be expanded against the document's base URI. In the absence of an explicit base, it prescribes an ordered set of steps for determining the base IRI of a given document:

5.1.1. Base URI Embedded in Content
5.1.2. Base URI from the Encapsulating Entity
5.1.3. Base URI from the Retrieval URI
5.1.4. Default Base URI

The current implementation in pyLD ignores the last two requirements. For 5.1.3 this is understandable, as the library only operates on a data payload. However, 5.1.4 is the catch-all that would ensure that @id values are always expanded to absolute IRIs.

In the absence of such a default, relative @id values in documents whose context does not explicitly specify a base are not expanded to absolute IRIs. As a result, the to_rdf function drops them when producing N-Quads output. This is a showstopper for RDFLib/rdflib#2308.

The JSON-LD spec does allow for a means to prevent expansion against a base by setting @base to null (see https://www.w3.org/TR/json-ld/#base-iri) but does not specify that null is the default.

This violates test t0060 of the JSON-LD expand test suite (input file expand/0060-in.jsonld).

The output should be something similar to (with a different application-specific base):

[
  {
    "@id": "https://w3c.github.io/json-ld-api/tests/document-relative",
    "@type": [ "https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld#document-relative" ],
    "http://example.com/vocab#property": [
      {
        "@id": "http://example.org/document-base-overwritten",
        "@type": [ "http://example.org/test/#document-base-overwritten" ],
        "http://example.com/vocab#property": [
          {
            "@id": "https://w3c.github.io/json-ld-api/tests/document-relative",
            "@type": [ "https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld#document-relative" ]
          },
          {
            "@id": "../document-relative",
            "@type": [ "#document-relative" ],
            "http://example.com/vocab#property": [ { "@value": "only @base is cleared" } ]
          }
        ]
      }
    ]
  }
]
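
The absolute @id and @type values in the expected output above are consistent with plain RFC 3986 reference resolution against the test document's URL. The following is a minimal sketch using only the Python standard library, not pyld's own code path; the `resolve` helper and the `DOC_URL` constant are hypothetical names introduced for illustration.

```python
from urllib.parse import urljoin

# Assumption: the document URL implied by the expected output above.
DOC_URL = "https://w3c.github.io/json-ld-api/tests/expand/0060-in.jsonld"


def resolve(reference, base):
    """Resolve a relative reference against a base IRI.

    With no base (None), the reference is returned unchanged, which
    mirrors the behaviour reported in this issue.
    """
    if base is None:
        return reference
    return urljoin(base, reference)


resolved_id = resolve("../document-relative", DOC_URL)
resolved_type = resolve("#document-relative", DOC_URL)
unresolved_id = resolve("../document-relative", None)

print(resolved_id)
print(resolved_type)
print(unresolved_id)
```

With a base available, both values become absolute; without one, the relative reference is passed through untouched.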

But the output of pyld is:

[
  {
    "@id": "../document-relative",
    "@type": [
      "#document-relative"
    ],
    "http://example.com/vocab#property": [
      {
        "@id": "http://example.org/document-base-overwritten",
        "@type": [
          "http://example.org/test/#document-base-overwritten"
        ],
        "http://example.com/vocab#property": [
          {
            "@id": "../document-relative",
            "@type": [
              "#document-relative"
            ]
          },
          {
            "@id": "../document-relative",
            "@type": [
              "#document-relative"
            ],
            "http://example.com/vocab#property": [
              {
                "@value": "only @base is cleared"
              }
            ]
          }
        ]
      }
    ]
  }
]

The resulting N-Quads output contains only a single triple:

<http://example.org/document-base-overwritten> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/test/#document-base-overwritten> .

This is not a duplicate of #143 as that issue is about a case where the @base is specified.

The problem appears to reside here:

pyld/lib/pyld/jsonld.py

Lines 3186 to 3202 in 316fbc2

# handle @base
if '@base' in ctx:
    base = ctx['@base']
    if base is None:
        base = None
    elif _is_absolute_iri(base):
        base = base
    elif _is_relative_iri(base):
        base = prepend_base(active_ctx.get('@base'), base)
    else:
        raise JsonLdError(
            'Invalid JSON-LD syntax; the value of "@base" in a '
            '@context must be a string or null.',
            'jsonld.SyntaxError', {'context': ctx},
            code='invalid base IRI')
    rval['@base'] = base
    defined['@base'] = True

In the absence of a @base entry (as opposed to an explicit null base, see https://www.w3.org/TR/json-ld/#base-iri), a default base needs to be set here.
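
One way to keep "no @base given" distinct from "@base explicitly set to null" is to apply the default only once, when the initial base is chosen, and to let contexts inherit, override, or clear it afterwards. The sketch below is an assumption about how that could look, not pyld's implementation: `initial_base`, `apply_context_base`, and `DEFAULT_BASE` are hypothetical names, and `urljoin` stands in for pyld's `prepend_base`.

```python
from urllib.parse import urljoin

# Hypothetical application-specific default base (RFC 3986 section 5.1.4).
DEFAULT_BASE = "https://app.example/doc"


def initial_base(option_base, default_base):
    """Pick the base for the initial context.

    A caller-supplied base wins; otherwise fall back to the default.
    """
    return option_base if option_base is not None else default_base


def apply_context_base(ctx, active_base):
    """Return the base in effect after processing one context dict."""
    if '@base' not in ctx:
        return active_base  # no @base key: inherit the active base
    base = ctx['@base']
    if base is None:
        return None  # explicit null: clear the base on purpose
    if not isinstance(base, str):
        raise ValueError('the value of "@base" must be a string or null')
    if active_base is None:
        return base  # nothing to resolve against
    # urljoin returns an absolute IRI unchanged and resolves a relative one.
    return urljoin(active_base, base)


base = initial_base(None, DEFAULT_BASE)
inherited = apply_context_base({}, base)
cleared = apply_context_base({'@base': None}, base)
overwritten = apply_context_base({'@base': 'http://example.org/test/'}, base)
nested = apply_context_base({'@base': 'sub/'}, overwritten)
```

Because the default is applied only at the start, a context that sets @base to null still disables expansion, which is the behaviour the JSON-LD spec describes for that case.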

I started wondering why the test suite doesn't pick this up, and the explanation is in the runtests.py file:

pyld/tests/runtests.py

Lines 259 to 264 in 316fbc2

# expand @id and input base
if 'baseIri' in manifest.data:
    data['@id'] = (
        manifest.data['baseIri'] +
        os.path.basename(str.replace(manifest.filename, '.jsonld', '')) + data['@id'])
    self.base = self.manifest.data['baseIri'] + data['input']

Because the manifest files specify a baseIri value, the tests always run with a base specified. This means the situation reported in this issue is never exercised by the test suite.

Rewriting the test is not an option: with an unspecified base IRI, the output is application-specific.