schema.org seems to break JSON::LD::Reader
markwilkinson opened this issue · 4 comments
Reader Class JSON::LD::Reader
#<JSON::LD::Reader:0x000055eb154712c8 @options={:encoding=>#<Encoding:UTF-8>, :validate=>false, :canonicalize=>false, :intern=>true, :prefixes=>{}, :base_uri=>nil}, @input=#<StringIO:0x000055eb15470b98>, @doc=#<StringIO:0x000055eb154707d8>>
FATAL Failed to parse input document: loading remote context failed: https://schema.org: undefined method `inner_html' for nil:NilClass: Called from /usr/local/bundle/gems/json-ld-3.1.0/lib/json/ld/reader.rb:90
If I .gsub the JSON-LD content to replace schema.org with the URL of the context file, the reader no longer chokes on the document.
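if formattype.to_s =~ /JSON::LD::Format/
  # Point schema.org references at the context file directly, bypassing content negotiation.
  # The dot is escaped so the pattern matches only the literal host name.
  body = body.gsub(%r{https?://schema\.org/?}, "https://schema.org/docs/jsonldcontext.json")
  $stderr.puts "new body\n\n#{body}"
end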
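A narrower variant of this workaround, sketched under assumptions: instead of substituting across the whole body, parse the JSON and rewrite only a top-level @context entry that is exactly the schema.org root. The helper name rewrite_schema_context is hypothetical, nested contexts are not handled, and the context URL is the one used in the workaround above.

```ruby
require 'json'

# Hypothetical helper: replace a bare schema.org @context with the direct context URL.
# Only the top-level @context is inspected; other schema.org URLs are left untouched.
def rewrite_schema_context(body, context_url: 'https://schema.org/docs/jsonldcontext.json')
  doc = JSON.parse(body)
  return body unless doc.is_a?(Hash) && doc.key?('@context')

  # Matches http(s)://schema.org with an optional trailing slash and nothing after it.
  schema_root = %r{\Ahttps?://schema\.org/?\z}
  swap = ->(ctx) { ctx.is_a?(String) && ctx.match?(schema_root) ? context_url : ctx }

  contexts = doc['@context']
  doc['@context'] = contexts.is_a?(Array) ? contexts.map(&swap) : swap.call(contexts)
  JSON.generate(doc)
end
```

Because the replacement is confined to @context, property values that happen to contain schema.org URLs are not altered.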
This workaround is fine, but I'm curious what the root cause of the problem is...??
I'll need more info to reproduce this. The error indicates you're getting HTML back when accessing http://schema.org, but content negotiation should ensure that you get back JSON-LD. The default document loader always uses an Accept header to prioritize application/ld+json, so you shouldn't get HTML when doing this.
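To check what the server actually returns, the content negotiation described here can be approximated with Ruby's standard Net::HTTP. This is a minimal sketch: the helper names are hypothetical, and the Accept value is illustrative rather than the exact header the json-ld document loader sends.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build (but do not send) a GET request that prefers JSON-LD.
def build_context_request(url)
  request = Net::HTTP::Get.new(URI.parse(url))
  # Illustrative q-values; the real loader's header may differ.
  request['Accept'] = 'application/ld+json, application/json;q=0.5'
  request
end

# Hypothetical helper: send the request and report the Content-Type that came back.
def context_content_type(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_context_request(url))
  end
  response['Content-Type']
end
```

Calling context_content_type('https://schema.org') and seeing text/html rather than application/ld+json would be consistent with the HTML response suspected above.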
Still, you shouldn't get such an error. It could be that you're using the fallback REXML HTML parser and not Nokogiri, and the behavior of at_xpath is subtly different.
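The inner_html error on nil suggests an element lookup returned nothing and the result was used without a check. The sketch below illustrates guarding that case. It is not the gem's loader code: a regular expression stands in for the HTML parser's at_xpath call so the example stays within the standard library, and extract_embedded_json_ld is a hypothetical name.

```ruby
# Hypothetical helper: pull embedded JSON-LD out of an HTML response body.
# A regex stands in for an HTML parser lookup purely to keep this sketch stdlib-only.
def extract_embedded_json_ld(html)
  pattern = %r{<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>}mi
  match = pattern.match(html)
  # Guard the "no matching element" case instead of calling a method on nil.
  raise ArgumentError, 'no application/ld+json script element in HTML response' if match.nil?

  match[1].strip
end
```

With a guard like this, an HTML page that carries no embedded JSON-LD produces a descriptive error instead of a NoMethodError.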
Really, it's not good for packages to hit schema.org as part of normal processing, and you should require the json-ld-preloaded gem, which will avoid all this.
But, I'd like to be able to reproduce the error, so please give me a minimal script that does so.
Here's a Dockerfile that reproduces the problem. The same thing happens with Distiller, where I cannot control which libraries are loaded, so it's a bigger problem there!
FROM ruby:2.7.0
ENV LANG="en_US.UTF-8" LANGUAGE="en_US:UTF-8" LC_ALL="C.UTF-8"
#RUN chmod a+r /etc/resolv.conf
RUN apt-get update -q
RUN apt-get install -y --no-install-recommends build-essential
RUN apt-get install -y --no-install-recommends libxml++2.6-dev libraptor2-0
RUN apt-get install -y --no-install-recommends libxslt1-dev locales software-properties-common
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*
RUN locale-gen en_US.UTF-8
RUN gem update --system
RUN gem install rdf-trig rdf-raptor xml-simple parseconfig json rdf-json json-ld rdf-trig rdf-turtle rdf-rdfa sparql xml-simple nokogiri parseconfig rest-client cgi
RUN wget http://go-fair.org
RUN rdf serialize --input-format rdfa --output-format turtle index.html
There is a bug on the schema.org site (schemaorg/schemaorg#2468): it does not return the JSON-LD context when the request includes a profile parameter.
There's also a bug in the document loader that leads to this particular error, but it wouldn't show up if schema.org was returning the appropriate content.
If you use the json-ld-preloaded gem, it will avoid this fetch, which is generally a good thing to do. An upcoming version of the linkeddata gem will pull it in, so it will be available with the "rdf" CLI if that gem is included. I will update json-ld to attempt to load it as well, so if you add it to the gem install line in your Dockerfile, it should work properly. I'll let you know when an updated release is made.
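In an application, "attempt to load it" can be sketched as an optional require that falls back gracefully when the gem is missing. This is an illustration only, not the code that went into json-ld. The helper name is hypothetical, and the require path json/ld/preloaded is assumed to be the gem's entry point.

```ruby
# Hypothetical helper: try to load preloaded contexts and report whether it worked.
# Assumes the json-ld-preloaded gem is loaded via 'json/ld/preloaded'.
def preloaded_contexts_available?
  require 'json/ld/preloaded'
  true
rescue LoadError
  # Gem not installed: contexts such as schema.org will be fetched over the network.
  false
end
```

A caller could log a warning when this returns false, since remote context fetches are then subject to problems like the one reported here.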
I released new versions of the json-ld (3.1.1) and linkeddata (3.1.1) gems, which require json-ld-preloaded and also fix the loading bug found here.
Note that without the preloaded context, this would continue to fail until schemaorg/schemaorg#2468 is addressed.