Only ASCII support in rdf:HTML datatype
tistre opened this issue · 6 comments
Thanks for this very useful tool! I’m trying to turn this RDFa into RDF/XML using scripts/localRDFa.py (note the Unicode ellipsis characters):
<!DOCTYPE html>
<html lang="en">
  <body prefix="schema: http://schema.org/">
    <div class="entry" resource="http://example.com/blog/1" typeof="schema:BlogPosting">
      <h2 property="schema:headline">Unicode is accepted here…</h2>
      <div property="schema:articleBody" datatype="rdf:HTML">… but not here!</div>
    </div>
  </body>
</html>
It fails with these error messages:
[digicol@timsdcxvm pyrdfa3-master]$ scripts/localRDFa.py -p /tmp/unicode.html
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 648, in graph_from_source
    return self.graph_from_DOM(dom, graph, pgraph)
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 501, in graph_from_DOM
    parse_one_node(topElement, default_graph, None, state, [])
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 67, in parse_one_node
    _parse_1_1(node, graph, parent_object, incoming_state, parent_incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 289, in _parse_1_1
    _parse_1_1(n, graph, object_to_children, state, incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 289, in _parse_1_1
    _parse_1_1(n, graph, object_to_children, state, incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 289, in _parse_1_1
    _parse_1_1(n, graph, object_to_children, state, incomplete_triples)
  File "/usr/lib/python2.6/site-packages/pyRdfa/parse.py", line 275, in _parse_1_1
    ProcessProperty(node, graph, current_subject, state, typed_resource).generate_1_1()
  File "/usr/lib/python2.6/site-packages/pyRdfa/property.py", line 126, in generate_1_1
    object = Literal(self._get_HTML_literal(self.node), datatype=HTMLLiteral)
  File "/usr/lib/python2.6/site-packages/rdflib-4.0.1-py2.6.egg/rdflib/term.py", line 564, in __new__
    _value, _datatype = _castPythonToLiteral(value)
  File "/usr/lib/python2.6/site-packages/rdflib-4.0.1-py2.6.egg/rdflib/term.py", line 1386, in _castPythonToLiteral
    return castFunc(obj), dType
  File "/usr/lib/python2.6/site-packages/rdflib-4.0.1-py2.6.egg/rdflib/term.py", line 1319, in _writeXML
    if s.startswith(b(u'<?xml version="1.0" encoding="utf-8"?>')):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 38: ordinal not in range(128)
Traceback (most recent call last):
  File "scripts/localRDFa.py", line 126, in <module>
    print processor.rdf_from_sources(value, outputFormat = format, rdfOutput = rdfOutput)
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 685, in rdf_from_sources
    self.graph_from_source(name, graph, rdfOutput)
  File "/usr/lib/python2.6/site-packages/pyRdfa/__init__.py", line 657, in graph_from_source
    if not rdfOutput : raise b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 38: ordinal not in range(128)
If I remove the Unicode ellipsis character from the schema:articleBody, the HTML parses fine. It doesn’t hurt in the schema:headline.
I don’t know Python (yet) so I’m reporting this here, hoping that someone has the time for a hopefully quick fix. Thanks for looking into this!
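For readers unfamiliar with this error, the numbers in the message can be reproduced with a minimal Python 3 sketch. It rests on an assumption that the thread does not confirm: that the string being tested in `_writeXML` is a UTF-8 byte string starting with the XML declaration, and that Python 2 tried to decode it as ASCII implicitly. The names `XML_DECL`, `serialized` and `ascii_decode_error` are made up for this illustration and are not part of pyRdfa or rdflib.

```python
# Sketch only: Python 3 never decodes implicitly, so the ASCII decode that
# Python 2 is assumed to have performed behind the scenes is done explicitly.
XML_DECL = b'<?xml version="1.0" encoding="utf-8"?>'

# Assumed shape of the serialized literal: the declaration followed by the
# UTF-8 encoded text from the example ("\u2026" is the ellipsis character).
serialized = XML_DECL + "\u2026 but not here!".encode("utf-8")


def ascii_decode_error(data):
    """Return the UnicodeDecodeError raised by decoding data as ASCII, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None


err = ascii_decode_error(serialized)
print(len(XML_DECL))        # the declaration is 38 bytes long
print(hex(serialized[38]))  # first byte of the UTF-8 encoded ellipsis
print(err)                  # same wording as the last line of the traceback
```

If that assumption holds, "position 38" is simply the first byte after the declaration, and 0xe2 is the first of the three UTF-8 bytes (E2 80 A6) that encode the ellipsis.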
Tim,
I am on vacation right now, so I cannot really look at it for another two weeks. However... I suspect I know the answer. As has been discussed on the core
RDFLib mailing list, the latest version of html5lib has a bug in handling Unicode characters. Unfortunately, until the html5lib people fix that, we have
to rely on an earlier version (I think it was 0.95) that worked without problems.
I will have to update the Readme file on the github repository; I will do that when I am back.
I hope this answers your question/issue (even if, I know, it is not a very nice situation...)
Thanks for your nice words on the tool itself!
Sincerely
Ivan Herman
Ivan Herman
Bankrashof 108
1183NW Amstelveen
The Netherlands
tel: +31-64-1044153
http://www.ivan-herman.net
Ivan,
thanks a lot for the quick reply. It’s not urgent, enjoy your vacation :-)
This page told me how to downgrade with “pip install html5lib==0.95”:
Even though “pip list” now says:
html5lib (0.95)
pyRdfa (3.4.3)
rdflib (4.0.1)
… the above example still fails for me. But I might be doing something wrong.
Kind regards,
Tim
Tim,
I have tried it on my local machine (which runs 0.95), and indeed there seems to be a problem. Let me look into this when I am back at work!
Cheers
Ivan
Sigh...
I hope I have handled it although, I must say, it is pretty much a hack,
because there are some mysterious things going on with the encoding of Unicode
strings, UTF-8 and all that mess. In Python 3 this ought to be much better.
If you use the version on git, it should be updated now. If you use
the service on the W3C web site, I will have to get back to the system guys to
make an update for me; that will not happen before next week...
Thanks!
Ivan
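As an illustration of why Python 3 tends to be less surprising here, the following sketch keeps bytes and text strictly apart. It is not the change that was committed to pyRdfa (the thread does not show it); `strip_xml_declaration` is a hypothetical helper written only for this example.

```python
def strip_xml_declaration(serialized: bytes) -> str:
    """Drop a leading XML declaration from UTF-8 bytes and return the rest as text."""
    decl = b'<?xml version="1.0" encoding="utf-8"?>'
    # Compare bytes with bytes, so no implicit decoding is ever needed.
    if serialized.startswith(decl):
        serialized = serialized[len(decl):]
    # Decode exactly once, with the encoding stated explicitly.
    return serialized.decode("utf-8")


data = '<?xml version="1.0" encoding="utf-8"?>\u2026 but not here!'.encode("utf-8")
print(strip_xml_declaration(data))

# Mixing the two types fails immediately with a TypeError instead of
# attempting a hidden ASCII decode.
try:
    data.startswith("<?xml")
    mixed_types_rejected = False
except TypeError as exc:
    mixed_types_rejected = True
    print("TypeError:", exc)
```

The design point is that the decode happens in one visible place with a named encoding, so non-ASCII content such as the ellipsis passes through unchanged.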
Thanks a lot, the latest git master branch works fine now!
[digicol@timsdcxvm pyrdfa3-master]$ scripts/localRDFa.py -p /tmp/unicode.html
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:schema="http://schema.org/"
>
  <schema:BlogPosting rdf:about="http://example.com/blog/1">
    <schema:articleBody rdf:datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML">… but not here!</schema:articleBody>
    <schema:headline xml:lang="en">Unicode is accepted here…</schema:headline>
  </schema:BlogPosting>
</rdf:RDF>
:-)
Ivan