semarglproject/semargl

RDFa 1.0 xmlns namespace not parsed ?

tfrancart opened this issue · 3 comments

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:eli="http://data.europa.eu/eli/ontology#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="XHTML+RDFa 1.0" lang="fr">
	<head>
		<title>xxx</title>
		<meta property="eli:passed_by" content="Foo" />
	</head>
	
	<body>
	</body>
</html>

Parsed with the following code:

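// Jena model that receives the extracted triples
Model model = ModelFactory.createDefaultModel();
// Pipeline: StreamProcessor -> RdfaParser -> JenaSink -> model
StreamProcessor streamProcessor = new StreamProcessor(RdfaParser.connect(JenaSink.connect(model)));

// Use the validator.nu HTML parser as the SAX XMLReader
nu.validator.htmlparser.sax.HtmlParser reader = new nu.validator.htmlparser.sax.HtmlParser(XmlViolationPolicy.ALTER_INFOSET);
streamProcessor.setProperty(StreamProcessor.XML_READER_PROPERTY, reader);

// htmlPage (presumably a java.net.URL) supplies both the stream and the base URI
streamProcessor.process(htmlPage.openStream(), htmlPage.toString());
return model;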

Returns:

<file:/home/thomas/temp/test.html>
        <eli:passed_by>  "Foo"@fr .

Note how the prefix "eli" is not resolved: the predicate comes out as <eli:passed_by> instead of <http://data.europa.eu/eli/ontology#passed_by>. Are prefix declarations using xmlns supported? Setting .setProperty(RdfaParser.RDFA_VERSION_PROPERTY, RDFa.VERSION_10) doesn't change anything.

Is there anything I could do in the code to parse the above HTML without changing it? If not, does anyone see which modifications need to be made to the XHTML above?

Thanks a lot!
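
On the question of which modifications to the XHTML might help: RDFa 1.1 defines a `prefix` attribute (a whitespace-separated list of `name: IRI` pairs) as an alternative to `xmlns:` declarations. The sketch below only builds that attribute value from the namespaces used in the page above. The class and method names are hypothetical, and whether the parser honors `prefix` while the document still declares version "XHTML+RDFa 1.0" is an assumption that has not been checked here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Semargl: formats namespace bindings
// as an RDFa 1.1 "prefix" attribute value.
public class RdfaPrefixAttribute {

    // Joins each binding as "name: IRI", separated by single spaces.
    public static String toPrefixAttribute(Map<String, String> namespaces) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> entry : namespaces.entrySet()) {
            joiner.add(entry.getKey() + ": " + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the declaration order of the original page.
        Map<String, String> namespaces = new LinkedHashMap<>();
        namespaces.put("eli", "http://data.europa.eu/eli/ontology#");
        namespaces.put("dcterms", "http://purl.org/dc/terms/");
        namespaces.put("xsd", "http://www.w3.org/2001/XMLSchema#");

        // The value would go on the root element: <html prefix="..."> 
        System.out.println("prefix=\"" + toPrefixAttribute(namespaces) + "\"");
    }
}
```

Because `prefix` is an ordinary attribute, an HTML parser has no reason to drop it the way it may drop `xmlns:` declarations, which is why it could be worth trying here.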

Actually, I think the problem is in nu.validator.htmlparser.sax.HtmlParser, which does not pass on the SAX events corresponding to the xmlns: declarations. The situation is a bit confusing because HTML, strictly speaking and as far as I can see, does not allow xmlns declarations other than the one for the html namespace. So I don't know what should happen when an alternate DTD is declared, as in this case.

The same happens when preprocessing the HTML using TagSoup as suggested in #37. TagSoup removes the xmlns declarations.
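
To make the point about SAX events concrete, here is a minimal sketch that uses only the JDK's namespace-aware XML parser (not the validator.nu parser or TagSoup) and records every `startPrefixMapping` event it receives. The class name `PrefixMappingProbe` is made up for illustration, and the sample markup is a shortened copy of the page above without the DOCTYPE, so that no external DTD is fetched. It shows what a namespace-aware XML reader reports for `xmlns:` declarations; it makes no claim about how RdfaParser consumes those events internally.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic class: collects the prefix mappings a SAX parser emits.
public class PrefixMappingProbe {

    // Parses the given XML and returns prefix -> namespace URI as reported
    // through startPrefixMapping (the default namespace has the prefix "").
    public static Map<String, String> collectPrefixes(String xml) {
        final Map<String, String> prefixes = new LinkedHashMap<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Without this, the JDK parser does not report prefix mappings.
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startPrefixMapping(String prefix, String uri) {
                        prefixes.put(prefix, uri);
                    }
                });
        } catch (Exception e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Could not parse input", e);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        String xhtml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\""
            + " xmlns:eli=\"http://data.europa.eu/eli/ontology#\" lang=\"fr\">"
            + "<head><title>xxx</title>"
            + "<meta property=\"eli:passed_by\" content=\"Foo\" /></head>"
            + "<body></body></html>";
        System.out.println(collectPrefixes(xhtml));
    }
}
```

Plugging a similar recording handler behind the HTML parser or TagSoup would be one way to confirm whether the "eli" mapping is ever reported before the events reach the RDFa parser.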

Hello,
I am getting a compile error at the "JenaSink.connect(model)" call. The error says: "The method connect(com.hp.hpl.jena.rdf.model.Model) in the type JenaSink is not applicable for the arguments (org.apache.jena.rdf.model.Model)"
Could you please help me with this problem?