dbpedia/extraction-framework

Invalid percent encoding

elad-shaked opened this issue · 3 comments

Line 184839 in https://downloads.dbpedia.org/repo/lts/transition/links/2019.02.01/links_domain=yago_lang=en.nt.bz2

<http://dbpedia.org/resource/555%> <http://www.w3.org/2002/07/owl#sameAs> <http://yago-knowledge.org/resource/555%25> .

The subject IRI ends with a bare percent character that is not followed by the two hex digits a percent-encoding requires.
Validation fails at:
http://akswnc7.informatik.uni-leipzig.de:8088/
http://sparql.org/iri-validator.html

But succeeds at:
http://ttl.summerofcode.be/
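
For anyone who wants to scan a dump for the same problem, here is a minimal sketch of the check. It assumes the only rule of interest is that a `%` must be followed by two hex digits (the pct-encoded rule in RFC 3986/3987). The function name `has_invalid_percent` is made up for illustration and is not part of the extraction framework.

```python
import re

# A '%' is valid only as the start of a pct-encoded triplet,
# i.e. when it is followed by exactly two hexadecimal digits.
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_invalid_percent(iri: str) -> bool:
    """Return True if the IRI contains a '%' not followed by two hex digits."""
    return _BARE_PERCENT.search(iri) is not None


# The two IRIs from the reported triple.
subject = "http://dbpedia.org/resource/555%"
obj = "http://yago-knowledge.org/resource/555%25"

print(has_invalid_percent(subject))  # True
print(has_invalid_percent(obj))      # False
```

This only covers the percent rule; it is not a full IRI validator, so a strict parser may still reject an IRI for other reasons.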

Hi, thank you. You are right, this is not correct. For transition there is no parsing enabled at the moment, since these are legacy artifacts kept for reference. I am leaving the issue open because it needs to be clarified whether this triple is still extracted in the new releases.

Can we consider https://en.wikipedia.org/wiki/555% a correct IRI, or must the percent sign also be encoded as %25 in it?
It seems to me that this problem is fixed for dbr triples, but there are still triples with an unencoded percent sign:

http://en.wikipedia.org/wiki/555% | http://xmlns.com/foaf/0.1/primaryTopic | http://dbpedia.org/resource/555%25
http://en.wikipedia.org/wiki/555% | http://purl.org/dc/elements/1.1/language | en 
http://en.wikipedia.org/wiki/555% | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://xmlns.com/foaf/0.1/Document

As I said, it is incorrect; it should not be extracted like this but escaped. The strategy is to not escape any Unicode character unless it violates the IRI standard (which it does in this case).
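
A rough sketch of that strategy, not the framework's actual code: leave Unicode characters alone and rewrite only a `%` that does not start a valid two-hex-digit escape. The helper name `escape_bare_percent` is hypothetical. It also assumes the input is an already serialized IRI; when encoding from a raw page title, every literal `%` would need to become `%25`, since a title such as `100%25` cannot be told apart from an escape after the fact.

```python
import re

# Matches a '%' that is NOT followed by two hexadecimal digits.
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_bare_percent(iri: str) -> str:
    """Replace each bare '%' with '%25'; leave valid escapes and Unicode untouched."""
    return _BARE_PERCENT.sub("%25", iri)


print(escape_bare_percent("http://en.wikipedia.org/wiki/555%"))
# http://en.wikipedia.org/wiki/555%25
print(escape_bare_percent("http://dbpedia.org/resource/555%25"))
# http://dbpedia.org/resource/555%25  (already escaped, unchanged)
print(escape_bare_percent("http://dbpedia.org/resource/Café_100%"))
# http://dbpedia.org/resource/Café_100%25  (Unicode kept as-is)
```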