Unexpected characters
U+FFFD
at line 22630305
in generic/infobox-properties/2020.10.01/infobox-properties_lang=en.ttl.bz2
<http://dbpedia.org/resource/Post-m�odern_art>
U+FFF9
at line 15457785
in generic/labels/2020.10.01/labels_lang=en.ttl.bz2
<http://dbpedia.org/resource/
U+FFFD
at line 1036836
in mappings/mappingbased-literals/2020.10.01/mappingbased-objects_lang=en.ttl
<http://dbpedia.org/resource/Post-m�odern_art>
Hi, Lissandrini @kuzeko. Could you please specify the problem you've found? I tried to trace back the files mentioned in the issue, but they seem very confusing. I'd be glad to help if you can share more information.
Hi @StuartCHAN ,
the files mentioned contain invalid characters at the provided lines and thus fail to be imported by, e.g., Jena.
The files can be downloaded from the official repo at https://downloads.dbpedia.org/repo/dbpedia, e.g.,
https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2020.10.01/infobox-properties_lang=en.ttl.bz2
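To make the report easier to reproduce, here is a minimal sketch of how the offending lines could be located in a downloaded dump. The function names `find_suspect_lines` and `scan_bz2` are hypothetical helpers written for this comment, not part of any DBpedia tooling, and the sketch assumes the dump is valid UTF-8 (a decoding error is raised otherwise).

```python
import bz2

# Code points reported in this issue: U+FFFD (replacement character)
# and U+FFF9 (interlinear annotation anchor).
SUSPECT_CHARS = {"\ufffd", "\ufff9"}


def find_suspect_lines(lines, suspects=SUSPECT_CHARS):
    """Yield (line_number, code_point, line) for every suspect character.

    `lines` is any iterable of text lines; numbering starts at 1.
    """
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if ch in suspects:
                yield number, f"U+{ord(ch):04X}", line.rstrip("\n")


def scan_bz2(path):
    """Stream a .ttl.bz2 dump line by line without unpacking it to disk."""
    # Strict decoding is the default, so genuinely malformed bytes raise
    # UnicodeDecodeError instead of being silently turned into U+FFFD.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from find_suspect_lines(handle)
```

For example, `for hit in scan_bz2("infobox-properties_lang=en.ttl.bz2"): print(hit)` would print one tuple per occurrence, assuming the file has been downloaded to the working directory.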
In the case of <http://dbpedia.org/resource/Post-m�odern_art>, it seems to me that the problem is in the Wikipedia articles, which contain those invalid characters. If we check the source of the article (https://en.wikipedia.org/wiki/Ahmed_Mater) from which this data was extracted, we can see that the character � is in it:
{{Infobox artist
| name = Ahmed Mater
| image = AHMED MATER SAUDIARABIA 2004.jpg
| caption = Ahmed Mater
| birth_date = {{Birth date and age|1979|7|25}}
| birth_place = [[Abha]], Saudi Arabia
| nationality = [[Saudi Arabia|Saudi]]
| movement = [[Post-m�odern art]]; [[Hurufiyya movement]]
| patrons =
| field = [[Conceptual art]], [[installation art]], painting
This is the source of the error. The underlying issue is that the script is missing URL sanitization; without it, the exported data is invalid.
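As a rough illustration of what such sanitization might look like, the sketch below rejects link targets that contain characters from the Unicode Specials block (U+FFF0 to U+FFFF, which covers both U+FFFD and U+FFF9) or control, surrogate, and unassigned code points. The names `has_unsafe_char` and `title_to_iri` are hypothetical, and the title-to-IRI step is deliberately simplified; the extraction framework's real encoding rules are more involved and are not reproduced here.

```python
import unicodedata

# Categories that should never appear in a resource IRI:
# Cc = control, Cs = surrogate, Cn = unassigned / noncharacter.
UNSAFE_CATEGORIES = ("Cc", "Cs", "Cn")


def has_unsafe_char(title: str) -> bool:
    """Return True if the wiki link target contains a character we refuse to export."""
    for ch in title:
        # Unicode "Specials" block, including U+FFF9 and U+FFFD.
        if "\ufff0" <= ch <= "\uffff":
            return True
        if unicodedata.category(ch) in UNSAFE_CATEGORIES:
            return True
    return False


def title_to_iri(title: str, base: str = "http://dbpedia.org/resource/"):
    """Build a resource IRI from a link target, or return None if it is unsafe.

    Simplified assumption: spaces become underscores and nothing else is encoded.
    """
    if has_unsafe_char(title):
        return None
    return base + title.strip().replace(" ", "_")
```

Dropping the triple (returning `None`) rather than percent-encoding the character is a design choice in this sketch: an encoded U+FFFD would be syntactically valid but would still point at a resource that does not exist.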
This has been fixed in Wikipedia: https://en.wikipedia.org/wiki/Ahmed_Mater
A test for � / U+FFFD should be added to construct validation testing.
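A minimal sketch of such a test, written with Python's `unittest` purely for illustration (the project's own validation tests may be structured quite differently). The helper `offending_iris` and the sample triples are assumptions made up for this example; it only checks IRIs enclosed in angle brackets on a single N-Triples-style line.

```python
import re
import unittest

# Anything between angle brackets is treated as an IRI.
IRI_PATTERN = re.compile(r"<([^<>]*)>")
# The two code points reported in this issue.
FORBIDDEN = re.compile("[\ufff9\ufffd]")


def offending_iris(line: str) -> list:
    """Return the IRIs on this line that contain U+FFF9 or U+FFFD."""
    return [iri for iri in IRI_PATTERN.findall(line) if FORBIDDEN.search(iri)]


class ReplacementCharacterTest(unittest.TestCase):
    SUBJECT = "<http://dbpedia.org/resource/Ahmed_Mater>"
    PREDICATE = "<http://dbpedia.org/property/movement>"

    def test_clean_triple_passes(self):
        line = f"{self.SUBJECT} {self.PREDICATE} <http://dbpedia.org/resource/Hurufiyya_movement> ."
        self.assertEqual(offending_iris(line), [])

    def test_replacement_character_is_reported(self):
        bad = "http://dbpedia.org/resource/Post-m\ufffdodern_art"
        line = f"{self.SUBJECT} {self.PREDICATE} <{bad}> ."
        self.assertEqual(offending_iris(line), [bad])
```

Saved as a module, this can be run with `python -m unittest`.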