dbpedia/extraction-framework

Unexpected characters


U+FFFD at line 22630305 in generic/infobox-properties/2020.10.01/infobox-properties_lang=en.ttl.bz2

<http://dbpedia.org/resource/Post-m�odern_art>

U+FFF9 at line 15457785 in generic/labels/2020.10.01/labels_lang=en.ttl.bz2

<http://dbpedia.org/resource/

U+FFFD at line 1036836 in mappings/mappingbased-literals/2020.10.01/mappingbased-objects_lang=en.ttl

<http://dbpedia.org/resource/Post-m�odern_art>

Hi, Lissandrini @kuzeko. Could you please describe the problem you found more precisely? I tried to trace back the files mentioned in the issue, but the references are confusing. I'd be glad to help if you can share more information.

Hi @StuartCHAN ,

the files mentioned contain invalid characters at the provided lines and thus fail to be imported by, e.g., Jena.
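
To reproduce the line numbers above, something like the following can be used. It is only a minimal sketch: the names `SUSPECT`, `find_suspect_chars` and `scan_bz2` are made up for illustration, and the set of flagged code points (U+FFF9 to U+FFFB and U+FFFD) is an assumption based on the characters reported in this issue, not a complete list of what Jena rejects.

```python
import bz2
from typing import Iterable, Iterator, Tuple

# Assumption: only the code points seen in this issue (plus the two
# neighbouring interlinear annotation controls) are flagged.
SUSPECT = {0xFFF9, 0xFFFA, 0xFFFB, 0xFFFD}


def find_suspect_chars(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, 'U+XXXX', line) for every suspect code point."""
    for lineno, line in enumerate(lines, start=1):
        for ch in line:
            if ord(ch) in SUSPECT:
                yield lineno, f"U+{ord(ch):04X}", line.rstrip("\n")


def scan_bz2(path: str) -> Iterator[Tuple[int, str, str]]:
    """Stream a .ttl.bz2 dump and report suspect characters.

    Decoding is strict on purpose: with errors="replace" Python would insert
    U+FFFD itself for broken bytes and hide where the character came from.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        yield from find_suspect_chars(handle)
```

Calling `scan_bz2("infobox-properties_lang=en.ttl.bz2")` on a locally downloaded copy and printing each result should list the offending lines without unpacking the whole file to disk.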

The files can be downloaded from the official repo at https://downloads.dbpedia.org/repo/dbpedia, e.g.,

https://downloads.dbpedia.org/repo/dbpedia/generic/infobox-properties/2020.10.01/infobox-properties_lang=en.ttl.bz2

In the case of <http://dbpedia.org/resource/Post-m�odern_art> it seems to me that the problem is in the Wikipedia article itself, because it contains the invalid character. If we check the source of the article (https://en.wikipedia.org/wiki/Ahmed_Mater) from which this data was extracted, we can see the character in it:

{{Infobox artist
| name          = Ahmed Mater
| image         = AHMED MATER SAUDIARABIA 2004.jpg
| caption       = Ahmed Mater
| birth_date    = {{Birth date and age|1979|7|25}}
| birth_place   = [[Abha]], Saudi Arabia
| nationality   = [[Saudi Arabia|Saudi]]
| movement      = [[Post-m�odern art]]; [[Hurufiyya movement]]
| patrons       = 
| field         = [[Conceptual art]], [[installation art]], painting

This is the source of the error. The remaining issue is that the extraction script does not sanitize URLs, so the exported data is invalid.

@Vehnem , do we need to remove characters like "�" during the post-processing?
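
For discussion, here is a rough sketch of what such a post-processing step could look like. It is not the framework's actual code (which is written in Scala); the names `strip_bad_chars` and `drop_bad_lines` are hypothetical, and the character set is again an assumption taken from this issue. Two options are shown: stripping the characters, or dropping the affected lines entirely.

```python
import re
from typing import Iterable, Iterator

# Assumption: U+FFF9..U+FFFB and U+FFFD are the characters to get rid of.
_BAD = re.compile("[\ufff9-\ufffb\ufffd]")


def strip_bad_chars(line: str) -> str:
    """Option 1: remove the characters and keep the rest of the line."""
    return _BAD.sub("", line)


def drop_bad_lines(lines: Iterable[str]) -> Iterator[str]:
    """Option 2: skip every line that contains one of the characters."""
    for line in lines:
        if not _BAD.search(line):
            yield line
```

Stripping turns `Post-m�odern_art` into `Post-modern_art`, which happens to be a sensible resource name here, but in general a stripped IRI may not correspond to any existing page. Dropping the line loses a triple but never invents an IRI, so which option fits better is a design decision.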

This has been fixed in Wikipedia: https://en.wikipedia.org/wiki/Ahmed_Mater
A test for � / U+FFFD should be added to construct validation testing.
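
As an illustration of what such a check could assert, here is a small Python sketch. It is not tied to the framework's test harness; `IRI_PATTERN`, `iris_with_replacement_char` and the test function are hypothetical names, and the sample triples are made up from the resources mentioned in this issue.

```python
import re
from typing import List

# Simplified IRI matcher for N-Triples-style lines: anything between < and >.
IRI_PATTERN = re.compile(r"<([^<>\s]*)>")
REPLACEMENT_CHAR = "\ufffd"


def iris_with_replacement_char(line: str) -> List[str]:
    """Return every IRI on the line that contains U+FFFD."""
    return [iri for iri in IRI_PATTERN.findall(line) if REPLACEMENT_CHAR in iri]


def test_replacement_char_in_iris() -> None:
    # Illustrative triples, not copied from the dump.
    good = ("<http://dbpedia.org/resource/Ahmed_Mater> "
            "<http://dbpedia.org/property/movement> "
            "<http://dbpedia.org/resource/Hurufiyya_movement> .")
    bad = ("<http://dbpedia.org/resource/Ahmed_Mater> "
           "<http://dbpedia.org/property/movement> "
           "<http://dbpedia.org/resource/Post-m\ufffdodern_art> .")
    assert iris_with_replacement_char(good) == []
    assert iris_with_replacement_char(bad) == [
        "http://dbpedia.org/resource/Post-m\ufffdodern_art"
    ]
```

A real test would run this kind of check over the extracted output and fail as soon as the returned list is non-empty.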