dbpedia/extraction-framework

User friendly Linked Data for HTTPS identifiers

namedgraph opened this issue · 9 comments

Issue validity

Live data on dbpedia.org.

Error Description

There is an http:// vs. https:// mismatch between requested URIs and the URIs in the data.

Details

Originally reported here: https://sourceforge.net/p/dbpedia/mailman/message/37362683/

The server forces https:// URLs:

$ curl -I -H "Accept: text/turtle" http://dbpedia.org/resource/Copenhagen
HTTP/1.1 303 See Other
Server: nginx/1.18.0
Date: Thu, 07 Oct 2021 09:11:29 GMT
Content-Type: text/html
Content-Length: 153
Connection: keep-alive
Location: https://dbpedia.org/resource/Copenhagen
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: HEAD, GET, POST, OPTIONS
Access-Control-Allow-Headers: Depth,DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Accept-Encoding

But the returned RDF data contains http:// URIs:

$ curl -o - https://dbpedia.org/data/Copenhagen.ttl
@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix dbr:    <http://dbpedia.org/resource/> .
<http://dbpedia.org/resource/2011\u201312_West_Ham_United_F.C._season>
 dbo:wikiPageWikiLink    dbr:Copenhagen .
<http://dbpedia.org/resource/AEK_Athens_F.C._in_European_football>
 dbo:wikiPageWikiLink    dbr:Copenhagen .
dbr:Adform      dbo:wikiPageWikiLink    dbr:Copenhagen .
dbr:Helena_Paparizou    dbo:wikiPageWikiLink    dbr:Copenhagen .
dbr:MS_Jutlandia        dbo:wikiPageWikiLink    dbr:Copenhagen .

Another example, this time requesting https://:

$ curl -L -OJ -H "Accept: text/turtle" https://dbpedia.org/resource/Copenhagen
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   153  100   153    0     0    725      0 --:--:-- --:--:-- --:--:--   725
100  675k  100  675k    0     0  1139k      0 --:--:-- --:--:-- --:--:-- 3235k
curl: Saved to filename 'sparql_2021-10-29_10-31-22Z.ttl'

$ cat sparql_2021-10-29_10-31-22Z.ttl
@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix dbr:    <http://dbpedia.org/resource/> .
dbr:Vivi_Bach   dbo:birthPlace  dbr:Copenhagen .
...

Hi @namedgraph (nice user name, by the way), can you please elaborate on why you think the DBpedia Linked Data interface is broken? I consider the HTTPS URL of a resource to be just a special "generic document" that describes the non-information URI (NIR), aka the "resource ID". See the image below from Cool URIs. In other words, our resource IDs are non-HTTPS. HTTPS is just used as a (mandatory, though this might be discussed) security layer.

[image: diagram from Cool URIs]

Linked Data is about self-describing resources.
If http://dbpedia.org/resource/Copenhagen is requested, RDF data with http://dbpedia.org/resource/Copenhagen in the subject position (and possibly additional resource descriptions) should be returned.
If https://dbpedia.org/resource/Copenhagen is requested, RDF data about https://dbpedia.org/resource/Copenhagen should be returned.
http://dbpedia.org/resource/Copenhagen and https://dbpedia.org/resource/Copenhagen are two distinct resources in RDF since their URIs differ.

As my examples show, when http:// is requested, the server redirects to https:// but then returns data about http:// anyway.
When https:// is requested, the data is still about http://.

See the email thread for more details.
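The string-identity point above can be illustrated with a minimal Python sketch. It is only an illustration, not part of DBpedia or of any RDF library: the helper name `differs_only_by_scheme` is hypothetical, and plain string comparison stands in for how RDF compares URIs.

```python
from urllib.parse import urlsplit


def differs_only_by_scheme(uri_a: str, uri_b: str) -> bool:
    """Hypothetical helper: True when one URI is http, the other https,
    and everything after the scheme is identical."""
    a, b = urlsplit(uri_a), urlsplit(uri_b)
    schemes_differ = {a.scheme, b.scheme} == {"http", "https"}
    rest_matches = a[1:] == b[1:]  # netloc, path, query, fragment
    return schemes_differ and rest_matches


requested = "https://dbpedia.org/resource/Copenhagen"
described = "http://dbpedia.org/resource/Copenhagen"

# URIs are compared as strings, so these name two different resources.
print(requested == described)                        # False
# Yet the only difference between them is the scheme.
print(differs_only_by_scheme(requested, described))  # True
```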

@namedgraph there seems to be still a lot of confusion here.

From an RDF perspective, https://dbpedia.org/resource/Berlin does not exist as a resource. It is only the URL of the generic document that delivers the description (of http://dbpedia.org/resource/Berlin). So far, we don't use https-based RDF resource identifiers for the simple reason you mentioned (string identity in RDF). So again, http://dbpedia.org/resource/ is the RDF namespace and https://dbpedia.org/resource/ is not an RDF namespace (these https URIs should never occur in any kind of RDF data, and therefore should never be looked up by any Linked Data client directly!). To be clearer, let's look again at the Alice example from above, which translates to the following.

http://dbpedia.org/resource/Berlin ~ http://www.example.com/id/alice
https://dbpedia.org/resource/Berlin ~ http://www.example.com/doc/alice
https://dbpedia.org/data/Berlin.ttl ~ http://www.example.com/doc/alice.rdf

I see that this might not be so clear at first glance, since both namespaces look very similar and are not as explicitly different as in the Cool URIs example.
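The mapping above can be written down as a small Python sketch. It assumes only the three URL patterns listed in this comment; the function name `related_urls` and the dictionary keys are hypothetical and chosen for illustration.

```python
RESOURCE_NS = "http://dbpedia.org/resource/"  # the RDF namespace (resource IDs)


def related_urls(resource_id: str) -> dict:
    """Hypothetical helper: derive the document and data URLs that
    correspond to a resource ID, following the three patterns above."""
    if not resource_id.startswith(RESOURCE_NS):
        raise ValueError("not a DBpedia resource ID: " + resource_id)
    local_name = resource_id[len(RESOURCE_NS):]
    return {
        "id": resource_id,                                         # ~ /id/alice
        "document": "https://dbpedia.org/resource/" + local_name,  # ~ /doc/alice
        "data": "https://dbpedia.org/data/" + local_name + ".ttl",  # ~ /doc/alice.rdf
    }


urls = related_urls("http://dbpedia.org/resource/Berlin")
print(urls["document"])  # https://dbpedia.org/resource/Berlin
print(urls["data"])      # https://dbpedia.org/data/Berlin.ttl
```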

Based on your email conversation and this GitHub issue, I understood the following problems / requests. But in the end we need you to show what actual problems you have: which particular client breaks, and why?

  • A: Enforcing https breaks Linked Data clients that do not support https. I understand your request to be that https is not enforced for Linked Data requests (RDF MIME types in the Accept header)? I would agree with this, and I think this could be discussed; especially when looking at problem C, we need to change something anyhow. @kurzum @pkleef
  • B: You would like to have an additional triple http://dbpedia.org/resource/Berlin owl:sameAs https://dbpedia.org/resource/Berlin for all resources. My question here is: why does this help? For me this would only lead to other problems. You always need Linked Data clients and tools that support inference, and I am afraid that people would start to use two different identifiers for the same thing, which definitely makes a lot of things trickier and can break stuff. Just imagine you also needed to change the class identifiers of the DBpedia ontology to https: then owl:sameAs won't help, and you would need owl:equivalentClass or would have to materialize all type statements both with and without https. And what happens to datatypes?

  • C: But when looking at the redirect chain, I think I identified an actual problem: a fallback to http, which does not make sense to me (?). @pkleef @kurzum maybe this is what actually breaks clients. I remember that if you download files with native Java from the databus/collections using the databus file identifiers, which use https, you can have a problem with redirects that point to non-https download locations (so the download URL is not https); see dbpedia/dbpedia-databus-collection-downloader@6091021.
http://dbpedia.org/resource/Berlin --[303]--> https://dbpedia.org/resource/Berlin --[303]--> http://dbpedia.org/data/Berlin.ttl --[303]--> https://dbpedia.org/data/Berlin.ttl
Fix option 1: https not enforced
http://dbpedia.org/resource/Berlin --[303]--> http://dbpedia.org/data/Berlin.ttl
Fix option 2: https enforced
http://dbpedia.org/resource/Berlin --[303]--> https://dbpedia.org/resource/Berlin --[303]--> https://dbpedia.org/data/Berlin.ttl

see #722
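The fallback to http in the redirect chain above can be spotted mechanically. The following Python sketch works on a chain given as a plain list of URLs (copied from this comment, not fetched live); the function name `scheme_downgrades` is hypothetical.

```python
from urllib.parse import urlsplit


def scheme_downgrades(chain: list) -> list:
    """Hypothetical helper: return every (from, to) hop in a redirect
    chain that goes from an https URL to an http URL."""
    return [
        (src, dst)
        for src, dst in zip(chain, chain[1:])
        if urlsplit(src).scheme == "https" and urlsplit(dst).scheme == "http"
    ]


# Redirect chain as described in this comment.
observed = [
    "http://dbpedia.org/resource/Berlin",
    "https://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/data/Berlin.ttl",
    "https://dbpedia.org/data/Berlin.ttl",
]

# Fix option 2 (https enforced), also from this comment.
fix_option_2 = [
    "http://dbpedia.org/resource/Berlin",
    "https://dbpedia.org/resource/Berlin",
    "https://dbpedia.org/data/Berlin.ttl",
]

print(scheme_downgrades(observed))      # one https -> http hop
print(scheme_downgrades(fix_option_2))  # []
```

A client that refuses https-to-http redirects would stop at the hop reported for the observed chain.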

So essentially DBpedia's http:// identifiers are canonical, and https:// should not be used and should only occur behind the scenes during the redirects?

We also have encountered variations on this issue.
Browsers increasingly look deep into a web transaction.
If the browser detects an http:// resource, that resource might get flagged (or blocked).
This was true when using SPARQLer (recently upgraded to https://).
However, we've seen instances of http:// endpoints in SPARQL queries fail when fetched using http://.

I think the easiest way to encounter this issue is just to grab the URL from the browser's address bar, which after the redirects is the https:// URL, and then use it somewhere else, like in a Linked Data browser.

You can rationalize that "this is not the canonical URL", but people just expect it to work.
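As a client-side workaround for the copy-and-paste case, a tool could map the https:// address-bar URL back to the http:// identifier before looking it up in the data. This is only a sketch of what a Linked Data browser might do on its own; it is not something DBpedia provides, and the function name `to_canonical_id` is hypothetical.

```python
HTTPS_NS = "https://dbpedia.org/resource/"  # what the address bar shows
HTTP_NS = "http://dbpedia.org/resource/"    # what the RDF data uses


def to_canonical_id(url: str) -> str:
    """Hypothetical helper: rewrite a pasted https:// DBpedia resource URL
    to the http:// identifier used in the data; leave other URLs alone."""
    if url.startswith(HTTPS_NS):
        return HTTP_NS + url[len(HTTPS_NS):]
    return url


pasted = "https://dbpedia.org/resource/Copenhagen"
print(to_canonical_id(pasted))  # http://dbpedia.org/resource/Copenhagen
```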

I agree that the Linked Data and Semantic Web practices and standards are quite old, not easy to understand, and not always super user friendly. IMO they were not designed to be consumed by humans or for use cases like your copy-and-paste browser usage.
DBpedia has existed since 2007, and the feature you request has a lot of pitfalls: it can break a lot of things, or make the identifiers even more confusing or just wrong in the future. (If you copy it from the browser you get the ID of the HTML page, not of the entity. Sorry, but that is a semantic difference that has been in place for a very long time; not DAU friendly though, I totally see that.) If a project starts from scratch now, it can just go with HTTPS-only identifiers, and then all this trouble is not an issue.

I tried it with Wikidata, and what you request does not seem to work there either, neither via SPARQL nor via Linked Data.
GitHub also has a separate "raw" namespace for downloading files, and distinguishes between a file's content and the HTML presentation of the file.

To move forward, I split the issue into the "actual" bug I discovered (#722) and your feature request.

I wouldn't blame the Semantic Web for this, as RDF doesn't really care about http:// or https:// :)

I would attribute this to legacy conventions/technical debt. As you mentioned the issue would be solved by making https:// canonical.

@JJ-Author another problem with http:// as canonical URIs is that they cannot be requested from a secure page.