Add cross-language DBpedia IRI support
Would you consider switching from SPARQL to DBpedia's Same Thing Service for converting DBpedia URIs?
This should be quite a bit faster than the SPARQL endpoint, and provides links between Wikidata, DBpedia Global IRIs, and all localized DBpedias. The hosted service can take on hundreds of concurrent requests -- thousands if it's lucky with cache hits. For users who want the lowest possible latency, hosting their own instance should be a breeze with the provided docker-compose file.
An example lookup for an Esperanto URI:
https://global.dbpedia.org/same-thing/lookup/?meta=off&uri=http://eo.dbpedia.org/resource/Geometrio
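For anyone curious what consuming that endpoint could look like from client code, here is a minimal Python sketch (the `global`/`locals` response fields are my assumption about the JSON shape; check the service documentation for the exact schema):

```python
import requests

LOOKUP_URL = "https://global.dbpedia.org/same-thing/lookup/"

def same_thing_lookup(uri):
    """Ask the Same Thing Service for all IRIs equivalent to `uri`."""
    response = requests.get(LOOKUP_URL, params={"meta": "off", "uri": uri})
    response.raise_for_status()
    return response.json()

result = same_thing_lookup("http://eo.dbpedia.org/resource/Geometrio")
# Assumed response shape: a global IRI plus the equivalent local IRIs.
print(result.get("global"))
print(result.get("locals"))
```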
I haven't used nifconverter yet, but I'm looking into it for converting some older entity linking datasets to more stable URIs. If this works out, I might be able to submit a PR.
(and yes: this is a shameless plug. I am the main author of the microservice.)
@aolieman yes of course! I was not aware of this service.
In the meantime I also discovered that GERBIL is doing this conversion using a locally stored index:
https://github.com/dice-group/gerbil/blob/6b6ecbd27a849f6d94fa25d8e92a1c5f968e4089/start.sh#L150
It could also be useful to provide support for that - that would probably reduce the network traffic.
Potentially, various subclasses of `URIConverter` could be implemented, each using a different method to retrieve the corresponding URIs. But it would make sense for yours to be the default, since your service is designed exactly for that.
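As a rough illustration of that design (the `URIConverter` base class and its `convert` method here are hypothetical; the actual nifconverter interface may differ), a Same Thing Service-backed converter could look something like this:

```python
from typing import Optional

import requests


class URIConverter:
    """Hypothetical base class: maps a URI to an equivalent one in a target namespace."""

    def convert(self, uri: str) -> Optional[str]:
        raise NotImplementedError


class SameThingConverter(URIConverter):
    """Converter backed by DBpedia's Same Thing Service."""

    def __init__(self, target_prefix="http://www.wikidata.org/entity/",
                 endpoint="https://global.dbpedia.org/same-thing/lookup/"):
        self.target_prefix = target_prefix
        self.endpoint = endpoint

    def convert(self, uri: str) -> Optional[str]:
        response = requests.get(self.endpoint, params={"meta": "off", "uri": uri})
        response.raise_for_status()
        # Assumed response field: "locals" lists all IRIs equivalent to `uri`.
        for candidate in response.json().get("locals", []):
            if candidate.startswith(self.target_prefix):
                return candidate
        return None
```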
That sounds good, having several options for URI converters is a great idea :-)
Thanks for pointing out the index that GERBIL uses. I expected them to be using something other than SPARQL, but I wasn't aware of what exactly. I think this index might include the English DBpedia only, though. Perhaps @MichaelRoeder or @RicardoUsbeck could clarify this.
Please let me know if there is anything I can do to help.
@aolieman if you are up for implementing a `URIConverter` which uses your service, I think it would be a great addition to the tool. You can probably reuse the tests for the existing SPARQL implementation, since the functionality is the same.
Hi all,
- Thanks for finding us. Actually, @TortugaAttack built the GERBIL service and may help here.
2a) Do you know https://www.sameas.cc/? It basically helps you map any (not only DBpedia) URIs to Wikidata.
2b) This https://zenodo.org/record/3227976 is a just-published newer version of the same, with fewer errors. I was also thinking of converting and publishing all GERBIL datasets (also some of the private ones, which would then become public) that are based on DBpedia URIs to Wikidata URIs. I'm glad that this is your goal now, as my student has not started yet. However, there are some issues to tackle if we want to have really good gold datasets (e.g. outdated URIs). I would be happy to assist here. Shall we have a call?
Hey
Regarding the index:
The online GERBIL service only uses the English DBpedia for the index.
That index is downloadable, though; I still have to find the URL.
However, you can create your own sameAs index, and thus use other languages as well, by modifying this script:
https://github.com/dice-group/gerbil/blob/master/index.sh
Using the following example, the index would look like the table below (where the URI column is the queryable one):
uri1 sameAs uri0, uri4, uri7.
uri2 sameAs uri3.
URI | sameAsURIs |
---|---|
uri1 | {uri0, uri4, uri7} |
uri2 | {uri3} |
Be aware:
In GERBIL we only need sameAs relations in one direction, hence we only index the subjects of the sameAs relation with their sameAs resources, and not the objects back to the subjects as well -> only DBpedia URIs are queryable.
It would be easy to modify this to make it work in both directions.
E.g.:
uri1 sameAs uri0, uri4, uri7.
uri2 sameAs uri3.
would result in the following when indexed bidirectionally:
URI | sameAsURIs |
---|---|
uri1 | {uri0, uri4, uri7} |
uri0 | {uri1, uri4, uri7} |
uri4 | {uri0, uri1, uri7} |
uri7 | {uri0, uri1, uri4} |
uri2 | {uri3} |
uri3 | {uri2} |
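To make the bidirectional variant concrete, here is a small Python sketch (my own illustration, not the GERBIL/Lucene code) that expands the one-directional sameAs statements into the table above:

```python
from collections import defaultdict

# The one-directional sameAs statements from the example above.
SAME_AS = {
    "uri1": {"uri0", "uri4", "uri7"},
    "uri2": {"uri3"},
}

def build_bidirectional_index(same_as):
    """Map every URI in an equivalence class to all other members of that class."""
    index = defaultdict(set)
    for subject, objects in same_as.items():
        members = {subject} | objects
        for uri in members:
            index[uri] |= members - {uri}
    return index

for uri, equivalents in sorted(build_bidirectional_index(SAME_AS).items()):
    print(uri, "->", sorted(equivalents))
```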
If you want to dig deeper into what the index looks like:
https://github.com/dice-group/gerbil/blob/master/src/main/java/org/aksw/gerbil/semantic/sameas/index/Indexer.java
and
https://github.com/dice-group/gerbil/blob/master/src/main/java/org/aksw/gerbil/semantic/sameas/index/Searcher.java
It is significantly faster than an HTTP SPARQL endpoint. It takes up to 2 days to create, though.
Happy to jump on a call whenever. I'd be keen to help convert datasets to Wikidata - however, in most cases it will involve a significant annotation effort, I'm afraid, for entities that are in Wikidata but not in DBpedia… Datasets annotated with NIL links can help with that, since only the NIL links need to be inspected. I have done this for the RSS-500 dataset and the result can be found here:
https://github.com/wetneb/opentapioca/tree/master/data
Hi all,
Thanks for such quick and in-depth responses. I wasn't expecting this much useful info.
@wetneb I'm certainly up for contributing a `URIConverter` that uses DBp STS. This month is going to be really busy though -- my first time organizing & giving a tutorial, on top of ongoing work -- so I probably can't get to this before the second half of July. I've set a reminder for myself in case this thread goes silent.
> Thanks for finding us. Actually, @TortugaAttack built the GERBIL service and may help here.
Well, thanks for developing GERBIL and spreading the word. It really takes a lot of the pain out of EL evaluation!
I was already familiar with sameas.cc, but when I looked into using it for mapping DBpedia Spotlight output to Wikidata, it seemed very incomplete on the Wikidata side (only 6 triples, if I'm not mistaken), and it was too slow to keep up with Spotlight's throughput. The coverage is still the same, but it has become a lot faster on the new triply.cc platform.
MetaLink is an impressive dataset which does include all sameAs links that we could possibly need right now. I think I found the equivalence class for the geometry example I gave above:
https://krr.triply.cc/krr/metalink/browser?resource=https%3A%2F%2Fkrr.triply.cc%2Fkrr%2Fsameas-meta%2Fid%2Fcomm%2F27346-0&focus=forward
For anyone targeting sameAs links beyond DBpedia and Wikidata, this would probably be the dataset to use. It looks like MetaLink can be updated in the future, but I haven't yet read anything about a regular release schedule. This is something the DBpedia community was looking for (monthly or bi-monthly releases), in the discussion that led to my work on the Same Thing Service.
@TortugaAttack thanks for the details on your index structure and its implementation! Had I been aware of this before I started building STS, I might have built on top of your code. My design did turn out to be essentially the same as yours, but with a bidirectional addition. I didn't want to go for a full bidirectional index in the way that you describe, with 125+ localized datasets and more mappings on the way.
Instead, all "local" URIs link to a DBpedia Global URI (i.e. the ID that represents the equivalence class), and the key of the global ID has as value all local URIs. This way, the lookup of a local URI (e.g. old-school DBpedia, Wikidata, or domain-specific) takes 1 or 2 milliseconds longer than the direct lookup of a global URI, but the same response is generated in either case, and the index can have a much smaller memory footprint.
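In rough Python pseudocode (a simplified sketch of the idea, not the actual STS implementation; the global ID and the Wikidata item below are just illustrative placeholders):

```python
# Every local URI points to its DBpedia Global URI (the equivalence-class ID)...
LOCAL_TO_GLOBAL = {
    "http://eo.dbpedia.org/resource/Geometrio": "https://global.dbpedia.org/id/EXAMPLE",
    "http://www.wikidata.org/entity/Q8087": "https://global.dbpedia.org/id/EXAMPLE",
}

# ...and only the global URI stores the full list of local URIs.
GLOBAL_TO_LOCALS = {
    "https://global.dbpedia.org/id/EXAMPLE": [
        "http://eo.dbpedia.org/resource/Geometrio",
        "http://www.wikidata.org/entity/Q8087",
    ],
}

def lookup(uri):
    """Resolve any URI (local or global) to its global ID and all local URIs."""
    global_id = LOCAL_TO_GLOBAL.get(uri, uri)  # the extra hop only happens for local URIs
    return global_id, GLOBAL_TO_LOCALS.get(global_id, [])
```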
I'm surprised to hear that it can take up to two days to create the index. I can't imagine that Lucene would form the bottleneck here. Just curious: what are you using as input? Could it simply be the parsing that takes ages?
> I was also thinking of converting and publishing all GERBIL datasets (also some of the private ones, which would then become public) that are based on DBpedia URIs to Wikidata URIs.
This would be very valuable, but indeed many of the benchmarks are starting to become very outdated and include URIs that no longer exist due to Wikipedia activity. If you're happy to have me on the call as well, I could summarize a discussion we had about this on the DBpedia Slack a couple of years ago (in #spotlight-live-paper), where some preliminary analyses were done and we came up with strategies to fix the URIs in the benchmarks using Wikipedia's edit history. Is something along those lines what you also had in mind, @RicardoUsbeck?
> I'd be keen to help convert datasets to Wikidata - however, in most cases it will involve a significant annotation effort, I'm afraid, for entities that are in Wikidata but not in DBpedia…
I read about your updates to RSS-500 in the OpenTapioca paper, great stuff! But indeed a whole lot of effort. Since a couple of entity linkers have now been published which have Wikidata as a direct target KB, a pooling approach is becoming increasingly feasible. This sounds big enough to me to justify a full evaluation campaign, perhaps involving GERBIL for the pooling of system links. I'm guessing that if we -- or whoever organizes this hypothetical campaign -- ensure that gold standard judgments will be created at the end, this should provide enough incentive for system creators to participate.
Hey,
Lucene is not the bottleneck, it is my network speed :D
Basically, it streams the bzip2 archives from the downloads.dbpedia URL and decompresses them on the fly.
The download speed is the bottleneck :/
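Not that it changes the bottleneck, but for anyone curious, that kind of on-the-fly streaming decompression can be sketched in Python roughly like this (a generic illustration, not the GERBIL indexing code; the download URL is a placeholder, and a single bzip2 stream per file is assumed):

```python
import bz2
import requests

def stream_lines(url, chunk_size=1 << 20):
    """Stream a remote .bz2 dump and yield decoded lines without storing the archive."""
    decompressor = bz2.BZ2Decompressor()  # assumes a single bzip2 stream per file
    buffer = b""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            buffer += decompressor.decompress(chunk)
            *lines, buffer = buffer.split(b"\n")
            for line in lines:
                yield line.decode("utf-8")
    if buffer:
        yield buffer.decode("utf-8")

# Hypothetical usage; substitute a real DBpedia download URL:
# for triple in stream_lines("https://downloads.dbpedia.org/.../some-dump.nt.bz2"):
#     index_triple(triple)
```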
Hi all,
I would like to add my two cents 😉
DBpedia Same Thing Service: This service looks pretty good 👍. The use of a global IRI that represents many local IRIs sounds very good to me. It may improve the runtime of our retrieval of sameAs links within GERBIL a lot 🤔
Updating datasets: I think the majority of the knowledge extraction community would agree that updating the datasets would be very good 👍. However, it looks like somebody has to play the role of the devil's advocate:
- If you update datasets, it is very unlikely that you will be able to do that without manually checking them, simply because their current quality is not always very good (See https://svn.aksw.org/papers/2017/ESWC_EAGLET_2017/public.pdf for details)
- The community has no common definition of what should be extracted (i.e., marked as an entity). I guess one would simply have to choose among the different existing rules for which entities should be marked in the texts and how this should be done. Following the rules of the individual datasets (e.g., AIDA/CoNLL's 7-class extraction vs. typical Person/Place/Organisation approaches) might add a lot of additional effort.
- @RicardoUsbeck "also some of the private ones, which would then become public" will not work. The issue is not that the markings of entities are not public - typically they are. The problem is that the documents themselves are not public. Even if you put a lot of effort into updating the markings of entities, you cannot change the existing license under which the texts have been published. Famous examples are the CoNLL datasets, for which you can download the entity markings but need the original texts to generate the gold standard. These original texts are not publicly available without paying for them.
Future directions: From the perspective of the GERBIL project, we would like to use an external service, since the retrieval of IRIs representing the same entity is a common issue and it simply adds complexity to GERBIL that we do not really want to have. So we are happy to collaborate if we can get a service that gives us the effectiveness and efficiency of our current implementation.
Outdated URIs: In GERBIL, we use the Wikipedia API, which offers redirects from outdated URIs to the new Wikipedia articles. It is a pretty easy and fast approach. However, Wikipedia offers this only as long as there is no new article that "takes over" the title of the old article.
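For illustration, resolving those redirects via the MediaWiki API can look roughly like this (a sketch of the general approach, not GERBIL's actual code):

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def resolve_redirect(title):
    """Follow a Wikipedia redirect from a (possibly outdated) article title to its target."""
    params = {"action": "query", "format": "json", "redirects": 1, "titles": title}
    data = requests.get(API_URL, params=params).json()
    # If the title redirects, the API lists the new target under query.redirects.
    for redirect in data.get("query", {}).get("redirects", []):
        if redirect.get("from") == title:
            return redirect["to"]
    return title  # no redirect: the title is either current or missing entirely
```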
Thanks for these caveats, I agree it's not straightforward!
About the availability of the datasets, I will personally not spend any effort annotating a dataset which cannot be distributed freely in its entirety (annotations and underlying text). It's not like there is not enough freely reusable quality text on the web… I know it rules tweets out - then so be it. Ease of reproducibility is just too important for me.