dice-group/gerbil

Several changes on version 1.2

ndcuong69 opened this issue · 37 comments

The performance of the annotators has dropped seriously and several runtime errors have appeared since the update to version 1.2.
Please see the difference in the annotators' performance on the D2KB task between the two versions:
http://gerbil.aksw.org/gerbil/experiment?id=201510300001
http://gerbil.aksw.org/gerbil/experiment?id=201511060000
Please explain the changes.
Cuong Nguyen

Hi Cuong Nguyen,

the new version 1.2.0 contains a lot of changes. We are not using the BAT-Framework anymore and the whole evaluation process is based on URI matching. Thus, GERBIL is now able to handle entities that are not present in a KB, and an annotator might be asked for them. We have started to update the documentation in the wiki and will document the changes for the "official" release that we plan for the next days. However, the old version 1.1.4 caused problems on the server and we had to replace it with this new one even though it had not been released yet.

However, we discussed the D2KB task several times and had to realize that the current implementation does not exactly match the mathematical description. At the moment, GERBIL does not filter named entities that are returned by the annotator but that have not been asked for. Thus, the quality ratings of an annotator get worse if it disambiguates more named entities than those mentioned inside the request. We will change this as soon as possible.
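
To make the intended fix concrete, here is a minimal sketch of such a filtering step (the Mention type and method names are hypothetical simplifications, not GERBIL's actual API): only annotations whose character span was part of the request are kept.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class D2KBFilterSketch {

        // Hypothetical, simplified mention type: just a character span in the text.
        static class Mention {
            final int start, length;
            Mention(int start, int length) { this.start = start; this.length = length; }
            boolean sameSpan(Mention o) { return start == o.start && length == o.length; }
            public String toString() { return "(" + start + "," + length + ")"; }
        }

        // Keep only those returned annotations whose span was part of the original request.
        static List<Mention> filterToRequested(List<Mention> returned, List<Mention> requested) {
            List<Mention> kept = new ArrayList<>();
            for (Mention m : returned) {
                for (Mention r : requested) {
                    if (m.sameSpan(r)) {
                        kept.add(m);
                        break;
                    }
                }
            }
            return kept;
        }

        public static void main(String[] args) {
            List<Mention> requested = Arrays.asList(new Mention(0, 6));
            List<Mention> returned = Arrays.asList(new Mention(0, 6), new Mention(10, 4));
            // Only the mention at (0,6) was asked for, so the extra one is dropped.
            System.out.println(filterToRequested(returned, requested));
        }
    }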

We are sorry for any inconvenience this has caused.

Update

The higher error counts are caused by communication errors. Since they might be caused by temporary network problems, only further experiment runs can show whether these errors have been caused by our new version or not.

The main problem should be fixed in version 1.2.1 (https://github.com/AKSW/gerbil/releases/tag/V1.2.1). I repeated your experiment with this version: http://gerbil.aksw.org/gerbil/experiment?id=201511090004

Thanks for the fast fix.
Please check the A2KB task as well:
http://gerbil.aksw.org/gerbil/experiment?id=201511100000

Hi Michael,
Can you give me a hint on how to test my annotator against the existing annotators in
GERBIL?
I took a look at:
https://github.com/AKSW/gerbil/wiki/How-to-add-a-new-Annotator
but really do not know where to start.
If I want to develop a NIF web service for my annotator, can you point me to
a starting point?
Thanks in advance.
Cuong

Hi Cuong,

this is a good question.

There is a branch in which I created an example NIF-based web service. It is a very simple example which wraps a client of DBpedia Spotlight. However, it might be helpful to see how such a web service might work.
The important part of this example is this class: https://github.com/AKSW/gerbil/blob/SpotWrapNifWS4Test/src/main/java/org/aksw/gerbil/ws4test/SpotlightResource.java
It contains the parsing of the request and the generation of the response. The important steps are the following:

        Reader inputReader;
        // 1. Generate a Reader, an InputStream or a simple String that contains the NIF
        // sent by GERBIL
        // 2. Parse the NIF using a Parser (currently, we use only Turtle)
        TurtleNIFDocumentParser parser = new TurtleNIFDocumentParser();
        Document document;
        try {
            document = parser.getDocumentFromNIFReader(inputReader);
        } catch (Exception e) {
            LOGGER.error("Exception while reading request.", e);
            return "";
        }
        // 3. use the text and maybe some Markings sent by GERBIL to generate your Markings 
        // (a.k.a annotations) depending on the task you want to solve
        // 4. Add your generated Markings to the document
        document.setMarkings(new ArrayList<Marking>(client.annotateSavely(document)));
        // 5. Generate a String containing the NIF and send it back to GERBIL
        TurtleNIFDocumentCreator creator = new TurtleNIFDocumentCreator();
        String nifDocument = creator.getDocumentAsNIFString(document);
        return nifDocument;

Note that this example uses our gerbil.nif.transfer library (https://github.com/AKSW/gerbil/releases/tag/gerbil.nif.transfer-v1.2.0). The Markings (e.g., named entities) that can be added to the document are listed here: https://github.com/AKSW/gerbil/wiki/Document-Markings-in-gerbil.nif.transfer

Does this answer your question or do you need further assistance?

It is helpful. Thanks.

Hi Michael,

First of all, congratulations and thank you for the huge amount of work done for this version. Really appreciated.

Anyway, I have noticed something strange: a significant drop in WAT's D2KB performance with the new version. Using the older version I get around 80% F1 across several datasets, while now the performance has dropped to about 60%. Could this be caused by an error/regression in the code? Can you explain how the metrics for the D2KB experiment are now calculated?

Hi Francesco,

thank you very much! :)

In general, the new evaluation is a little bit tougher than the one of the BAT-Framework. While in the old versions unknown entities (those that were not part of Wikipedia) were discarded by the Wikipedia-ID-based BAT-Framework, we are now able to take those entities into account. We implemented a knowledge-base-agnostic approach for URI matching that is described here.
Please take a look at the EE Micro F1 score column in this experiment http://gerbil.aksw.org/gerbil/experiment?id=201511100004 where you can see that WAT couldn't disambiguate one of those unknown/emerging entities.

However, if the results of the InKB columns differ very much from the results of the old version, I think we should take a closer look. Thus, I compared the results for the KORE50 dataset locally.

version 1.1.4
micF1=0.5899280309677124
micPrecision=0.6074073910713196
micRecall=0.5734265446662903
macF1=0.531333327293396
macPrecision=0.5746666193008423
macRecall=0.5296666622161865
errors=0

version 1.2.1
micF1=0.6039215686274509
micPrecision=0.6936936936936937
micRecall=0.5347222222222222
macF1=0.5065396825396826
macPrecision=0.5386666666666666
macRecall=0.4863333333333334
errors=0

As you can see, the results are only slightly different.

Can you give a particular example for which WAT performed better in the older version? I would prefer a small dataset since for small examples I can compare the evaluations manually.

Cheers,
Michael

I added a wiki page explaining the D2KB task (https://github.com/AKSW/gerbil/wiki/D2KB) using an example. Please let me know whether this explanation is helpful.
Especially the last section might be interesting for A2KB annotators like WAT.

Do you know of any package for .NET that can process NIF like TurtleNIFDocumentParser
does in Java?

You might want to have a look at the web page of the NLP2RDF group ( http://site.nlp2rdf.org/ ), which is working on NIF. Maybe you can find a pointer to a NIF library for .NET there.
Otherwise, you can still stick to a "normal" RDF package to read and write NIF.
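
To illustrate that second option, here is a rough Java sketch using Apache Jena as the "normal" RDF library (the same idea applies to any .NET RDF toolkit; the tiny inline NIF document is only an example, not something GERBIL would send verbatim):

    import java.io.StringReader;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.RDFNode;
    import org.apache.jena.rdf.model.Statement;
    import org.apache.jena.rdf.model.StmtIterator;

    public class PlainRdfNifExample {

        static final String NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#";

        public static void main(String[] args) {
            // A minimal NIF request body; in practice this would be the Turtle sent by GERBIL.
            String turtle =
                  "@prefix nif: <" + NIF + "> .\n"
                + "<http://example.org/doc#char=0,13> a nif:Context, nif:RFC5147String ;\n"
                + "    nif:isString \"Berlin is big\" .\n";

            Model model = ModelFactory.createDefaultModel();
            model.read(new StringReader(turtle), null, "TTL");

            // The document text hangs on the context resource via nif:isString.
            Property isString = model.createProperty(NIF, "isString");
            StmtIterator it = model.listStatements(null, isString, (RDFNode) null);
            while (it.hasNext()) {
                Statement s = it.next();
                System.out.println("Document text: " + s.getString());
            }
            // Annotations would be written back as nif:anchorOf / nif:beginIndex /
            // itsrdf:taIdentRef triples and serialized with model.write(..., "TTL").
        }
    }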

Sorry for the delay; here you can see the performance using version 1.1.4. As you can see, there is a drop in performance of more than 20%.

[screenshot: D2KB results table with version 1.1.4]

The new results are available here: http://gerbil.aksw.org/gerbil/experiment?id=201511110001

Thanks for the result tables. Yes, there seems to be a difference between the micro F1-score of version 1.1.4 and the InKB micro F1-score of version 1.2.1.
I will use one of the shorter datasets for a deep comparison of the two evaluations to make sure that I didn't miss a detail or a special case. However, this will take some time. I might have an answer on Monday.

Same problem with my approach. The new InKB micro F1 does not match the old micro F1 score at all.
Is there already a short documentation of what Micro F1, InKB Micro F1 and EE Micro F1 are all about?

Over the weekend, I compared the evaluations of the two GERBIL versions using examples from the MSNBC dataset and the annotations of WAT. Besides a minor bug (#103), I only found the already known differences.

It seems that the problem is mainly caused by the disadvantage of our new approach described in this section. Since the BAT-Framework mapped every URI to a Wikipedia ID, URIs with typos or outdated URIs were identified and removed from the gold standard automatically.

In contrast, our current implementation trusts the gold standard blindly. Therefore, our approach is to update the URIs inside the datasets within a small student project. However, this will take some time (our current plan is to have updated datasets in June 2016).

It could be argued that adding a check of whether a URI exists or not is possible. While this sounds like a practical solution, I am not convinced that it is really a good and general one, especially if we think of GERBIL as being knowledge-base-agnostic, since not all knowledge bases are available through HTTP.
What do you think?

@quhfus
InKB Micro F1-score is the Micro F1-score calculated on entities that have been identified as being part of a well-known knowledge base (like DBpedia). EE Micro F1-score is calculated on so-called emerging entities, which are entities that are not present in the knowledge base.
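
For reference, the underlying scores are the usual micro-averaged precision, recall and F1-measure, just computed on the respective partition of the annotations; written out (this is the standard definition, not a quote from the GERBIL documentation):

    P_{micro} = \frac{\sum_d TP_d}{\sum_d (TP_d + FP_d)}, \qquad
    R_{micro} = \frac{\sum_d TP_d}{\sum_d (TP_d + FN_d)}, \qquad
    F1_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}}

where the counts are summed over all documents d and restricted to the annotations whose gold URI belongs to a known KB (InKB) or does not (EE), respectively.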

I understand your point regarding the URI matching (which is one of the possible ideas for being KB-agnostic), but the idea of refreshing the datasets every now and then (especially by hand) does not seem like a good solution to me. (Un)fortunately, Wikipedia, which is the KB used by most of the datasets (if not all), is going to evolve and change on a daily basis; therefore, forcing a dataset to be valid only at a specific point in time seems very limiting, especially if we want to keep using GERBIL as a benchmark system to compare up-to-date annotation systems.

Probably the best way to cope with the problem would be to add an additional temporal dimension (or temporal instantiation) to each dataset. Whenever a user starts a new annotation, the system would try to update the gold annotations of each user-selected dataset in the same way the BAT-Framework did. If no updates are needed and the results are in the cache, you can directly show the user the results from the database; otherwise, a new annotation process takes place with the up-to-date dataset.

I totally agree with your point regarding the daily changing Wikipedia and the need for datasets that are up to date. However, I would adapt the workflow a little bit.

  • GERBIL could offer an interface for an EntityChecker that simply checks whether an entity with a given URI exists.
    • An HTTP-based implementation could do this via an HTTP HEAD request (see the sketch after this list).
      • We could add instances of this class for KBs that can be accessed via HTTP, like Wikipedia or DBpedia.
      • For reducing the latency, a simple cache could be used which contains the URIs and the HTTP response status (and their age).
    • With this approach we are still open to other ways of implementing the entity checking, e.g., on a local KB.
  • During the loading procedure of a dataset, we could check all entities using the EntityChecker instances that can be defined in the gerbil.properties file.
  • If an entity has been identified as not present any more, it is still part of the gold standard (since it is still an entity) but it gets a generated URI and will be classified as an emerging entity.
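
To make this proposal a little more concrete, here is a minimal sketch of such an HTTP-HEAD-based checker (the EntityChecker interface shown here is only assumed, not existing GERBIL code, and the cache does not track the age of its entries):

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Assumed interface, not (yet) part of GERBIL.
    interface EntityChecker {
        boolean entityExists(String uri);
    }

    // Checks existence via an HTTP HEAD request and caches the result per URI.
    class HttpHeadEntityChecker implements EntityChecker {

        private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

        @Override
        public boolean entityExists(String uri) {
            return cache.computeIfAbsent(uri, u -> {
                try {
                    HttpURLConnection con = (HttpURLConnection) new URL(u).openConnection();
                    con.setRequestMethod("HEAD");
                    con.setInstanceFollowRedirects(true);
                    int status = con.getResponseCode();
                    con.disconnect();
                    // A 2xx response (possibly after a followed redirect) means the resource exists.
                    return status >= 200 && status < 300;
                } catch (Exception e) {
                    // Network problems: treat as "unknown" rather than "non-existing".
                    return true;
                }
            });
        }
    }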

What do you think?

@nopper @MichaelRoeder I'm not a user of GERBIL, but I follow the project as I work on some related issues, like this one about outdated identifiers in the annotated datasets.

As we rebuilt our entity models on newer versions of DBpedia/Wikipedia, the identifiers would be rendered useless. We created an abstraction on top which abstracts topic identifiers and keeps a record of all possible past identifiers for a topic. Because we process the wiki quite often, they are kept up to date.

Hi all, Micha and I are thinking about this issue constantly and we will have a more sophisticated suggestion by the end of Q1 2016. Until then, I hope every one of you is happy with the state of GERBIL. Are you guys OK with closing this issue, or is there anything else to discuss? We are happy to help you.

Hey. Is there any possibility to let the "old" BAT-Framework-based version run on a server until the updated datasets are installed (maybe on another domain and marked as deprecated)? Would this cause any significant problems? I am asking because I have used the framework for my PhD experiments so far, and my results are not consistent any more. I use a relatively new version of Wikipedia and a couple of entity URIs differ from those used in the datasets.

I think Ricardo raised an important point. It may take some time to implement the idea mentioned above because GERBIL is not funded and our resources are limited. Sorry for this 😟

@dav009 That sounds interesting. Unfortunately, I didn't get the idea (sorry). Do you suggest using the topics of your model to identify resources that got another URI over time?
Or is anybody aware of a list containing resources and their old and new URIs, or Wikipedia articles and their old and new titles?

@quhfus That is not easy but not impossible 😉
Let's discuss this via mail. You can find my address on this page.

@RicardoUsbeck I would recommend leaving this issue as it is. Otherwise, another user of GERBIL could think that we are not aware of the changed behaviour and start a completely new discussion.
On the other hand, it might push us to get the implementation done 😉

Hi all and thanks for the great discussion. To be honest, I am not entirely sure about the HTTP HEAD solution. What I fear is that sometimes you cannot avoid using page IDs to successfully complete the refresh process. Say, for example, that there is a mention in an old dataset from 2012 which links to the entity President_of_France through the URI http://en.wikipedia.org/wiki/President. But now assume that in the current Wikipedia the URI http://en.wikipedia.org/wiki/President is used as a disambiguation page or a redirect, or, even worse, is used to indicate the president of the United States. Using a HEAD request would not detect this kind of topic drift inside the Wikipedia knowledge base, because President existed before and still exists now in Wikipedia, but it now refers to a completely different entity. The only way to get rid of this is to use page IDs, which are guaranteed to be fixed.

I also agree with @dav009 that maybe it's better, for the moment, to keep a legacy version of the benchmark (say, the 1.2.x version) running on a different server / URL.

As a starting point, I think each annotated dataset should mention the Wikipedia version used in the annotations.

@nopper I totally agree with you that the HTTP HEAD approach is not the ideal solution. I presented it as a fast and easy way to identify annotations that are clearly outdated.
Thus, we still want to update the existing datasets during a small project (if we can get a student for that 😉 ).
Regarding your example, I agree that this is possible and that in this special case the usage of Wikipedia IDs might have prevented this topic drift. However, I see two points that should be taken into account regarding this case.

  1. I think that this happens very rarely (please correct me if I'm wrong) and that the datasets contain a lot of other types of errors that occur more often and that we should care about before thinking about such a special case.
  2. We already had this Wikipedia-ID-based system (the BAT-Framework), which had (from our point of view) more disadvantages than a URI-based system. But you might be interested in the Wikidata project. I think that they do not change the URIs of entities over time since their URIs contain no semantics but a simple ID, e.g., https://en.wikipedia.org/wiki/President owl:sameAs https://www.wikidata.org/wiki/Q30461. Thus, it might be interesting to use these URIs for annotations.

Keeping a legacy version is possible. But since 1.1.4 does not run stably any more (the reason why we had to deploy the new version), we would have to invest additional time to develop a new stable version out of 1.1.4. Additionally, we wouldn't be able to offer the archiving and the keeping of URIs for this legacy version. Thus, this does not sound like a good solution to me.

@dav009 You are right, and we have already made some steps in that direction. The datasets of the N³ collection contain this metadata. Unfortunately, there is no standard for how this can be expressed. It would be nice to have a URI for a certain version of Wikipedia or DBpedia, but such URIs do not exist. Thus, we added a String containing the DBpedia version we used for the annotations.

<http://aksw.org/N3/Reuters-128/101#char=94,124>
      a       nif:RFC5147String ;
      nif:anchorOf "Nippon Telegraph and Telephone"^^xsd:string ;
      nif:beginIndex "94"^^xsd:nonNegativeInteger ;
      nif:endIndex "124"^^xsd:nonNegativeInteger ;
      nif:referenceContext
              <http://aksw.org/N3/Reuters-128/101#char=0,484> ;
      itsrdf:taIdentRef <http://dbpedia.org/resource/Nippon_Telegraph_and_Telephone> ;
      itsrdf:taSource "DBpedia_en_3.9"^^xsd:string .

@MichaelRoeder Yeah, there is no canonical URL. I've been relying on the Wikipedia dump date as an identifier, as I assumed dumps to be static.

On further thought, Wikipedia dumps seem to be removed within a year (https://dumps.wikimedia.org/enwiki/ only contains 2015 dumps).

@nopper @MichaelRoeder Most of the cases I've found are titles becoming disambiguation pages, e.g., Leninsk-Kuznetsky; the reason is that the URI carries semantics with it.

I think QIDs (Wikidata) would definitely make this better, since in that case:

  • Leninsk-Kuznetsky would initially get QID1 (let's say the city sense was initially described in the wiki article)
  • When Wikipedia finds out that it is an ambiguous title and should be a disambiguation page, Wikidata would probably:
    • create QID2 (for the city sense)
    • create QID3 (for the district sense)
    • make QID1 point to QID2.

So QID1 would still be resolvable.

Not sure if I'm missing any important bit here :)

After installing GERBIL on my local Windows system and fixing some errors in Unix-style text-file reading, I took a deep look into the code and made some discoveries:
The URI matching is based on owl:sameAs relations retrieved when querying "http://dbpedia.org/page/entity_name" in the class org.aksw.gerbil.semantic.sameas.HTTPBasedSameAsRetriever.
The solution for the issues we discussed could be:

  • for "redirecting Wikipedia names": use the relation dbo:wikiPageRedirects in the same returned page
  • for "checking the existence of a Wikipedia name": when querying a non-existing entity, DBpedia.org returns its own 404 page; detect it and we know the existence of the queried entity. It often returns the error "Couldn't find an RDF language for the content type header value..." but I am sure about this.

Regarding the daily change of Wikipedia: the gold sets only cover a very small part of Wikipedia, so the probability that they are frequently changed is very small. And any change has to be propagated to DBpedia, which provides the entity information. Therefore, an experiment just needs to report its execution time together with the DBpedia version.

Regarding the datasets: I am working mainly with four datasets, ACQUAINT, ACE, MSNBC and IITB.
The class MSNBCDataset is used to load the first three and it discards all annotations whose chosen annotation is "null".
However, the class IITBDataset, which is used to load IITB, accepts all annotations with an empty string as non-KB entities. In IITB, there are 9184/19712 (= 46.6%) annotations with an empty-string Wikipedia page name. These entities cannot be entity IDs in any other KB, so please remove them when loading the IITB dataset.
Redirecting Wikipedia names and empty strings are two painful problems in testing. Whenever you try to make a good annotation, you get a wrong judgement on your correct annotation.

First of all, I want to thank Cuong for his effort in looking into the details of GERBIL's evaluation. He raised several points in his last post.

using dbo:wikiPageRedirects

Let's assume we have http://en.wikipedia.org/wiki/entity_old in our dataset and we want to map it to http://dbpedia.org/resource/entity_new.

I already tested the solution Cuong explains and it does not work as expected. The class org.aksw.gerbil.semantic.sameas.HTTPBasedSameAsRetriever retrieves the RDF model of an entity by dereferencing its URI. Thus, it gets the model for http://dbpedia.org/resource/entity_name, which is different from the information you get if you open this URI in your browser and are redirected to http://dbpedia.org/page/entity_name. The main difference is that the RDF model does not contain triples in which the entity is the object. However, the triple that we would like to have is always formulated in the following way.

<http://dbpedia.org/resource/entity_old> dbo:wikiPageRedirects <http://dbpedia.org/resource/entity_new> .

Thus, the model does not contain these triples.

The other way around (requesting the model of http://dbpedia.org/resource/entity_old and searching for that triple) does not work either, because DBpedia returns an error for these old resources (even if they are linked on the page http://dbpedia.org/page/entity_new).

Retrieving the HTML pages of DBpedia entities and searching them for dbo:wikiPageRedirects would be another way to access the redirects, but it does not seem to be a good solution since we would be bound to the format of an HTML page instead of using established standards like RDF models or APIs.

I extended the concept of the sameAs retrieval towards different classes that fulfil different tasks. Besides the HTTP-based dereferencing of a URI, there are now classes that

  • transform DBpedia URIs into Wikipedia URIs (and vice versa)
  • can ask the Wikipedia API for redirects (which handles exactly the case described above; see the sketch after this list)
  • handle errors in URIs, e.g., correcting the wrong domain en.dbpedia.org returned by some systems
  • handle the encoding and decoding of URIs
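
As an illustration of the redirect handling mentioned in the list, here is a rough sketch of how the Wikipedia API can be asked to resolve redirects for a title (the JSON response is just printed instead of being parsed properly, which a real implementation would of course do):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class WikipediaRedirectLookup {

        public static void main(String[] args) throws Exception {
            // Ask the MediaWiki API to resolve redirects for a given article title.
            String title = "Home_Depot";
            String api = "https://en.wikipedia.org/w/api.php?action=query&format=json&redirects=1&titles="
                    + URLEncoder.encode(title, "UTF-8");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(api).openStream(), "UTF-8"));
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
            in.close();
            // The "redirects" array of the response contains {"from": ..., "to": ...} pairs.
            System.out.println(json);
        }
    }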

You can see that we haven't been lazy during the last weeks. However, the new version replacing the criticised version 1.2.1 is not ready yet.

Checking existence of Wikipedia article names

This is an interesting way to check the existence of these articles. However, with such an implementation we stick to a special behaviour of the current DBpedia. I think using the established Wikipedia API is a better way, isn't it?

Daily change of Wikipedia

I agree with you that the probability of an entity URI being affected by the daily changes of Wikipedia is very low. And I understand that for those users of GERBIL who are using DBpedia, the daily Wikipedia changes are no problem.

However, we have to remember that not all annotation systems have been designed to rely on DBpedia; some use Wikipedia as their knowledge base. And since we would like to make GERBIL knowledge-base-agnostic, we would like to develop a solution for that.

Additionally, the datasets are getting older, and instead of focussing on the probability of a URI changing during the next days, we should think about the probability that an entity will keep the same URI during the next years. During the work on this issue we already found entities that have been affected by URI changes.

URIs with null

While strings like null or *null* are not valid URIs, GERBIL will keep these named entity annotations. Thus, it would be a bug if the MSNBCDataset discarded them (I just added this to the JUnit test and the class didn't discard them).

In our understanding of the task, these entities are still valid and carry important information. We know that there is a named entity inside the text at a certain position and we know that it cannot be linked to one of the established knowledge bases. Thus, your annotator should return exactly this information by generating a URI for this entity that does not have the namespace of an established knowledge base. Please take a look at the URI matching and the examples shown there. The null URIs behave exactly like the http://aksw.org/notInWiki/Berlin URI in the example.

@ndcuong69 Why do you think that Wikipedia name redirecting as well as empty Strings are two painful problems? Can you give an example in which your annotator returns the correct URI that should be matched to the URI of the gold standard but is marked as wrong by GERBIL?

MichaelRoeder: I reformatted this post of ndcuong69 and added the other three so this discussion becomes easier to follow. Please look at the following post of MichaelRoeder to see the content that ndcuong69 wrote here.

Formatted Post from @ndcuong69

using dbo:wikiPageRedirects

In the current DBpedia, if you query http://dbpedia.org/resource/entity_name, it redirects to http://dbpedia.org/page/entity_name.

In the class HTTPBasedSameAsRetriever, GERBIL retrieves the sameAs triples from lines (in the returned HTML file) such as:

<li><span class="literal"><a class="uri" rel="owl:sameAs" href="http://rdf.freebase.com/ns/m.01zj1t"><small>freebase</small>:Home Depot</a></span></li>

In the same returned file, there are lines such as:

<li><span class="literal"><a class="uri" rev="dbo:wikiPageRedirects" xmlns:dbo="http://dbpedia.org/ontology/" href="http://dbpedia.org/resource/Expo_Design_Center"><small>dbr</small>:Expo_Design_Center</a></span></li>

It is nearly equivalent to:

<http://dbpedia.org/resource/entity_new> dbo:wikiPageRedirects <http://dbpedia.org/resource/entity_old> .

This is the reverse-order triple.

My system parses the redirect data (downloaded from Wikipedia or DBpedia) to store all triples and efficiently retrieve the triple (entity_old, redirects to, entity_new).

There are two different ways now:

(1) Stick to the structure of the current system: query (old --> new) without storing data:
Query http://dbpedia.org/resource/entity_old
Look at <body onload="init();" about="http://dbpedia.org/resource/The_Home_Depot"> - the "Title" of the returned page is wrong!
--> done.

(2) Store all triples (old, to, new) and query them anytime you need the information.

I prefer this way for some academic reasons:

The relation owl:sameAs represents the relationship between entities from:

  • different KBs: YAGO vs. Wikipedia, etc.
  • different languages: Home Depot in English or French, etc.

Relation "redirect" is used to represents the relationship between:

  • "old entity" --> "new entity": because each Wiki page is considered as one entity, so "The_Home_Depot_U.S.A." can be the "old" name of the new entity "The_Home_Depot"
  • wrong-name entity --> right-name entity: "Home_Depot_U_S_A,_Inc." to "The_Home_Depot"
  • other-name entity --> original entity: "Home_Depot", "The Home Depot, Inc." --> "The_Home_Depot". Remember that all three names are valid (see the content of Wiki page: "The_Home_Depot")

The third relation is the most valuable usage of the "redirect" relationship because it represents polysemy: several names link to the same entity. It has a role similar to "sameAs", but inside one KB.

A problem without using redirects in entity linking:
In MSNBC, file Bus16451112.txt, there are several annotations such as:
(start char index, end char index, surface form, my output, chosen annotation)

69|79|Home Depot|The_Home_Depot|Home_Depot     
978|990|A.G. Edwards|A._G._Edwards|A.G._Edwards

Notice that my output is the page that has the true content while the chosen annotation is a redirecting name. Forcing the system to return one of the redirecting names that is similar to the surface form is not the right thing to do!

Number of redirected chosen annotations in MSNBC: 115/755.
I will collect the numbers for the other datasets.

URIs with null

I understand your idea about "notInKB", but we are not sure about the intent of the annotators at the time the gold dataset was created.

In my opinion:

  • the idea of "notInKB" was not available yet at the time the datasets were created.
  • the "empty string" is valuable for the task of "mention detection" but not for the task of "disambiguation in entity linking"

I take some examples from the IITB dataset:
file 13Oct08AmitHealth1.txt (start char index, end char index, surface form, my output, chosen annotation)

278|285|results|Result||0                                
97|103|hectic|Hectic||0
215|227|the greatest|nil||2
593|599|fluids|Fluid||0
1497|1510|energy levels|Energy_level||0
1249|1255|intake|Intake||0
1401|1408|reasons|Reason||0
4301|4311|guidelines|Guideline||0

but in the same file:

6071|6077|intake|Intake|Nutrition|0

Checking existence of Wikipedia article names

This is an interesting way to check the existence of these articles.
However, with such an implementation we stick to a special behaviour of the
current DBpedia. I think using the established Wikipedia API is a better
way, isn't it?

DBpedia provides this service for GERBIL through Virtuoso.

If you want to use Wikipedia: parse the Wiki dump --> store it in a DB --> query it whenever you need the info. My system works this way. If you only use some basic information from the Wiki, it only takes a short time to implement.

If you are interested in it, install a MySQL server, parse the Wiki dump to put the info into the DB server and provide an in-house service for the Wiki data.

The problem is that it has to be updated whenever there is a new Wiki dump.

An extra problem: Chosen annotation as an ambiguous name

An ambiguous page should not be used as a chosen annotation. A disambiguation page can be discovered by querying DBpedia or the Wiki data.

using dbo:wikiPageRedirects

You are completely right regarding the HTML pages of the resources. However, it seems that there is a general misunderstanding. We don't want to parse an HTML page, since DBpedia might change it (even slightly) and could break our algorithms. That's why we want to stick to APIs that are better specified than a custom HTML page, i.e., the Wikipedia API and requesting RDF from servers.

The HTTPBasedSameAsRetriever asks DBpedia not for HTML but for RDF data. In that case, the DBpedia server behaves differently than it does for your browser.

We agree that from an academic point of view, sameAs and redirects are two different properties. However, in our use case we can handle both in the same way. Maybe we should rename the SameAsRetriever interface (and all the related classes) to make clear that we are not only aiming at sameAs links.

We totally agree that the redirect relationship is important and that there are entities in the datasets for which we will need this feature. But we thought it was clear that we had already developed the usage of redirects for the current version and that we have already extended it for version 1.2.2.

URIs with null

We disagree on both points.

  • We think that an entity inside a gold standard that has no URI attached to it has only one possible interpretation - there is an entity, but it does not fit any known URI. Every other interpretation would lead to the deletion of the entity from the gold standard, and this would have to be done before publishing it.
  • We define that an "empty String" is an empty URI. In GERBIL, an empty URI should behave exactly like the *null* URIs and contain the same information that I explained in my last post. Thus, they are valuable for the D2KB task.

We think the difference between our two positions is that you say we should delete (empty) URIs that cannot be mapped to DBpedia. But we think that these "notInKB" entities are important for the future development of annotators. From our point of view, the examples that you showed are not an argument against the "notInKB" idea but simply show that the gold standard itself is outdated and should be updated. As described in one of my former posts, we will have a small internal project that aims exactly at this problem. Thus, it should be fixed in the future.

@nopper, @quhfus what is your opinion about that?

Checking existence of Wikipedia article names

Using DBpedia's Virtuoso for checking the existence of Wikipedia entities would be possible. However, there are two arguments against it.

  1. The DBpedia update cycles are very slow. Thus, Wikipedia might have changed while DBpedia wouldn't have recognized this change yet. This would be a drawback for annotation systems that rely on the up-to-date Wikipedia as their KB.
  2. DBpedia URIs are translated into Wikipedia URIs and vice versa in version 1.2.2 (a rough sketch of this translation follows below). The DBpedia URIs are checked by using DBpedia. Thus, using the Wikipedia API is simply a parallel approach that helps us to avoid problems, e.g., a DBpedia that is not reachable.
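
As a small illustration of the second point, the translation between DBpedia and (English) Wikipedia URIs is essentially a namespace exchange; the following sketch is a simplification that ignores encoding corner cases and other language editions:

    public class UriTranslationSketch {

        // Simplified DBpedia -> Wikipedia translation (English edition only).
        static String dbpediaToWikipedia(String uri) {
            return uri.replace("http://dbpedia.org/resource/", "http://en.wikipedia.org/wiki/");
        }

        // Simplified Wikipedia -> DBpedia translation.
        static String wikipediaToDbpedia(String uri) {
            return uri.replace("https://en.wikipedia.org/wiki/", "http://dbpedia.org/resource/")
                      .replace("http://en.wikipedia.org/wiki/", "http://dbpedia.org/resource/");
        }

        public static void main(String[] args) {
            System.out.println(dbpediaToWikipedia("http://dbpedia.org/resource/The_Home_Depot"));
            System.out.println(wikipediaToDbpedia("http://en.wikipedia.org/wiki/The_Home_Depot"));
        }
    }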

A local Wikipedia would make the system faster. However, users running GERBIL locally would have to download the Wikipedia dump, which would be a clear drawback. Additionally, we would lose the up-to-date feature of the Wikipedia API. And as you already wrote, a user would have to download new dumps when they are published.

An extra problem: Chosen annotation as an ambiguous name

We totally agree with you. This would be an error in the gold standard.

Parsing entity queries on DBpedia - Virtuoso

When you use a query like "http://dbpedia.org/resource/entity_name", depending on the data from DBpedia, Virtuoso will generate the corresponding dynamic HTML as follows:
(1) If the entity exists, it returns the page with:

  • the entity name in <body onload="init();" about="http://dbpedia.org/resource/The_Home_Depot">, not in the title
  • a set of triples. Note that the relation rev="dbo:wikiPageRedirects" is wrong: "rev" instead of "rel". Maybe this error comes from the code of the JSP pages.

(2) If the entity does not exist, there are two sub-cases:

(2a) If the entity is a redirecting entity, it returns the redirected page --> retrieve the mapping redirecting entity --> entity

(2b) If the entity is not in DBpedia, it returns the error page, with the property part saying "No further information is available. (The requested entity is unknown)"

(3) If the entity is an ambiguous page, it returns a page with the triple "dbo:wikiPageDisambiguates"

Why do you only want to stick to route (1) without considering the other routes? All the routes are behaviours of Virtuoso. If the absence of the "academic triple" is the reason, I think it is not reasonable.

I have just written a class that checks all four routes on the triple-form returned page (not HTML). Testing MSNBC on the local 1.2.1, the F1-score is approximately that of 1.1.4.

URIs with null

I suggest that if we can find or create a "good" gold set for "notInKB", we use it.
IITB has nearly half of its annotations with an empty URI. If it is as outdated as you say, why not test only on the "updated" part of the gold set?
By the way, if we do not agree on this sub-topic, I will check the performance of the algorithms on my local GERBIL. But I always hope we can play on the same ground.

Checking existence of Wikipedia article names

If you would like to have a Wikipedia API service that can provide updated information, why not create a service for the community? It is not very difficult.
Needs:

  • 1 DB server
  • 1 hosting place

I can help parse the Wiki dump to extract the wiki title + ID, redirect info and ambiguous pages. Just import the extracted data into the DB server and write a service to export triples.

Time to process a dump: around 5 hours, depending on the computing power.

Parsing entity queries on DBpedia - Virtuoso

We understand your approach of interpreting the error pages of Virtuoso and parsing the HTML page that it generates. However, it seems that you misunderstood an important point. As described before, we would like to make GERBIL as knowledge-base-agnostic as possible. Thus, the solution you are suggesting has major disadvantages, because it relies on a certain behaviour of a single company's product (the way Virtuoso generates an HTML page for an entity and its error pages).

  • This behaviour might be changed in the future without being announced.
  • It is not possible to request information from other knowledge bases that are not using Virtuoso.
  • Different DBpedias might use different versions of Virtuoso that might behave differently.

Please note that - as described above - we have already implemented the routes you explained in another, more general way. Our approach relies on a standardized way of dereferencing entity URIs that is supported by other triple stores/knowledge bases as well and for which a future change is unlikely. In addition, we use the Wikipedia API, for which we receive mails announcing future changes, giving us the possibility to react before the changes are deployed. Thus, from our point of view there is no need to change our implementation.

Checking existence of Wikipedia article names

We won't create such a webservice because it already exists and - as said before - we are already using it. Please take a look at https://www.mediawiki.org/w/api.php

Parsing entity queries on DBpedia - Virtuoso

The routes are explained in their HTML form, but my current solution is based on the triples returned from Virtuoso, similar to HTTPBasedSameAsRetriever. Almost all routes are returned in triple form when you query DBpedia.
The current problem is that the HTTP-based service is slow when you want to run several tests.
For example, when testing the IITB dataset (with more than 100 files) on 4 algorithms, the dataset is loaded 4 times with 4 parallel checks. It takes a lot of time to calculate the results.

Checking existence

I took a look at MSNBCDataset and have not yet found your existence checking.

By the way, I have collected all the data needed for my paper (on D2KB and A2KB). I would like to take this chance to thank the GERBIL team, especially Michael, for his support.

Parsing entity queries on DBpedia - Virtuoso

You are right, the HTTP-based retrieval is slow at the moment. This is one of the drawbacks of these features.

In the current SNAPSHOT of 1.2.2, we fixed the problem that a dataset is loaded several times (issue #66). It is now loaded and checked only once. Additionally, we added caching for the sameAs retrieval as well as for the entity checking.

However, we have to be gentle to the knowledge base servers. Thus, in this new version GERBIL will make sure that it pauses between requests. Otherwise, we could get blocked because of rude, bot-like behaviour.
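
Just to illustrate the pattern (this is not GERBIL's actual implementation), such a pause can be enforced with a small helper that remembers the time of the last request:

    public class RequestThrottle {

        private final long minDelayMillis;
        private long lastRequest = 0;

        public RequestThrottle(long minDelayMillis) {
            this.minDelayMillis = minDelayMillis;
        }

        // Blocks until at least minDelayMillis have passed since the previous request.
        public synchronized void waitForTurn() throws InterruptedException {
            long wait = (lastRequest + minDelayMillis) - System.currentTimeMillis();
            if (wait > 0) {
                Thread.sleep(wait);
            }
            lastRequest = System.currentTimeMillis();
        }

        public static void main(String[] args) throws InterruptedException {
            RequestThrottle throttle = new RequestThrottle(500);
            for (int i = 0; i < 3; i++) {
                throttle.waitForTurn();
                System.out.println("request " + i + " at " + System.currentTimeMillis());
            }
        }
    }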

Checking existence

The entity checking is implemented in the version1.2.2 branch. Inside the class DatasetConfigurationImpl you can find this piece of code that performs the sameAs retrieval and afterwards the checking:

        List<Meaning> meanings;
        for (Document document : instance.getInstances()) {
            meanings = document.getMarkings(Meaning.class);
            // retrieve additional URIs (sameAs links, redirects, ...) for every meaning
            if (retriever != null) {
                for (Meaning meaning : document.getMarkings(Meaning.class)) {
                    retriever.addSameURIs(meaning.getUris());
                }
            }
            // check the meanings, i.e., entities whose URIs can not be found are
            // reclassified as emerging entities
            entityCheckerManager.checkMeanings(meanings);
        }

We have not released version 1.2.2 yet, because we want to make sure that we fix nearly all the problems that have been reported for 1.2.0 and 1.2.1.

Thanks! We are glad we could help. 😃

Hi all,

As far as I understand, in GERBIL 1.2.1 compared to 1.1.4, the performance drop of around 10-20% for most systems on most datasets is caused by the fact that the gold labels in the datasets use Wikipedia 2011-2012, while most of the annotators use Wikipedia 2014-2015. Thus, the following two situations happen on GERBIL's side:
a) a URI that was a valid entity in Wiki 2011 becomes a disambiguation/list page in Wiki 2014 and thus cannot be found by any annotator, while GERBIL expects it to be a valid entity.
b) a URI/Wiki ID that was valid in Wiki 2011 no longer exists in Wiki 2014.

Manual correction of gold labels:
You wrote that a potential solution to this problem is to employ a manual correction of these labels. However, can't this be done automatically in the following way (see the sketch after this list)?

  • map the URI of each gold label to its Wiki 2011 ID
  • check if the same Wiki ID exists in Wikipedia 2014/2015. AFAIK, Wiki IDs are meant not to be modified over time (which is not the case for URIs). If it exists, replace the gold label URI with the corresponding Wiki 2014 URI. If it does not exist, assign an outOfKB entity to it.
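
A rough sketch of the second step, under the assumption that the old URI-to-page-ID mapping is already available from the 2011 dump (the MediaWiki API call is real, but the JSON handling is left out; a "missing" page entry would mean the gold label becomes an outOfKB entity, otherwise the returned title gives the up-to-date URI):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class GoldLabelRefreshSketch {

        // Ask the current Wikipedia which page (if any) belongs to the given page ID.
        static String fetchPageInfoJson(long pageId) throws Exception {
            String api = "https://en.wikipedia.org/w/api.php?action=query&format=json&pageids=" + pageId;
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(api).openStream(), "UTF-8"));
            StringBuilder json = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
            in.close();
            // A real implementation would parse this JSON: a "missing" entry -> outOfKB,
            // otherwise the "title" field yields the updated gold URI.
            return json.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(fetchPageInfoJson(1234)); // arbitrary example page ID
        }
    }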

Wikipedia timestamp
I agree with you that it's ideal not to depend on a specific version of Wikipedia but, unless the Wiki ID is used, this is quite impossible with pure URI matching because of situations a) and b) described above (see also @nopper's example above). In different versions of Wikipedia, the same Wiki URI can point to different entities, disambiguation pages or no-longer-existing entities, and I don't understand how this can be solved without looking at the Wiki ID or Wiki version, solely based on URI matching.

In summary, the same URI can have different meanings in different KB versions. If GERBIL wants to use the latest version of the KB, then I believe it should make this clear in the URI (see the solution proposed below).

KB agnostic
What does KB-agnostic exactly mean? GERBIL asks annotators to return a DBpedia URI, so an annotator has to rely on one single specific KB (DBpedia in this case) when performing any of the GERBIL tasks. For me, KB-agnostic would mean that there is a unified representation of all KBs and of all their versions, i.e., owl:sameAs defines equivalence classes between different KBs and different versions of the same KB.

Solution proposed
I would suggest the following solution to this problem:

  • each URI of an entity should contain either the ID of that entity in the KB used by GERBIL (currently DBpedia) or the string title concatenated with the KB version used
  • for the same entity, all its corresponding URIs from different versions of the KB or from different KBs should be added to owl:sameAs

For example, assume that the entity Barack Obama has Wiki ID 1234, URI https://en.wikipedia.org/wiki/Barack_Obama in Wiki 2015, URI https://en.wikipedia.org/wiki/President in Wiki 2011 and URI https://www.freebase.com/m/02mjmr in Freebase. Then, owl:sameAs should contain the following URIs:

Thus, it would be very clear that the URIs https://en.wikipedia.org/wiki/v2011/President and https://en.wikipedia.org/wiki/v2015/President represent two distinct entities, which is not the case in the current Gerbil 1.2.1.

In this way, annotators can be ID-based or URI-based as they prefer, and each annotator can choose what version of the KB it uses, without being constrained to a specific version or a specific KB (as is currently the case with GERBIL, which requires DBpedia annotations).

The equivalence relation between different KBs and different versions of the same KB has to be established anyway, either using a heuristic approach or, more cleanly I would say, by just considering all URIs of the same entity in all the KBs used and all their versions.

Hi all,

I closed this issue since the discussion about v1.2.0 and v1.2.1 is a little bit outdated. The current behaviour of v1.2.2 and the newly released v1.2.3 is described in the wiki of this project or here.
If you have questions regarding this workflow, feel free to write a mail or open a new issue (if you think that more people might be interested).

Thanks for this discussion 😄

Cheers,
Michael