pensoft/OpenBiodiv

Some TODO's for Iteration 2

pdatascience opened this issue · 8 comments

  • add rule to match authors with affiliation string (not only institution ID)
  • figure out why processing of BDJ stopped
  • expand to all Pensoft journals
  • expand to Plazi
  • improve logger with eventType and eventDate
  • zookeys email
  • matching of keywords - lookup problem -> look at the lookup problems below
  • NS troubleshooting when you load both fabio and foaf
  • commas, such as in University of California, Berkeley
  • matching statistics from the log
  • SPARQL authors that have been on multiple papers
  • SPARQL authors that have the same name, but different IDs
  • SPARQL specific authors such as Lyubomir Penev
  • SPARQL authors that are not members of any organization or have no emails or both

lookup-id changes:

  • instead of minting a new author id at the end of the author disambiguation function, try a final lookup inside the database with lookup_id (+ paper id) which will avoid duplication of authors for the same paper
  • lookup id with rdfs:label and import skos ontology
  • rename lookup_author and lookup_insititution to something else, such as disambiguate... or parse...; essentially, the lookup functions are functions that access the remote database, while the disambiguate functions do more complex work and use the local db as well
  • have a function to update the local db of authors instead of copying it twice
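
The first item in the list above (a final lookup before minting) could be sketched roughly as follows. This is a minimal Python illustration, not the repository's R code: the `lookup_id` signature, the in-memory `author_index` standing in for the database, the `get_or_mint_author` helper, and the `http://example.org/author/` base URI are all assumptions made for the example.

```python
import uuid

# Hypothetical in-memory stand-in for the database: maps
# (normalized label, paper id) to an author URI.
author_index = {}


def _key(label, paper_id):
    # Normalize the label so trivial differences in case or
    # surrounding whitespace do not cause a second URI to be minted.
    return (label.strip().lower(), paper_id)


def lookup_id(label, paper_id):
    """Return the URI stored for this label within this paper, or None."""
    return author_index.get(_key(label, paper_id))


def get_or_mint_author(label, paper_id, base="http://example.org/author/"):
    """Mint a new author URI only if a final lookup finds nothing."""
    uri = lookup_id(label, paper_id)
    if uri is None:
        uri = base + str(uuid.uuid4())
        author_index[_key(label, paper_id)] = uri
    return uri
```

Calling `get_or_mint_author` twice with the same label and paper id returns the same URI, which is the duplication this item is meant to avoid.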

fix ontology compliance

  • for example move the keywords over to ResearchPaper
  • lookup_authors does not work - does not update local database

for the lookup: if two results are linked with owl:sameAs, return a random one of the two

Not sure if this is the place to call attention to the frequently cited paper by Halpin et al., "When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data", and also Ding et al., "owl:sameAs and Linked Data: An Empirical Study".

I am well aware of this paper, but the issue here is somewhat different. At the moment, as I'm mostly done with the ontology, I am generating the Linked Open Data dataset with the help of the R scripts in this repository. Among other things, I have a function, lookup_id, that looks up a URI given the label of the resource. This is a reminder to myself to make the function robust, so that if there is an owl:sameAs between two URIs, the function returns only one of them.
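
One way to make such a lookup return a single URI is to group the candidates into owl:sameAs equivalence classes and keep one representative per class. The sketch below is in Python rather than the repository's R, and the function name `canonical_uris` and its inputs are assumptions for illustration. It also picks the lexicographically smallest URI instead of a random one, so that repeated lookups give the same answer.

```python
def canonical_uris(candidates, same_as_pairs):
    """Collapse candidate URIs linked by owl:sameAs to one representative each."""
    parent = {}

    def find(x):
        # Follow parent links to the root of x's equivalence class.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller URI as the root, so the
            # representative is deterministic rather than random.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    for a, b in same_as_pairs:
        union(a, b)
    return sorted({find(uri) for uri in candidates})
```

With a single owl:sameAs link between two of three candidates, the function returns two URIs: the representative of the linked pair and the unlinked one.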

More generally speaking, I do need to use owl:sameAs in certain scenarios where the resources are truly the same resource, for example the same person. One possible scenario would be where I disambiguate person URIs with the help of Pensoft's relational database.

We will do two passes. First, process the XMLs, create URIs for different people, and try to disambiguate based on the information found in the XML. Second, ask the relational database to provide additional details about the people (such as emails, etc.). Based on this new information, potentially merge nodes.
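
The second pass could look something like the Python sketch below. It assumes, purely for illustration, that two person URIs are candidates for merging when the relational database reports the same email for both. The `merge_by_email` name and the dictionary standing in for the database query result are hypothetical.

```python
from collections import defaultdict


def merge_by_email(person_uris, emails_by_uri):
    """Return owl:sameAs pairs for person URIs that share an email address."""
    by_email = defaultdict(list)
    for uri in person_uris:
        for email in emails_by_uri.get(uri, []):
            # Compare emails case-insensitively and ignore stray whitespace.
            by_email[email.strip().lower()].append(uri)

    pairs = set()
    for uris in by_email.values():
        first = uris[0]
        for other in uris[1:]:
            if other != first:
                pairs.add((first, other))
    return sorted(pairs)
```

The resulting pairs could then be asserted as owl:sameAs statements, or used to merge the nodes outright.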

Your goal is clear and worthy, but if I understand what you think you might do, it seems you would also depend on adding logic based on XML and DBMS semantics. If a third party wants to use your proposed OWL ontology to model an LOD data object, and if it hopes to participate in the above-mentioned scenarios, it seems it would also be required to provide the non-RDF semantics.

I am in the process of writing a paper about this ontology for the Journal of Biomedical Semantics, so yes, I will publish all of the semantics I use. I am not quite sure if I rely on external semantics. What I do is merge several data sources into OBKMS, and I use the ontology described here for this purpose. The user of the system is only working with the LOD that has been generated through this process. As I cannot guarantee that I will be able to merge all identifiers at the point of generating the dataset, I do allow owl:sameAs. The question then becomes what to do when looking up something that potentially has multiple identifiers. It is a cosmetic question really, because the following two solutions are equivalent:

Solution 1: Return both identifiers, and whatever assertions you make, make them on both.

Solution 2: Return one of the identifiers, and whatever assertions you make, make them on that one.

2 is, however, equivalent to 1, as OWL inference will copy your changes to the other.
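
The equivalence can be illustrated with a small Python sketch. It is a deliberately simplified model of owl:sameAs inference, not a reasoner: it only copies statements across equivalent subjects, and the `materialize_same_as` name and the tuple representation of triples are assumptions for the example.

```python
def materialize_same_as(triples, same_as_pairs):
    """Copy every statement about a subject to all URIs declared owl:sameAs it."""
    # Build equivalence classes (symmetric, transitive closure of the pairs).
    classes = {}
    for a, b in same_as_pairs:
        merged = classes.get(a, {a}) | classes.get(b, {b})
        for uri in merged:
            classes[uri] = merged

    result = set()
    for s, p, o in triples:
        # A subject with no owl:sameAs links is its own class.
        for equivalent in classes.get(s, {s}):
            result.add((equivalent, p, o))
    return result
```

Asserting a statement on both identifiers (Solution 1) and asserting it on just one (Solution 2) yield the same set of triples after materialization.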

I do, however, want to discuss this and other issues related to the usage of the system, rather than to the ontology, in a paper, and I will certainly keep you in the loop. I would be happy if we could collaborate on that!