asclepias/asclepias-broker

crossref: scholix format returning fewer results


slint commented

The following queries against the CrossRef Event Data API (here for the Zenodo DOI prefix, 10.5281) report a different number of total results depending on the response format.

  • Investigate which kinds of links differ
  • If needed, refactor the harvester to use the non-Scholix endpoint

Example queries:

# Default response format from "/v1/events"
curl "https://api.eventdata.crossref.org/v1/events?obj-id.prefix=10.5281&relation-type=references&source=crossref"
{ ... "total-results": 2380, ... }

# Scholix format from "/v1/events/scholix"
curl "https://api.eventdata.crossref.org/v1/events/scholix?obj-id.prefix=10.5281&relation-type=references&source=crossref"
{ ... "total-results": 2280, ... }

In the Scholix case, although the payload reports "total-results": 2280, the number of results actually returned seems to be much lower: 257. I can confirm that every link returned by the Scholix endpoint also appears in the output of the events endpoint, so the only surplus is on the non-Scholix side.
As a consequence, there is a considerable number of events that we end up not harvesting.
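
For reference, a minimal sketch of how the number of results actually returned could be checked against the reported "total-results", by walking every page and counting the items. It assumes that the response wraps its data in a "message" object, that pagination uses a "next-cursor" value, and that the items live under "events" (or "link-packages" for Scholix). These key names are assumptions to verify against the API documentation. `fetch_page` is a hypothetical callable that performs the HTTP request for a given cursor and returns the decoded JSON.

```python
from typing import Callable, Optional


def count_returned_items(
    fetch_page: Callable[[Optional[str]], dict],
    items_key: str = "events",  # assumed key; "link-packages" for Scholix
    max_pages: int = 1000,      # safety limit so a bad cursor cannot loop forever
) -> int:
    """Count the items actually returned by following the cursor page by page."""
    total = 0
    cursor = None  # the first request is made without a cursor
    for _ in range(max_pages):
        message = fetch_page(cursor).get("message", {})
        total += len(message.get(items_key, []))
        cursor = message.get("next-cursor")  # assumed pagination key
        if not cursor:
            break
    return total


# Offline usage example with two fake pages instead of real HTTP calls.
fake_pages = {
    None: {"message": {"total-results": 5, "events": [{}, {}], "next-cursor": "a"}},
    "a": {"message": {"total-results": 5, "events": [{}], "next-cursor": None}},
}
returned = count_returned_items(lambda cursor: fake_pages[cursor])
```

In the fake data the reported total is 5 while only 3 items come back, which mirrors the kind of mismatch described above.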

At this point, I think we should proceed with refactoring the harvester to use the non-Scholix endpoint.
Example of an event that is missing from the Scholix results:

{
    "license": "https://doi.org/10.13003/CED-terms-of-use",
    "obj_id": "https://doi.org/10.5281/zenodo.153937",
    "source_token": "8676e950-8ac5-4074-8ac3-c0a18ada7e99",
    "occurred_at": "2016-09-19T00:00:00Z",
    "subj_id": "https://doi.org/10.12688/f1000research.9259.1",
    "id": "31871305-1a69-447b-82a0-d27cf1d14a00",
    "terms": "https://doi.org/10.13003/CED-terms-of-use",
    "message_action": "create",
    "source_id": "crossref",
    "timestamp": "2017-05-19T13:30:11Z",
    "relation_type_id": "references"
}
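
To find events like this one systematically, each event can be reduced to a (subject, object) pair and checked against the pairs seen in the Scholix output. The following is only a sketch: `link_pair` and `events_missing_from` are hypothetical helper names, the field names "subj_id" and "obj_id" are taken from the event above, and how the pairs are extracted from the Scholix payload is left open because its exact structure is not shown here.

```python
def link_pair(event: dict) -> tuple:
    """Reduce an event to a lower-cased (subject, object) identifier pair."""
    return (event["subj_id"].lower(), event["obj_id"].lower())


def events_missing_from(events: list, scholix_pairs: set) -> list:
    """Return the events whose (subject, object) pair is not in scholix_pairs."""
    return [e for e in events if link_pair(e) not in scholix_pairs]


# The event from this issue, trimmed to the fields used for the comparison.
event = {
    "subj_id": "https://doi.org/10.12688/f1000research.9259.1",
    "obj_id": "https://doi.org/10.5281/zenodo.153937",
    "relation_type_id": "references",
}

# With an empty set of Scholix pairs, the event is reported as missing.
missing = events_missing_from([event], scholix_pairs=set())
```

Lower-casing the identifiers is a design choice that avoids false mismatches, since DOIs are case-insensitive.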

Related PR #72

I think this should rather be reported to CrossRef as a bug (at least it looks like one). The right person to contact would be @afandian (Joe Wass).