TomDeneire/pictor

Europeana


There are currently two versions in data/harvester, one for the OAI-PMH server and one for the SPARQL endpoint, but neither works.

With OAI-PMH I got stuck at the point of getting old identifiers for the records and with SPARQL I'm just not able to locate IIIF information.

Should be possible though...

Mental note, I could reconnect with Jolan Wuyts (https://pro.europeana.eu/person/jolan-wuyts) about this.

Hi Tom! The SPARQL API is currently being updated; once that's done you should be able to use it for your goal. I can't help with your OAI-PMH issue, I'm afraid, but I can show you how to list all of the objects that have a IIIF manifest using the Search and Record APIs.

here's that call on europeana.eu: Link

here's that same call on the Search API: Link (add your API key)

you can use facets to get the individual URLs, like so: API call for the first 4000

to get the next batch after those first 4000, you can set f.[FACET_NAME].facet.offset to skip the results you already have.

This is a bit hacky, granted, but I think it's the quickest way to get IIIF manifests from the live data!

Hi Jolan, isn't GitHub a marvelous place. Love that you reach out to me like this 👍

I've had a brief look at the links you sent me and, correct me if I'm wrong, but I'm still missing a way to get the manifest URL.

For instance, if I look at the output of the search API, I see links to images like so:

{ label: "http://iiif.onb.ac.at/images/ANNO/zwb18580103/00000001/full/full/0/default.jpg", count: 2 }, { label: "http://images.icar-us.eu/iiif/2/img%2FAT-ADG%2FDKA%2F16-11-G.jpg/full/full/0/default.jpg", count: 2 },

but there is no obvious way to go from that (Image API: https://images.icar-us.eu/iiif/2/img%252FAT-ADG%252FDKA%252F16-11-G.jpg/info.json) to the manifest (Presentation API), which Pictor uses to harvest the metadata.
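The Image API half of that step can be sketched as below. A IIIF Image API request URL ends in the four segments region/size/rotation/quality.format, so dropping them and appending info.json gives the image information document. The function name is made up for illustration, and this only reaches info.json; it does not produce a Presentation API manifest.

```python
from urllib.parse import urlsplit, urlunsplit


def image_url_to_info_json(image_url: str) -> str:
    """Turn a IIIF Image API request URL into its info.json URL.

    Assumes the URL ends in region/size/rotation/quality.format,
    e.g. full/full/0/default.jpg. Percent-encoding in the identifier
    is left untouched.
    """
    parts = urlsplit(image_url)
    segments = parts.path.split("/")
    if len(segments) < 5:
        raise ValueError(f"not a IIIF Image API request URL: {image_url}")
    # Drop the last four segments (region, size, rotation, quality.format).
    base_path = "/".join(segments[:-4])
    return urlunsplit((parts.scheme, parts.netloc, base_path + "/info.json", "", ""))


info_url = image_url_to_info_json(
    "http://iiif.onb.ac.at/images/ANNO/zwb18580103/00000001/full/full/0/default.jpg"
)
```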

Or am I missing something? Thanks in advance!

Gosh, sorry it took me so long to get back to you, Tom! Here's some more info on how to retrieve manifests from our API:

Manifest links are stored in the 'dctermsIsReferencedBy' field. If we look at this random object we can find its manifest in that field:
https://api.europeana.eu/record/744/item_1276034.json?wskey=api2demo
here is the manifest found in that record:
https://digitalcollections.universiteitleiden.nl/iiif_manifest/item:1276034/manifest
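A rough sketch of pulling that field out of a parsed record response. Since the exact nesting of the Record API JSON is not shown in this thread, the helper below simply walks the whole tree looking for the 'dctermsIsReferencedBy' key. The helper name and the trimmed sample record are illustrative assumptions, not the real response.

```python
from typing import Any, Iterator


def find_values(node: Any, key: str) -> Iterator[str]:
    """Yield every string stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                # The field may hold a single string or a list of strings.
                values = value if isinstance(value, list) else [value]
                yield from (v for v in values if isinstance(v, str))
            else:
                yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Hypothetical, heavily trimmed record; the nesting is illustrative only.
sample_record = {
    "object": {
        "about": "/744/item_1276034",
        "aggregations": [
            {
                "webResources": [
                    {
                        "dctermsIsReferencedBy": [
                            "https://digitalcollections.universiteitleiden.nl/iiif_manifest/item:1276034/manifest"
                        ]
                    }
                ]
            }
        ],
    }
}

manifests = list(find_values(sample_record, "dctermsIsReferencedBy"))
```

In practice the dictionary would come from the record URL above, for example via `json.load` on the HTTP response.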

We can facet on the 'dctermsIsReferencedBy' field. According to the Search API documentation, that field is available in the Search API as the parameter 'wr_dcterms_isReferencedBy'; it's part of the Web Resource, hence the 'wr' prefix. Here's an API call asking to facet all items that are IIIF-enabled on that 'dctermsIsReferencedBy' field: https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=provider_aggregation_edm_isShownBy%3A(*iiif*)%20OR%20provider_aggregation_edm_isShownAt%3A(*iiif*)%20OR%20provider_aggregation_edm_object%3A(*iiif*)&rows=0&start=1&wskey=api2demo&f.provider_aggregation_edm_isShownBy.facet.limit=4000
and then you can use the same hack from my previous comment to get the next 4000 items. Hope this helps <3

I'd also like to add that we've recently added our own manifests to virtually every item that exists on europeana.eu! You can connect with the Europeana IIIF Manifest API by reading our IIIF API docs. To get the manifest for any Europeana record, enter its record ID (which you can find in the 'about' field) in the following URL structure:
https://iiif.europeana.eu/presentation/[RECORD_ID]/manifest
so the Europeana manifest for the example record I used above, with ID /744/item_1276034, would be https://iiif.europeana.eu/presentation/744/item_1276034/manifest. We also have manifests for items that aren't served to us as IIIF records themselves, e.g. for this record, https://www.europeana.eu/en/item/90402/SK_A_3262, we have the manifest https://iiif.europeana.eu/presentation/90402/SK_A_3262/manifest
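As a small sketch, that URL structure can be wrapped in a helper. The function name is an assumption for illustration; the URL pattern is the one given above.

```python
def europeana_manifest_url(record_id: str) -> str:
    """Build the Europeana manifest URL for a record ID such as '/744/item_1276034'."""
    # Strip the leading slash of the 'about' value so the URL has no double slash.
    return f"https://iiif.europeana.eu/presentation/{record_id.strip('/')}/manifest"


example_url = europeana_manifest_url("/744/item_1276034")
```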

Thanks for the info! Can't promise I'll look at it very soon, but it's definitely on my to-do list!

that's because I sent you a malformed API query, making me the noob in this scenario ;) the facet we want is wr_dcterms_isReferencedBy, but the facet limit and offset I specified are for a different facet altogether: f.provider_aggregation_edm_isShownBy.facet.limit=4000. That should be f.wr_dcterms_isReferencedBy.facet.limit=4000. So a correct API call that returns 4000 facet responses should be:

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=provider_aggregation_edm_isShownBy%3A(*iiif*)%20OR%20provider_aggregation_edm_isShownAt%3A(*iiif*)%20OR%20provider_aggregation_edm_object%3A(*iiif*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=4000&f.wr_dcterms_isReferencedBy.facet.offset=0

While looking at the results here it became clear to me that a lot of these results still seem to refer to IIIF Image API results and not only to IIIF manifests. So I edited the query to search the wr_dcterms_isReferencedBy field for the term 'manifest', instead of searching the provider_aggregation_edm_isShownBy field for the term 'iiif'.

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=4000&f.wr_dcterms_isReferencedBy.facet.offset=0
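To avoid mismatched facet names like the one in the earlier query, the URL can be built from a single facet name. This is a minimal sketch using only the parameters that appear in the calls in this thread; the function name and defaults are assumptions.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.europeana.eu/record/v2/search.json"
FACET = "wr_dcterms_isReferencedBy"


def facet_query_url(offset: int = 0, limit: int = 4000, wskey: str = "api2demo") -> str:
    """Build a Search API URL that facets on FACET for records whose FACET value contains 'manifest'."""
    params = {
        "facet": FACET,
        "profile": "facets",
        "query": f"{FACET}:(*manifest*)",
        "rows": 0,
        "start": 1,
        "wskey": wskey,
        # Limit and offset must name the same facet as the 'facet' parameter.
        f"f.{FACET}.facet.limit": limit,
        f"f.{FACET}.facet.offset": offset,
    }
    # Keep '(', ')' and '*' unescaped so the URL reads like the ones in this thread.
    return f"{SEARCH_URL}?{urlencode(params, safe='()*')}"


first_page_url = facet_query_url(offset=0, limit=4000)
```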

So far so good, these all seem to be IIIF manifests. Now, I tried getting the next 4000 by setting the offset to 4000 but got an error:

Error from server -: Error from server -: Expected mime type application/octet-stream but got application/json. {\n \"error\":{\n \"metadata\":[\n \"error-class\",\"org.apache.solr.common.SolrException\",\n \"root-error-class\",\"org.apache.solr.common.SolrException\"],\n \"msg\":\"application/x-www-form-urlencoded content length (2752118 bytes) exceeds upload limit of 2048 KB\",\n \"code\":400}}\n

probably because we're being too ambitious in requesting 4K facet results at a time, exceeding Solr's new data upload limit. So I decreased the facet limit to a more modest 2K, which circumvents this issue:

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=2000&f.wr_dcterms_isReferencedBy.facet.offset=4000

there you go! if you keep increasing the offset by 2K and harvesting the results you should be able to get through all 5 million manifests that way.

The saga continues!

I tried a 2K limit, then 1K, 500, all the way down to 10 facet results at a time, but even that runs into the upload limit:

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=10&f.wr_dcterms_isReferencedBy.facet.offset=7400

I'm not sure why so few facets at a time would cause this. I do see that it is a common thing that people run into with Solr, and that you could consider increasing the max limit:

https://learn-share.com/post/resolve-application-x-www-form-urlencoded-content-length-x-bytes-exceeds-upload-limit-of-2048-kb-issue-in-solr/

https://opensolr.com/faq/view/opensolr-wiki-q-a/97/Content-length-exceeds-upload-limit-of-2048-KB

One final consideration: it always seems to be at an offset of around 7000 that we hit this error. Maybe there's something anomalous in the data there?

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=2000&f.wr_dcterms_isReferencedBy.facet.offset=6000

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=250&f.wr_dcterms_isReferencedBy.facet.offset=7000

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=100&f.wr_dcterms_isReferencedBy.facet.offset=7200

https://api.europeana.eu/record/v2/search.json?facet=wr_dcterms_isReferencedBy&profile=facets&query=wr_dcterms_isReferencedBy%3A(*manifest*)&rows=0&start=1&wskey=api2demo&f.wr_dcterms_isReferencedBy.facet.limit=50&f.wr_dcterms_isReferencedBy.facet.offset=7200
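One way to pin down where the error starts is a binary search over the offset. This is only a sketch: it assumes the failures are monotonic (once an offset fails, every larger one fails too), which the thread has not confirmed, and the `works` callback stands in for a real request that returns True when the API answers without the Solr error. The threshold in the demo is invented.

```python
from typing import Callable


def first_failing_offset(works: Callable[[int], bool], low: int, high: int) -> int:
    """Binary-search the smallest failing offset in (low, high].

    Assumes works(low) is True, works(high) is False, and that failures
    are monotonic in the offset.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if works(mid):
            low = mid
        else:
            high = mid
    return high


# Demo with a made-up threshold; a real callback would call the Search API.
HYPOTHETICAL_THRESHOLD = 7123
found = first_failing_offset(lambda offset: offset < HYPOTHETICAL_THRESHOLD, 4000, 7400)
```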

And we've hit the point where my knowledge can't help you any further, I'm afraid. I will forward this thread to our API team at Europeana to get their input. Thank you for diligently documenting all of this together with me; this should make the API team's job a lot easier :)