WordPress/openverse-catalog

SMK (statensmuseum) images taking too long to be validated

AetherUnbound opened this issue · 11 comments

Description

When we recently tried to enable the statensmuseum provider, searches that included results from that provider timed out on the Cloudflare -> API end. After some investigation, it appears that all the URLs we have for statensmuseum images take a significant time to return (even on HEAD requests). This causes validation to fail for all images returned in a result set, and I believe the API isn't able to return a response quickly enough for Cloudflare.

We are going to reach out to SMK to see if they can help us rectify this issue.

Reproduction

  1. Attempt a HEAD request: time curl -I https://iip.smk.dk/iiif/jp2/KKSgb5781.tif.jp2/full/max/0/default.jpg
  2. Observe that it takes longer than our timeout of 2 seconds
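
For anyone who prefers to reproduce this from Python rather than curl, here is a minimal sketch using only the standard library. The function name `time_head_request` and the injectable `opener` parameter are hypothetical (the latter only exists so the function can be exercised without network access); the 2 second default mirrors the timeout mentioned above.

```python
import time
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 2.0  # the validation timeout quoted in this issue


def time_head_request(url, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Send a HEAD request and return (elapsed_seconds, status).

    status is the HTTP status code, or None if the request timed out
    or failed at the network level.
    """
    request = urllib.request.Request(url, method="HEAD")
    start = time.monotonic()
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status (4xx/5xx).
        status = error.code
    except OSError:
        # Covers URLError, socket timeouts, and connection failures.
        status = None
    return time.monotonic() - start, status


if __name__ == "__main__":
    url = "https://iip.smk.dk/iiif/jp2/KKSgb5781.tif.jp2/full/max/0/default.jpg"
    elapsed, status = time_head_request(url)
    print(f"status={status} elapsed={elapsed:.2f}s")
```

A `status` of `None` with an elapsed time close to the timeout would match the behaviour described in step 2.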

Expectation

These HEAD requests should ideally be very fast, and we should be able to see Statens Museum results in the API.

Additional context

Resolution

  • 🙋 I would be interested in resolving this bug.

negon commented

The links to the full versions of our IIIF images are probably taking some time, as not all 50000 images are cached.
But far from all of our works have IIIF versions, and it's better to use the image_native and image_thumbnail fields instead.

I hope you can use the images. Are you downloading them locally to Openverse or linking to them?
Regards
Nikokaj from SMK

Thanks for the input @negon! We can definitely switch to the image_native and image_thumbnail fields.

For the small images we display on our search results page, for example https://wordpress.org/openverse/search/?q=byzantine, we generate and cache our own copies of the images. On single result pages, like https://wordpress.org/openverse/image/49ccf441-1c54-4a24-8df0-c76d40767303, we link directly to the source.

Are there particular sizes from your thumbnail endpoints that are most likely to be cached? One more question: how does the Open SMK site determine which images to use? For example, I see this page uses this URL:

https://iip-thumb.smk.dk/iiif/jp2/kksgb20980.tif.jp2/full/!2048,/0/default.jpg

while this page uses the following URL:

https://api.smk.dk/api/v1/thumbnail/13a529da-10a7-4837-ae06-66be94473036.jpg

negon commented

The issue with images is that there are HQ images and "legacy" images.
The HQ ones are also the ones that are IIIF enabled and therefore have an IIIF thumbnail link (iip-thumb.smk.dk), while the legacy images are taken directly from the collection database and have the non-IIIF thumbnail link (api.smk.dk).
Unfortunately, we can't control the legacy ones, and they vary widely in quality.

Ah, I see! Is image_hq the way to detect if an image is legacy then?

negon commented

yes!

@AetherUnbound Can you confirm this issue's prioritization?

Yes, I believe this is still high priority. We have Statens Museum data, but we're unable to use it because doing so slows the API responses.

Okay, should it be slated for a milestone then? The definition of high we have on the issue label description is that it is blocking something else. I suppose this blocks using the provider we've set up otherwise?

negon commented

Please say if we can help in any way.
@AetherUnbound do you mean that our data slows your API down?

I believe it's because of what you mentioned in the comment above, @negon, that the images aren't cached so the HEAD request times out.

We make HEAD requests to filter for "dead" images in our API, and it seems that the requests to validate the Statens Museum images take too long and exceed our timeout of 2 seconds.

https://github.com/WordPress/openverse-api/blob/a7955c86d43bff504e8d41454f68717d79dd3a44/api/catalog/api/utils/validate_images.py#L36-L39
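
To illustrate why slow HEAD responses remove a whole provider from the results, here is a rough sketch of this kind of dead-link filtering. It is not the code at the link above; `is_alive`, `filter_dead`, and the `"url"` key on each result are hypothetical names chosen for the example, and only the 2 second timeout comes from this thread.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

HEAD_TIMEOUT = 2.0  # seconds, as described in this issue


def is_alive(url, timeout=HEAD_TIMEOUT):
    """Return True if a HEAD request succeeds within the timeout."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Timeouts, connection errors, and HTTP error statuses all
        # count as "dead" in this sketch.
        return False


def filter_dead(results, check=is_alive, max_workers=8):
    """Keep only the results whose image URL passes the liveness check."""
    if not results:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so flags line up with results.
        alive_flags = list(pool.map(check, (r["url"] for r in results)))
    return [r for r, alive in zip(results, alive_flags) if alive]
```

With a check like this, a result set made up entirely of images that take longer than the timeout comes back empty, and the request still has to wait for those timeouts to elapse before the response can be built.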

Based on the information you shared with Zack, it seems like this is actually a catalog issue, not an API issue, as the API is not using the most-optimized (or even correct/existing) resource? Based on the code in our catalog we are unconditionally assuming that iiif versions exist for images, so that could be the issue here:

https://github.com/WordPress/openverse-catalog/blob/88322d2da852324f1147417ff8da332721a9002a/openverse_catalog/dags/providers/provider_api_scripts/staten_museum.py#L0-L1
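
As a starting point for the catalog change, here is a minimal sketch of selecting URLs from the fields @negon mentioned instead of always building an IIIF URL. It assumes each SMK record is a dict in which image_native and image_thumbnail hold URLs and image_hq is a boolean that is false or absent for legacy images; the function name `pick_image_urls` and the returned keys are hypothetical and not taken from the existing provider script.

```python
def pick_image_urls(item):
    """Choose image URLs for one SMK record without assuming IIIF exists.

    Returns None when the record has no usable full-size image.
    """
    image_url = item.get("image_native")
    if not image_url:
        # No full-size image to link to, so skip this record.
        return None
    return {
        "image_url": image_url,
        # May be None if the record has no thumbnail field.
        "thumbnail_url": item.get("image_thumbnail"),
        # Assumption: a missing or false image_hq marks a legacy image.
        "is_legacy": not item.get("image_hq", False),
    }
```

Records flagged as legacy could then be handled separately if their varying quality turns out to matter.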

All of that amounts to me thinking this is a catalog issue rather than API, so I'll transfer it. If I'm wrong @AetherUnbound and @zackkrida y'all can transfer it back with how it should be fixed in the API.

Yes, the work necessary in WordPress/openverse#1673 should fix this problem.