WordPress/openverse-api

Increase the dead links cache expiration timeout

obulat opened this issue · 7 comments

Problem

Currently, every front-end search results in requesting the provider image URL (the direct URL) twice: once for the API dead links filtering and once for the thumbnail generation.
The thumbnails are cached for a period of around a year (to be confirmed), while the link validation cache expires after 24 hours for active links and 120 days for broken links. This means we're re-requesting popular valid images every 24 hours. We could increase the cache expiration timeout to reduce the number of network requests to the providers (which have often responded with 429 Too Many Requests lately).

Description

We could set the cache expiration timeout for active links based on an environment variable to make it easily tweakable.
I think increasing the active link cache TTL to one week could be a good balance between keeping the link status up-to-date and reducing the number of requests. What do you think about the timeout period, @openverse-team?
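As a rough illustration of the environment-variable idea, here is a minimal sketch. The variable name pattern `LINK_VALIDATION_CACHE_TTL_<status>`, the function name, and the 24-hour fallback for other statuses are assumptions made for this example, not the API's actual settings; the one-week and 120-day defaults come from the numbers discussed above.

```python
import os

DAY = 60 * 60 * 24

# Hypothetical defaults: one week for live links (the proposal above),
# 120 days for broken links (the current value mentioned above).
DEFAULT_TTLS = {
    200: 7 * DAY,
    404: 120 * DAY,
}
# Assumed fallback for any other status code.
FALLBACK_TTL = DAY


def get_cache_ttl(status, environ=None):
    """Return the cache TTL in seconds for a link check status code.

    An environment variable named LINK_VALIDATION_CACHE_TTL_<status>
    (a made-up name for this sketch) overrides the default.
    """
    environ = os.environ if environ is None else environ
    default = DEFAULT_TTLS.get(status, FALLBACK_TTL)
    raw = environ.get(f"LINK_VALIDATION_CACHE_TTL_{status}")
    return int(raw) if raw else default
```

With something like this, changing the live-link TTL would only need a new environment value and a redeploy, with no code change.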

Alternatives

Another option for reducing the number of requests to providers could be to request the thumbnail endpoint first, and if the response headers show that the response comes from the Cloudflare cache, consider the link not dead. If, however, the response is coming from the proxy and does not have a 200 status, then we can consider the link dead.
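A minimal sketch of that decision, assuming Cloudflare marks cached responses with the `CF-Cache-Status: HIT` response header. The function name is hypothetical, and the sketch only classifies a response that has already been fetched:

```python
def is_link_live(status_code, headers):
    """Classify a thumbnail endpoint response as live (True) or dead (False).

    Assumption: a response served from the Cloudflare cache carries
    the header "CF-Cache-Status: HIT".
    """
    cache_status = ""
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "cf-cache-status":
            cache_status = value.strip().upper()

    if cache_status == "HIT":
        # Served from the Cloudflare cache: consider the link not dead.
        return True
    # Served by the proxy: only a 200 counts as live.
    return status_code == 200
```

Note that a cache hit says the image was reachable when it was cached, not necessarily that it is reachable now.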

Implementation

  • 🙋 I would be interested in implementing this feature.

Another option for reducing the number of requests to providers could be to request the thumbnail endpoint first, and if the response headers show that the response comes from the Cloudflare cache, consider the link not dead. If, however, the response is coming from the proxy and does not have a 200 status, then we can consider the link dead.

Just a heads up: if we did this, the API would need to create an access token for itself to use. Otherwise the API would be rate limiting itself 🙂

One year is correct for the thumbnails:

The cache-control max age is 31536000 seconds, or 365 days (31536000 s / 60 / 60 / 24 = 365).

Been thinking about this a bit. Have some thoughts and complications:

  1. Thumbnail proxy does not solve the problem for audio, but the same process is followed for audio (requesting upstream url and using response status code to determine liveness of the result). Audio does use the thumbnail proxy but for the artwork. Audio link liveness check uses the actual result url, not the upstream thumbnail location on the audio result.
  2. If we're relying on Cloudflare's caching, are we accepting that a year is a reasonable amount of time to assume that a result is still valid after checking it? If so, why not just increase the expiration in our existing validation cache to 1 year and skip the network request entirely?
  3. If we use thumbnail proxy then we're adding some potentially non-trivial latency to the validation check:
    1. Send request to thumbnail proxy -> hits Cloudflare -> hits the API -> redirect to the internal thumbnail service -> request goes upstream to the result url -> thumbnail proxy processes and returns the response -> now we have the status code we wanted.
    2. Currently the only process is: request goes to upstream url -> now we have the status code we wanted.
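To make the comparison concrete, the current direct check combined with the validation cache from point 2 can be sketched roughly as follows. This is an illustration only: `check_link`, the injected `fetch_status` callable, and the plain dict standing in for the cache are all assumptions, not the API's actual code.

```python
import time


def check_link(url, fetch_status, cache, ttl_for, now=time.time):
    """Return the status code for url, skipping the network on a fresh cache hit.

    fetch_status: callable(url) -> int, the single upstream request.
    cache: dict mapping url -> (status, expires_at); a stand-in for the real cache.
    ttl_for: callable(status) -> seconds to keep that status cached.
    """
    entry = cache.get(url)
    if entry is not None and entry[1] > now():
        # Cached and not expired: no request to the provider at all.
        return entry[0]

    # One hop: request goes to the upstream url, and we have the status code.
    status = fetch_status(url)
    cache[url] = (status, now() + ttl_for(status))
    return status
```

A longer TTL in `ttl_for` means more lookups end at the cache branch, which is the whole effect being proposed here.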

I believe that increasing the TTL of the 'live' links to 1 week (or 1 month?), together with WordPress/openverse#685 and further work on the catalog side, would be a better solution than using the Cloudflare cache.

The thumbnail cache would probably not allow us to fully develop WordPress/openverse#685 because we wouldn't get the status code of the actual request.

Sounds great Olga. If we could make them configurable as environment variables that would be excellent too and allow us to tweak things if we notice issues.

Additionally, if we do this, I think we should also update the frontend to more explicitly report when dead links are found on the frontend itself. This will reveal the rate of discrepancy: https://github.com/WordPress/openverse-frontend/blob/f3f03b4039035f8e7887cde99460d7a45fd6b5d1/src/pages/image/_id.vue#L154-L166

Hmm, would there be a reason to add "broken image reporting" to the API reporting endpoint? Perhaps our frontend code could call that endpoint in the part of the code Sara linked.

We'd also have the opportunity to let users report images that appear broken via the frontend form. I don't know if that would be used much, though, or if it would just result in a lot of duplication with the automated broken image reporting.

Something to also keep in mind here, we've discussed in the past using the full thumbnails on the single frontend results instead of hotlinking to the source images, which would do many things:

  • Reduce traffic to sources
  • Obscure broken images more (the user wouldn't know it's broken until visiting the source)
    • Maybe we'd also want to do a head request clientside on the foreign landing url and report those failures to the API
  • Reduce single result page load times and size

Hmm, would there be a reason to add "broken image reporting" to the API reporting endpoint?

Maybe... but we don't handle the information gathered from those endpoints well at all now. Adding new data there may be counterproductive. I'm not sure. I mostly want us to be able to query the ratio of links we reported as live (returned from the API) that then end up not working on the client side. I don't know how to do this safely though.