karpetrosyan/hishel

How to implement manual cache invalidation?

Closed this issue · 2 comments

I am writing a tool that gets data from an API that always claims the responses are uncacheable, even though the data is mostly rather static. Using the code from #173 (comment), I rewrite the headers to still cache all responses, and this provides a tremendous speedup.
As said in #173, I am using code generated from the OpenAPI spec, so I don't have easy access to the single requests but am switching between a caching and a non-caching client that is an input parameter to the generated Python functions.

Now, I'm looking into cache invalidation. The API provides a way to get a list of data IDs that have changed since a certain point in time.

My current thinking goes like this (I'm using FileStorage):

  • I read a timestamp file in hishel's cache directory.
  • I record the current timestamp in hishel's cache directory.
  • I query the API for changed data (using the timestamp from the first step) using the non-caching client.
  • I remove the cache of the API endpoints that point to changed data (#241 will help here).
  • I hit all those API endpoints again using the caching client.
  • My business logic will see fresh data.

Does that sound workable, or is there an easier way even?
Can I somehow say "ignore the cached data for this API endpoint but write out new cache data"?

We have introduced force_caching for the controller class, you can use it instead of writing a custom transport as suggested in #173 (comment).

When force_caching is enabled, responses are invalidated only when the TTL of the storage expires. So yes, I think you can enable it and call some vacuum function that will invalidate all the stale responses and remove them using the API that will be introduced in #241.

Here is how it can look like for the filestorage

import hishel
import httpx

class AsyncVaccumFileStorage(hishel.AsyncFileStorage):

    async def vaccum(self):

        async with self._lock:
            # vaccum stuff here
            ...

storage = AsyncVaccumFileStorage()
cache_transport = hishel.CacheTransport(
    controller=hishel.Controller(force_cache=True),
    storage=storage,
)

await storage.vaccum()

Can I somehow say "ignore the cached data for this API endpoint but write out new cache data"?

Yes, you can write a custom controller to do that, like so:

from httpcore import Request, Response
import hishel
import httpx

from hishel._serializers import Metadata

class MyStorage(hishel.FileStorage):

    def store(self, key: str, response: Response, request: Request, metadata: Metadata | None = None) -> None:
        print('storing response', key, response)
        return super().store(key, response, request, metadata)

class IgnoreCacheController(hishel.Controller):
    def construct_response_from_cache(
        self, request: Request, response: Response, original_request: Request
    ) -> Response | Request | None:
        if request.extensions.get("ignore_cache"):
            return None

        return super().construct_response_from_cache(request, response, original_request)


client = httpx.Client(
    transport=hishel.CacheTransport(
        transport=httpx.HTTPTransport(),
        controller=IgnoreCacheController(),
        storage=MyStorage(),
    )
)

client.get("https://hishel.com")
response = client.get("https://hishel.com", extensions={"ignore_cache": True})
assert response.extensions["from_cache"] is False

Thank you very much for this detailed response. I have modernized my usage with the controller's force_cache now and am looking forward to the new remove method for storage.