Proposal: mechanism to clear-cache on downstream change
jacksontj opened this issue · 4 comments
As it stands today, trickster simply caches the responses of the downstream prometheus API. If the downstream's configuration changes in a way that alters the data behind it, trickster will continue to serve "stale" data. Here are 2 example cases that highlight the problem:
(1) remote_read on a single prometheus host
If the prometheus host were reconfigured to start pulling data from a remote_read endpoint, everything already in the trickster cache would be missing the "new" data from the remote_read endpoint.
(2) promxy downstreams change
As these systems scale they get more complex; a great example here is promxy. TLDR (for those that need context): promxy stitches the data of multiple prometheus hosts together (a single API endpoint, as well as "stitching" together timeseries with holes). So to set the scene:
- promxy is configured to talk to 2 prometheus hosts configured to scrape the same targets
- host 1 is missing data for a period, but promxy is stitching data with host 2 to fill the gaps
- if host 2 were to become unavailable (restart, host dies, etc.) then the data promxy returns would have "holes" in it
This is fundamentally a distributed caching problem -- since the source data isn't static. So instead of inventing a new solution, I propose we use HTTP cache headers. Specifically, I'm proposing that trickster support a mix of ETag and cache-control headers from the downstreams -- such that the downstream can provide (1) the cache TTL and (2) an ETag that indicates when the data has changed.
For example, in this world promxy could return an ETag which is a sha of the current configuration/state of promxy (downstreams, availability, etc.) combined with the query. This way, when the TTL expires, trickster can send a request with the If-None-Match field, which gives the downstream the opportunity to either (1) return 304 -- meaning the cache entry is still good or (2) return a fresh response.
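A minimal sketch of the downstream side of that exchange, in Go. This is not promxy's or trickster's actual code: `etagFor`, `queryHandler`, `statusFor`, the state fingerprint string and the `max-age=300` TTL are all hypothetical names and values chosen for illustration. The ETag is a SHA-256 of the state fingerprint plus the raw query, and a matching If-None-Match yields a 304.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// etagFor derives a quoted ETag from a fingerprint of the downstream's
// current state (hypothetical: e.g. config plus host availability) and the query.
func etagFor(stateFingerprint, query string) string {
	sum := sha256.Sum256([]byte(stateFingerprint + "\x00" + query))
	return `"` + hex.EncodeToString(sum[:]) + `"`
}

// queryHandler answers 304 when the caller's If-None-Match equals the current
// ETag, and a full response otherwise. Simplification: a complete
// implementation would also handle comma-separated ETag lists and "*".
func queryHandler(currentState func() string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		etag := etagFor(currentState(), r.URL.RawQuery)
		w.Header().Set("ETag", etag)
		w.Header().Set("Cache-Control", "max-age=300") // assumed TTL
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		fmt.Fprint(w, `{"status":"success"}`) // placeholder body
	})
}

// statusFor replays one request against h and returns the status code and ETag.
func statusFor(h http.Handler, query, ifNoneMatch string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/api/v1/query_range?"+query, nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("ETag")
}

func main() {
	state := "host1=up,host2=up"
	h := queryHandler(func() string { return state })

	code, etag := statusFor(h, "query=up", "")
	fmt.Println(code) // first fetch: full response

	code, _ = statusFor(h, "query=up", etag)
	fmt.Println(code) // state unchanged: revalidation succeeds

	state = "host1=up,host2=down"
	code, _ = statusFor(h, "query=up", etag)
	fmt.Println(code) // state changed: fresh response
}
```

The point of the design is that computing the ETag only needs the state fingerprint and the query string, so the 304 path never has to run the query itself.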
To be clear, this doesn't entirely remove the cache-discrepancy issue; it just provides a mechanism for the downstreams to control how stale the caches get.
Hey Thomas, I have an issue opened for much of this already (#143) but not in nearly as much detail as you've provided here! Thanks!
I'd like to start in 1.0 with a basic/generic reverse proxy object cache that respects all of the HTTP caching specifications, and when we have that fully vetted, we can move on to augmenting it to support evolving time series data cache management. So I think what that means is implementing #143 first, and then circling back to this issue once that part is completed. Does that sound OK to you?
One goal we have in Trickster is to support many different origin types, and not just Prometheus. We may actually launch 1.0 with support for as many as 4 origin types (including Prom, which will always be the gold standard for Trickster). With that in mind, I want to make sure that any patterns we design here, which adapt what the HTTP RFCs permit in order to support linear time series data, are done in such a way that they can be adopted easily by other solutions (e.g., possible promxy and thanos equivalents for those other origin types).
In the case of promxy, can I propose we work towards a more basic approach that should be easily instrumented by both Trickster and promxy? It works like this: If promxy knows that the data it is serving in a specific request actually has holes in it (because it knows it couldn't get results from one of the configured hosts), provide a Cache-Control: No-Cache header in your response. Then when we instrument the basic HTTP Caching in Trickster, it would serve the data section flagged as no-cache to the end user, but bypass storing it in the cache. In that way, that particular data section would continue to be requested fresh from promxy until the failed node is back up. Thoughts?
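The store-or-bypass decision described above could be sketched like this in Go. The function names `hasDirective` and `shouldStore` are hypothetical, not Trickster's API. Note one assumption made explicit in the code: under the HTTP caching RFCs, no-cache permits storing a response as long as it is revalidated before reuse, while no-store is the directive that forbids storing; this sketch follows the behavior proposed in the comment and skips the cache for both.

```go
package main

import (
	"fmt"
	"strings"
)

// hasDirective reports whether a Cache-Control header value contains the named
// directive, compared case-insensitively. Simplification: it splits on commas
// and ignores everything after '=', so quoted arguments are not parsed.
func hasDirective(cacheControl, name string) bool {
	for _, part := range strings.Split(cacheControl, ",") {
		token := strings.TrimSpace(part)
		if i := strings.IndexByte(token, '='); i >= 0 {
			token = token[:i]
		}
		if strings.EqualFold(token, name) {
			return true
		}
	}
	return false
}

// shouldStore follows the proposal in this thread: a response flagged
// no-cache (or no-store) is served to the client but not written to the cache.
func shouldStore(cacheControl string) bool {
	return !hasDirective(cacheControl, "no-cache") &&
		!hasDirective(cacheControl, "no-store")
}

func main() {
	for _, cc := range []string{"max-age=300", "No-Cache", "public, no-store", ""} {
		fmt.Printf("%q -> store=%v\n", cc, shouldStore(cc))
	}
}
```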
It sounds like we're on the same page. #143 should handle all the regular cache-control headers (so we can define TTLs etc.), and then the only addition in mine is to support ETag + If-None-Match requests. The mechanisms I'm describing are standard, RFC-compliant HTTP caching mechanisms -- so hopefully we can re-use code to do it :)
As for using No-Cache in the response on failure, I could do so -- but the concern I'd have is the amplification of traffic in failure. For any of these failure modes, if we want to clear the cache we'll require some increase in traffic to the downstream (since we'll have to fetch more data). In the case where the TTL goes from 1h -> 5m (let's say) it would be a 12x increase. If we went to a No-Cache header then it would potentially be significantly more (depending on incoming query volume) -- which might be more load than we want to throw at the downstream (especially since it is in some sort of failure). Having the ability to do shorter TTLs + If-None-Match should give the best mix of control, since the downstream could implement a "cheap" way to determine if the data has changed (presumably a mechanism that doesn't require actually fulfilling the query).
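The amplification arithmetic above can be made concrete with a rough back-of-the-envelope model in Go. It assumes a single cached query under steady traffic, where the cache refetches at most once per TTL; `originFetchesPerHour` and the 3600 requests/hour figure are hypothetical, chosen only to illustrate the comparison.

```go
package main

import (
	"fmt"
	"math"
)

// originFetchesPerHour estimates how often ONE cached query hits the
// downstream per hour. With a positive TTL the cache refetches at most once
// per TTL; with no TTL (nothing stored) every incoming request goes through.
func originFetchesPerHour(incomingPerHour, ttlMinutes float64) float64 {
	if ttlMinutes <= 0 {
		return incomingPerHour
	}
	return math.Min(incomingPerHour, 60/ttlMinutes)
}

func main() {
	const incoming = 3600 // assumed: one dashboard request per second

	fmt.Println(originFetchesPerHour(incoming, 60)) // 1h TTL
	fmt.Println(originFetchesPerHour(incoming, 5))  // 5m TTL, 12x the 1h case
	fmt.Println(originFetchesPerHour(incoming, 0))  // never stored: all traffic
}
```

Under this model the TTL bounds the load regardless of incoming volume, whereas an uncached response scales directly with it. With If-None-Match, the refetches that remain can additionally be answered with a 304 when nothing changed.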
So, once #143 is implemented, downstreams could at least set shorter TTLs for degraded data, and the If-None-Match support would allow for significantly shorter TTLs without causing large spikes in load downstream. I did some looking around, and something like https://github.com/gregjones/httpcache might be helpful (at least as an example to mimic); it is an implementation of HTTP caching as an http.Transport. I don't know if you'll want to use it (since I imagine it doesn't fit into the new design), but it looks (from my 30s of reading there ;) ) like it implements the caching properly.
So it sounds like this issue has now become "support If-None-Match requests" (dependent on #143)
@jacksontj check out the sweet new Cache Control abilities in Trickster 1.0 Beta 10! This includes support for ETag <-> If-None-Match and Last-Modified <-> If-Modified-Since revalidations, as well as Expires, Cache-Control, and the like, between Trickster and the Origin (be it Promxy or anything else). We also support these capabilities between the end client/dashboard server and Trickster. I will keep this issue open for now for further discussion, but hopefully this will meet your needs. Thanks again for the detailed request, and for your patience while we got all of the kinks worked out!
1.0 is now GA and we have not received additional activity on this issue since our previous comments. I will close this for now, but welcome it to be reopened if there are any issues with the cache control features of Trickster 1.0.