RFC: caching descriptors
fbs opened this issue · 8 comments
background
In our setup each team has its own set of Google projects for its various applications. Each project is 'isolated' and thus runs its own stackdriver_exporter. To reduce the number of metrics (and the cost) they usually configure a fairly specific set of prefixes, so instead of:
pubsub.googleapis.com/subscription
we have:
- pubsub.googleapis.com/subscription/seek_request_count
- pubsub.googleapis.com/subscription/sent_message_count
- pubsub.googleapis.com/subscription/num_outstanding_messages
- pubsub.googleapis.com/subscription/num_retained_acked_messages
The core logic of the exporter seems to be:
```
for prefix in prefixes:
    descriptors = get_descriptors(prefix)   # one API call per prefix
    for descriptor in descriptors:
        get_metrics(descriptor)             # one API call per descriptor
```
Due to the specific prefixes the number of 'get_descriptors' calls is almost equal to the number of 'get_metrics' calls. As Google bills per API call, nearly half our costs are descriptor calls. Afaik the descriptors for Google services are static, so it feels like a bit of a waste.
proposal
Add a user-configurable prefix -> []*monitoring.MetricDescriptor cache to reduce 'useless' descriptor calls and the cost of running the exporter.
The cache itself can be simple. If the cache has expired and two requests come in at the same time, it's ok if they both refresh the cache, to avoid blocking. The last one to update the cache will 'win'.
The cache should be disabled by default and can be enabled with a simple flag:
--monitoring.descriptor-cache-ttl=30m
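To make the idea concrete, here is a minimal sketch of such a cache in Go. The names (descriptorCache, Lookup, Store) are illustrative and not taken from the POC. Reads never block on a refresh: if two scrapes both see an expired entry, both list the descriptors and the last Store wins; a TTL of zero keeps the cache disabled.

```go
// Sketch only: hypothetical names, not the exporter's actual implementation.
package collectors

import (
	"sync"
	"time"

	"google.golang.org/api/monitoring/v3"
)

type descriptorCacheEntry struct {
	descriptors []*monitoring.MetricDescriptor
	expires     time.Time
}

// descriptorCache maps a metric prefix to its descriptors for a fixed TTL.
// A TTL of zero disables caching entirely (the proposed default).
type descriptorCache struct {
	ttl     time.Duration
	mu      sync.RWMutex
	entries map[string]descriptorCacheEntry
}

func newDescriptorCache(ttl time.Duration) *descriptorCache {
	return &descriptorCache{ttl: ttl, entries: map[string]descriptorCacheEntry{}}
}

// Lookup returns the cached descriptors for prefix, or nil if the entry is
// missing or expired.
func (c *descriptorCache) Lookup(prefix string) []*monitoring.MetricDescriptor {
	if c.ttl <= 0 {
		return nil
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	entry, ok := c.entries[prefix]
	if !ok || time.Now().After(entry.expires) {
		return nil
	}
	return entry.descriptors
}

// Store overwrites whatever is cached for prefix; the last writer wins.
func (c *descriptorCache) Store(prefix string, descriptors []*monitoring.MetricDescriptor) {
	if c.ttl <= 0 {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[prefix] = descriptorCacheEntry{
		descriptors: descriptors,
		expires:     time.Now().Add(c.ttl),
	}
}
```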
Questions
- should the cache be limited to only *.googleapis.com, or should it include custom descriptors as well?
POC
Created a basic implementation to test in our landscape; it can be found here
Yea, that seems like a good idea to me. What would be the downside of, say, a 24h cache TTL?
For metrics from Google services I can't think of any, as afaik they rarely change.
As we only 'consume' Google-provided metrics I'm not familiar with other use cases. There is a way to provide your own timeseries and descriptors; in that case a stale cache might be more likely. Hence the open question about limiting the cache.
Sounds reasonable. Maybe a 1h default is fine.
I did the same thing before. In order to reduce the number of Ali Cloud API calls, I cached (for 1h) the metadata (all metrics under the current namespace + all instances under the current namespace) used by each repeated request (every 1m). Later, when metadata synchronization went wrong (network or other unexpected problems, so the sync failed or data was missing), my monitoring data was wrong for up to an hour.
To prevent such a situation from happening again, I now choose to synchronize the source data every time, to ensure that only the monitoring data within the current period (1m) is affected when an exception occurs.
I would rather pay extra API fees to ensure the availability of the monitoring data :)
Does this apply to Google Cloud too? If the API call fails there is nothing to cache; next time it will just try to list the descriptors again.
If Google silently returns only partial data the cache will indeed be invalid, but I do wonder whether that ever happens. Guess we could write it down as a potential risk?
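To make that behaviour concrete, here is a sketch of the lookup path, continuing the hypothetical names from the cache sketch above (MonitoringCollector and listMetricDescriptors stand in for the exporter's real collector). Only successful listings are stored, so a failed call is never cached and the next scrape simply hits the API again; the silently-partial-response case remains the open risk.

```go
// Sketch only: stand-ins for the exporter's real collector, same package as
// the descriptorCache sketch above.
type MonitoringCollector struct {
	descriptorCache       *descriptorCache
	listMetricDescriptors func(prefix string) ([]*monitoring.MetricDescriptor, error)
}

func (c *MonitoringCollector) getDescriptors(prefix string) ([]*monitoring.MetricDescriptor, error) {
	if cached := c.descriptorCache.Lookup(prefix); cached != nil {
		return cached, nil // cache hit: no API call
	}
	descriptors, err := c.listMetricDescriptors(prefix) // the plain API call, as today
	if err != nil {
		return nil, err // failures are not cached, so the next scrape just retries
	}
	// Note: a silently partial (but successful) response would be cached too;
	// that is the risk discussed above.
	c.descriptorCache.Store(prefix, descriptors)
	return descriptors, nil
}
```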
hi @fbs
The policy I'm currently using with our cloud exporter is to re-fetch the metadata in each synchronization period, to prevent a 'dirty cache' from the previous one.
But in our case the exporter is also synchronizing the full set of instances of each product in addition to the list of metrics (for example, thousands of ecs/gce/cvm, etc., with label associations), and I think the synchronization problems mostly come from the latter.
I don't think you need to worry too much if you only synchronize metric metadata as you currently do. As long as the cache duration is not too long (1d/1w), it should not be a problem.
ah yeah that's fair.
For our use case I was thinking about a 30/60m cache duration; it's already a significant cost reduction and we won't have to worry too much about stale cache issues.
actually starting to doubt now whether these specific API calls are billed :/. Thought they were billed as a read call, but I'm not seeing it. Fewer API calls still has some other benefits, but not the huge impact I hoped for.
Building a cache to reduce the impact of the HA setup might be more worthwhile.