ben-manes/caffeine

cache performing worse than LRU

phraktle opened this issue · 42 comments

Hi @ben-manes,

I have created a trace from a production system where I see Caffeine underperforming a simple LRU strategy (with a cache size of 100k items, Caffeine hit rate is 72.35% vs 75.91% with LRU): scarab-recs.trace.gz

The format is a series of longs (can be decoded with DataInputStream#readLong)
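For anyone else who wants to replay it, a minimal sketch of a reader (the file name and the replay step are placeholders):

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

public final class TraceReader {
  public static void main(String[] args) throws Exception {
    long count = 0;
    try (DataInputStream input = new DataInputStream(new GZIPInputStream(
        new FileInputStream("scarab-recs.trace.gz")))) {
      while (true) {
        long key = input.readLong(); // one big-endian 8-byte key per access
        count++;                     // replay `key` against a cache here
      }
    } catch (EOFException endOfTrace) {
      // readLong throws EOFException once the trace is exhausted
    }
    System.out.println("accesses: " + count);
  }
}
```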

Any thoughts on whether this could be improved upon?

Thanks,
Viktor

Thanks! Did you already integrate this into the simulator or should I go ahead and do that?

I will submit a pull request shortly.

Perfect! If we decide to check in the trace then we should use xz, which reduces the file from 7.1 MB to 3.1 MB.

The results are interesting.

Policy Hit rate
ARC 76.09
Lirs 74.30
Lfu 73.95
Lru 75.91
Lru_TinyLfu 67.02
Clairvoyant 79.25
Unbounded 79.25
WindowTinyLfu 71.57
AdaptiveWindowTinyLfu 72.08

It seems that frequency is a negative signal in this workload. That's surprising given how large this cache is, as usually recency has little effect past a short duration and then frequency plays the major role. The adaptive window has a maximum, so its gain is slight. The optimal window size is 100%, meaning plain LRU is the best we could achieve: none of the policies show an ability to improve upon it.

If we decrease the cache's size to 5k or 10k then frequency has a positive impact, increasing from 41.20 (LRU) to 46.54 (W-TinyLFU) to 47.52 (adaptive). But at a larger size it's a disadvantage.

Interesting indeed. The workload is product recommendation lookups for several hundred e-commerce sites of varying sizes worldwide. I don't currently have an intuition as to why frequency would be a negative signal here (I would have assumed popular products should cache well based on frequency).

Any options to configure Caffeine to use an LRU policy?

Not presently, as ideally users shouldn't need to worry about the implementation details. I'm hesitant to expose a setting.

I think our goal is to have a policy that is the best (or very close to it) in all workloads. So being 3-4% worse here, while 20-30% better in other traces, is an okay trade-off.

I don't have any good ideas, yet, for how to optimize for this peculiar trace. It would be nice to shrink that gap.

Do you have a prior filtering scheme before entries go into the cache? e.g. ElasticSearch uses an LRU but only allows candidates that have passed a heuristic. The intent seems to be to only let frequently used entries arrive at the cache.

At that point, perhaps, frequency no longer provides a strong signal so only recency is beneficial. The hope of a smarter policy like TinyLFU is that those heuristics could be removed.
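For illustration only (not Elasticsearch's actual heuristic, nor anything Caffeine does), a pre-filter of that style might look like an LRU that only admits a key after it has been requested at least twice, so one-hit wonders never displace resident entries:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: an LRU cache guarded by a simple admission
// heuristic that only admits keys that were requested at least twice.
final class PrefilteredLruCache<K, V> {
  private final Map<K, V> lru;
  private final Map<K, Integer> seenCounts = new HashMap<>();

  PrefilteredLruCache(int maximumSize) {
    this.lru = new LinkedHashMap<K, V>(16, 0.75f, /* accessOrder */ true) {
      @Override protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maximumSize;
      }
    };
  }

  V get(K key) {
    return lru.get(key);
  }

  void put(K key, V value) {
    int count = seenCounts.merge(key, 1, Integer::sum);
    if (count >= 2) {       // admission heuristic: require a repeat request
      lru.put(key, value);
      seenCounts.remove(key);
    }
    // A real implementation would also bound or age the seenCounts map.
  }
}
```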

There's no prior filtering or heuristic, it's just a straightforward LoadingCache.

This is very interesting.

Ben, how large is the window of W-TinyLFU? Perhaps as a workaround it can be configured with a large window cache and a small LFU cache (are these configuration parameters exposed?). With 99% LRU and 1% LFU, for example, it might work quite nicely and Viktor could use Caffeine with the improved hit rate.

Thanks for the trace Viktor, we'll definitely try to come up with something.

@gilga1983 The window is 1% and is not configurable. Hopefully we can find improvements and not need to expose the policy's implementation details to users. That gives us the most flexibility to make changes.

@phraktle Can you acquire a longer trace?

An interesting aspect is that your trace length is ~10x the cache size, which is also the sample period for TinyLFU before it resets. If I perform 3 runs of the trace in sequence, then I've artificially created frequency and W-TinyLFU outperforms LRU. But in reality new items would probably show up. A trace that is 100x the cache size or more might be helpful.

My hope is that frequency does play a role long term and we can argue this is poor warm-up, but over a long period it evens out to be a gain.
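To make the sample-period reset concrete, here is a rough sketch of the aging idea (Caffeine's real frequency sketch is a compact probabilistic structure, not a per-key map; the 10x factor is taken from the discussion above):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch only: count accesses per key and halve all counters once the
// number of observed accesses reaches a sample period tied to the cache size.
final class AgingFrequencySketch<K> {
  private final Map<K, Integer> counts = new HashMap<>();
  private final int samplePeriod;   // e.g. 10 * maximumSize
  private int observed;

  AgingFrequencySketch(int maximumSize) {
    this.samplePeriod = 10 * maximumSize;
  }

  void record(K key) {
    counts.merge(key, 1, Integer::sum);
    if (++observed >= samplePeriod) {
      reset();
    }
  }

  int frequency(K key) {
    return counts.getOrDefault(key, 0);
  }

  private void reset() {
    // Halve every counter so older history decays instead of being discarded
    counts.replaceAll((key, count) -> count / 2);
    counts.values().removeIf(count -> count == 0);
    observed /= 2;
  }
}
```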

To highlight why I'm asking, if I run the trace 6 times (7.5M requests) in sequence @ 100k,

Policy Hit rate
ARC 76.96
Lirs 76.73
Lru 77.23
Lru_TinyLfu 75.59
Clairvoyant 85.94
Caffeine 82.48
WindowTinyLfu 80.47

So in this artificial run, TinyLFU is able to extract frequency and the window handles recency. Caffeine differs from the theoretical algorithm, e.g. it adds HashDoS protection, and in this case that difference is visible (usually it only shows up as noise).

Here's a longer trace for the same database (~1 hour period): http://download.scarabresearch.com/cache-traces/recs.trace.20160808T073231Z.xz

(There's also a trace of accesses to a somewhat different, but related database for the same timeframe: http://download.scarabresearch.com/cache-traces/prods.trace.20160808T073231Z.xz)

That's great, thanks! My hypothesis didn't hold, but the differences are smaller.

recs @25k @50k @75k @100k
Lru 67.71 75.49 79.42 81.77
Caffeine 69.03 74.63 77.51 79.38
Clairvoyant 80.55 85.06 86.98 88.03
pods @25k @50k @75k @100k
Lru 70.54 78.85 83.25 86.12
Caffeine 75.66 81.65 84.75 86.75
Clairvoyant 83.13 88.44 91.00 92.52

Indeed, the differences are not dramatic – depending on the time period / exact workload, Caffeine tends to be less than 5% worse than an LRU strategy on this workload.

I hope the trace gives you some ideas for potential improvements, or perhaps justifies adding an option to switch the strategy to plain LRU. But if not, I'm okay with closing this issue :)

We're working on something at the Technion toward adapting the policy to these kinds of traces as well, but it is research and even if successful it will take a while.

My benchmark runs of different eviction algorithms also showed performance lower than LRU on a database access trace (see: "OrmAccessBusytime Trace"). The graphs show that Caffeine actually performs worse than random eviction for small cache sizes. Since there are more traces like that, maybe it's worth digging into.

I ran the recs trace against cache2k and others today. Algorithms prefixed with "S/" use the respective Caffeine simulator implementation. The implementations of Clock, ARC, CAR and RAND use the code from the cache2k-benchmarks:

Hit rates for Scarab Research recs trace with different cache implementations

Interestingly, the trace seems to favor cache2k.

This is all about the dynamic between recency and frequency. Some policies are recency skewed and try to compensate for frequency workloads (cache2k w/ clockpro). Other policies are frequency skewed and try to compensate for recency workloads (caffeine w/ tinylfu).

For small caches under a recency-skewed workload, a frequency-biased policy is hit the hardest. Fortunately increasing the cache size isn't too detrimental, so the cost is small in these worst cases. Unfortunately, recency-biased policies underperform on large traces, which are more typically frequency-biased and where it may be less affordable to increase the cache size. Those cases include analytical, search, and database applications which are already memory-intensive. When analyzing those traces we see Caffeine significantly outperform Cache2k.

The goal then is for each algorithm to try and compensate for its weaknesses, and strike an overall balance that best optimizes system performance. For TinyLFU this weakness was discussed above and is described in the research paper. The best way to compensate, dynamically, is still open for debate.

  • Bloom filter feedback policy adjusts the admission window based on mispredictions. This corrects for small traces, allowing it to slightly outperform cache2k in orm-busy.trace. But the feedback fades away as the cache size grows, because it assumes large caches are frequency-skewed, which is not the case in the Scarab traces.
  • Hill climbing policies are still very much a prototype and not yet finished. The idea here is to detect the curve for different admission window sizes, climb to the optimal value, and restart when the workload changes. There are a few variants of this idea and it should work on any cache size, but it has not yet been fully fleshed out.
  • Adjusting the counter increments is another approach, for example using logarithmic counters with an admission boost (e.g. see Redis LFU); a rough sketch follows this list. This would give new arrivals a higher chance of admission, so perhaps dynamically adjusting the boost size would compensate.
  • Researchers at the Technion are looking at an LRFU inspired admission filter. This would mean the filter itself takes into account recency and frequency when making the admission choice.
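To illustrate the logarithmic-counter idea from the bullet above (loosely modeled on Redis's LFU counter; the constants and names here are made up):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a logarithmic counter: increments become less likely as the
// counter grows, and new arrivals start with a small boost so they are not
// immediately passed over. Constants are illustrative only.
final class LogarithmicCounter {
  private static final int INITIAL_BOOST = 5;   // "admission boost" for new keys
  private static final double LOG_FACTOR = 10;  // larger => slower saturation
  private static final int MAX_COUNT = 255;     // fits in a byte

  private int count = INITIAL_BOOST;

  void increment() {
    if (count >= MAX_COUNT) {
      return;
    }
    double base = Math.max(0, count - INITIAL_BOOST);
    double probability = 1.0 / ((base * LOG_FACTOR) + 1);
    if (ThreadLocalRandom.current().nextDouble() < probability) {
      count++;
    }
  }

  int frequency() {
    return count;
  }
}
```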

So just like with sorting algorithms, each cache policy is best in certain circumstances. The more advanced sorts are in fact hybrids that adapt their scheme based on heuristics. They try to be the most well-rounded, though perhaps not optimal in every scenario. We're trying to do the same with caching, where more experimentation is needed to discover the best heuristics.

Simulated Annealing, a hill climbing algorithm, shows good results. In the recs trace it corrects for recency and adapts to match LRU's hit rate. The pods trace shows only slight differences, since it was already friendly to W-TinyLFU. Similar improvements are observed in other recency skewed traces, such as the orm-busy trace mentioned earlier.

I think there is still some fine tuning left, which I hope @gilga1983 can help out on. Once we're satisfied then adding the improvements to Caffeine should be cheap and straightforward. Thanks for the patience @phraktle.

recs @25k @50k @75k @100k
Lru 67.71 75.49 79.42 81.77
WindowTinyLfu 68.83 73.76 75.07 77.52
AnnealingWindowTinyLfu 69.41 76.20 79.28 81.54
Clairvoyant 80.55 85.06 86.98 88.03
pods @25k @50k @75k @100k
Lru 70.54 78.85 83.25 86.12
WindowTinyLfu 75.70 81.46 83.71 86.04
AnnealingWindowTinyLfu 74.49 81.69 84.78 87.19
Clairvoyant 83.13 88.44 91.00 92.52

[ ... ] Those cases include analytical, search, and database applications which are already memory-intensive. When analyzing those traces we see Caffeine significantly outperform Cache2k.

Can you please share more detail about these observations, especially name and source of the traces, so this can be reproduced?

I am curious because for all database traces I have analyzed so far (Oltp, OrmAccessBusytime/Night, UmassFinancial1, UmassFinancial2) cache2k performs better than Caffeine. The results are published in the blog article Java Caching Benchmarks 2016 Part 2.

The OLTP and Financial traces are the transaction logs, which means they are a record/replay of commits, which has a low impact on performance. It is not the critical database page cache, which may include scans, et al. The ORM traces behave similarly to what one would expect from a Hibernate / JPA cache with frequent invalidation and flushes. That too is not similar to many webapp usages which try to avoid db calls entirely, though I agree it's a more practical example. I'd have to dig in and compare again, but off-hand I know cache2k had less than 1% on the glimpse trace. All of this was discussed in private email, so these are not new revelations. Therefore, please refrain from comments that appear to be evangelical, but otherwise I'm glad to have your input.

Gil plans on pushing some new work of his soon which looks promising. Instead of adapting the window, he adjusts the reset interval. This ages faster (recency) or slower (frequency) and resides within the TinyLFU filter. We'll have to run more extensive tests to evaluate its robustness. It may be the ticket or we might merge other ideas previously discussed.
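Purely as an illustration of the general shape of that idea (this is not Gil's actual design), the reset interval could be nudged by the same kind of feedback:

```java
// Hypothetical sketch: shrink the sample period to age faster (favor recency)
// or grow it to age slower (favor frequency), reversing course whenever the
// last adjustment hurt the hit rate.
final class AdaptiveResetInterval {
  private int samplePeriod;
  private int direction = 1;           // +1 grows, -1 shrinks
  private double previousHitRate;

  AdaptiveResetInterval(int initialSamplePeriod) {
    this.samplePeriod = initialSamplePeriod;
  }

  /** Called at the end of each sample period with the hit rate observed in it. */
  int nextSamplePeriod(double hitRate) {
    if (hitRate < previousHitRate) {
      direction = -direction;          // the last adjustment hurt, so reverse
    }
    previousHitRate = hitRate;
    int stepSize = Math.max(1, samplePeriod / 10);
    samplePeriod = Math.max(1, samplePeriod + (direction * stepSize));
    return samplePeriod;
  }
}
```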

Therefore, please refrain from comments that appear to be evangelical, but otherwise I'm glad to have your input.

Commenting on a statement like "Caffeine significantly outperforms Cache2k" for many different application traces, which cannot be reproduced without further information, will probably always "appear evangelical".

I'd have to dig in and compare again, but off-hand I know cache2k had less than 1% on the glimpse trace.

Yes. I also published that along with the blog article at Java Caching Benchmarks 2016 Part 2 graphs. We can see that there are traces and cache sizes where Caffeine performs better and also others where cache2k performs better. I cannot see that one cache is significantly outperforming the other in "analytical, search, and database applications". Maybe there is a slight tendency that Caffeine is doing better with the (legacy) short traces and small cache sizes, while cache2k is doing better with the longer and more recent traces.

The Clock-Pro algorithm, plus my augmentation to it, has a lot of tuning parameters. For example, a cache consists of two data sets, "cold" and "hot", with "hot" being the entries that are seen more often. When an entry is considered hot, when it is removed from the hot set, and how big the hot portion is are among the many tunable elements. That means that Clock-Pro is adaptable to different workloads and is not weak on recency- or frequency-biased traces per se. During my experiments I found tuning parameters that made the eviction superior to any other, but only for one specific scenario. The reason cache2k underperforms on Glimpse is that I specifically chose not to tune against it, in favor of performing better on more realistic workloads. What is realistic and important is of course very debatable, so I admit that I specifically tuned against the traces and workloads we see in our internal production environments.

Overall, there is still room for optimization. Looking at the benchmarks, neither Caffeine nor cache2k can claim to be better than LRU in every scenario.

This paper claims an algorithm that automatically adapts to the workload. In this trace it would fall back to behave similar to LRU.
http://www.cs.cmu.edu/~beckmann/publications/papers/2018.nsdi.lhd.pdf

Thanks, @erikvanoosten. I haven't simulated LHD yet and their published code is not runnable. My experience is that sampling policies underperform due to the inability to capture history, and they cannot correct if they make a mistake. At best they tend to match LRU or LFU, but never exceed them. Policies like ARC, LIRS, and TinyLFU can greatly outperform the classic policies.

@ohadeytan and team have been working on adaptive schemes. The work so far has been quite impressive and seems to fix the problem cheaply. That work is under review and he can speak about it, if interested. After it's public I'd like to work with him on integrating the improvements into Caffeine.

FYI, the research paper with our strategy to fix this automatically is now freely available if you follow the README's link (see Adaptive Software Cache Management). This uses ACM's Authorizer service to let you bypass the paywall.

I would like to explore adaptive moment estimation (Adam) as an enhancement to the naive hill climber shown in the paper. The paper demonstrates that the techniques work, but leaves deeper optimizations to the community. In the case of hill climbing (aka gradient descent), the ML community has done very good work on improvements. I am hoping to find some time over the holidays to implement their ideas.

The time/space overhead to the cache was shown to be minimal and the code complexity is low. After I have satisfied myself on the exact algorithm to adopt, I hope to find the time to implement it. This should then correct for this problem without requiring a user-visible configuration.
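For context, the Adam update applied to a single parameter looks roughly like this; the parameter being tuned (the admission window) and how a gradient would be obtained are assumptions, not anything in the library:

```java
// Sketch of the Adam update for one parameter, assuming the "gradient" is the
// observed change in miss rate per unit change of the admission window.
final class AdamWindowTuner {
  private static final double LEARNING_RATE = 0.01;
  private static final double BETA1 = 0.9;    // momentum decay
  private static final double BETA2 = 0.999;  // variance decay
  private static final double EPSILON = 1e-8;

  private double momentum;
  private double variance;
  private int step;

  /** Returns the adjustment to apply to the window size. */
  double adjust(double gradient) {
    step++;
    momentum = (BETA1 * momentum) + ((1 - BETA1) * gradient);
    variance = (BETA2 * variance) + ((1 - BETA2) * gradient * gradient);
    double momentumHat = momentum / (1 - Math.pow(BETA1, step));
    double varianceHat = variance / (1 - Math.pow(BETA2, step));
    return -LEARNING_RATE * momentumHat / (Math.sqrt(varianceHat) + EPSILON);
  }
}
```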

After experimenting with gradient descent (Adam), I've come to the conclusion that it isn't a good fit because I do not know how to provide an instructive gradient equation. The objective is to minimize the Δ(miss rate), which can lead to converging at a low plateau. If the initial configuration is a very poor fit and the first step moves in the wrong direction, the hit rate might only change by 0.02%. Then it converges immediately, which is not what we want. The usual solution to this is random restarts at other configurations, which adds the overhead of shifting dramatically and doesn't seem worth it. Alternatively, giving it a few fake gradients to start with works, but that feels error-prone and might break down in practice.

In comparison, hill climbing doesn't get trapped and decaying to convergence seems more robust. It lacks some of the advanced features (momentum), but doesn't seem to need them yet. If we later obtain traces where hill climbing fails then we can add various improvements.
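A minimal sketch of that kind of climber (the parameter, bounds, step size, and decay rate are illustrative, not Caffeine's actual values):

```java
// Sketch of a hill climber with a decaying step size, tuning a hypothetical
// admission window percentage. Not Caffeine's actual implementation.
final class DecayingHillClimber {
  private double windowPercent = 0.01;  // start at a 1% admission window
  private double stepSize = 0.05;       // initial step, decays over time
  private double direction = 1.0;
  private double previousHitRate;

  /** Called once per sample period with the hit rate observed during it. */
  double climb(double hitRate) {
    if (hitRate < previousHitRate) {
      direction = -direction;           // the last move hurt, so reverse
    }
    previousHitRate = hitRate;
    windowPercent = clamp(windowPercent + (direction * stepSize));
    stepSize *= 0.98;                   // decay toward convergence
    return windowPercent;
  }

  private static double clamp(double value) {
    return Math.max(0.0, Math.min(1.0, value));
  }
}
```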

See this sheet for examples.

Now that I have conclusions on the different approaches, I'm ready to start doing the development in the library.

IIUIC the usefulness of the smarter algorithms grows with the dimensionality of the problem. I'm not sure how many dimensions your problem has, but it's surely nothing like neural networks, where the Ada* algorithms excel. I guess the efficiency of the smarter algorithms gets partly lost due to noise, and caching performance is IMHO pretty noisy.

There's also some "inertia" (unrelated to the momentum used by the optimizers) due to the cache content being dependent on the previous settings (IIRC this is what killed my ideas when I played with it last year).

A naive question: could some kind of simple sampling approach help pick the right caching policy? What I mean is having two pools with different strategies (eg LRU and LFU), split requests between them randomly and pick the one with better hit rate metrics.

There are papers that try to do this by running multiple policies at once, but that is wasteful. A straight LRU or LFU is not a peak performer, as they don't retain history outside of the working set, which has proven to be very valuable for improving hit rates.

A better case is MiniSim, which samples requests to model a cache at different sizes. However, sampling requests loses the recency/frequency information needed to indicate policy choices, so it's not appropriate when adapting towards that goal.

In a policy that is split between recency & frequency, sampling to see which portion received more hits is unbalanced. When one portion is larger than the other, it skews the probability, even if divided by the region's size.

Instead we sample to compare if a configuration change had a positive impact. That’s very straightforward and climbing works well. I am trying to steal bits of time to implement it for a release soon, I hope.

Thanks @ben-manes, the MiniSim paper was interesting!

@phraktle I have an early branch with the adaptivity working - very simple to implement. I need to amortize the adaptation, write tests, etc., but it works. Sorry I haven't had any time; I'm trying to steal some where I can, but it's hard.

No rush on my account @ben-manes, at this point I just have an intellectual interest in this problem as I’m no longer involved with the project where this came up :)

The results from a mixed workload are pretty neat. At a maximum size of 512, Caffeine starts with the 1% window running Corda's small trace, then runs Lirs' loop trace, and then goes back to Corda's. That case might occur when nightly batch jobs run, shifting the request characteristics away from normal users. We see the adaptation kick in and the cache reconfigure itself to the new environment.

Policy Hit rate
Optimal 44.53 %
Arc 11.64 %
Lirs 32.49 %
Lru 11.64 %
Caffeine 2.6 22.81 %
Caffeine 2.7 41.31 %
Cache2k 0.38 %

(Included cache2k as maybe our ideas can be applied to improve its use of ClockPro)

[chart: hit rates for the mixed workload]

Included cache2k as maybe our ideas can be applied to improve its use of ClockPro

Thanks Ben! The results indeed look very promising. I opened an issue in cache2k so I have it on my radar.

and we've merged! 😺

Released 2.7

@ben-manes, I'd like to reproduce your tests above. Unfortunately, it seems the traces, especially the Corda one, aren't available. Can you give some pointers or make those traces available?

Corda trace files and parser are in the repository:
https://github.com/ben-manes/caffeine/tree/master/simulator/src/main/resources/com/github/benmanes/caffeine/cache/simulator/parser/

I paired the larger one with Lirs loop x5 for a nice adaptation chart in the wiki, taking the hit rate over sample periods.

You might be interested in the recent ClockProPlus paper, which applies ARC-style adaptivity onto ClockPro. That leverages the ghost entries you already have, whereas I use classic hill climbing, which doesn't need them.

@ben-manes great work! Everything is in the repository and I missed it ;)

Do you have a reference on that paper?