petergoldstein/dalli

Dalli sometimes returns incorrect values

cornu-ammonis opened this issue · 18 comments

Hi, thanks for all your work maintaining this repo! I am a security engineer investigating some incorrect Rails caching behavior observed after upgrading to the latest version from 2.7.x, and have identified a mechanism consistent with my findings:

There is a race condition in the pipelined getter implementation which allows for socket corruption. The key problem is that a lock (request_in_progress) is acquired on the connection only after sending the getkq requests; if something happens between sending those requests and acquiring the connection lock, the socket could be left in a state which is both corrupt and (superficially) ready to accept future requests.

In the corrupted state, subsequent operations which result in data being read from the socket will have incorrect behavior, because they expect that the next bytes on the socket are the responses to their own requests but the unread getkq responses will precede their own responses. This could cause the client to return incorrect values for given keys because the writing and reading operations are out of sync / "off-by-x", where x is the number of responses to the getkq requests.

Suggested fix

I propose we should instead acquire a connection lock before writing the getkq requests to the socket. This means that if something happens after writing the getkq requests but before reading their responses, the connection will be left in the locked (request_in_progress) state and will refuse to accept additional requests (resetting instead). I am opening a PR which attempts to do this.
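
As a rough sketch of that ordering (reusing the method names visible in the POC below; lock_connection is a hypothetical placeholder, not Dalli's actual API):

# Sketch only - lock_connection stands in for however the request_in_progress
# state is actually acquired in the PR.
def setup_requests(keys)
  groups = groups_for_keys(keys)

  # Mark every involved connection busy BEFORE any bytes are written, so an
  # interrupt between writing and reading leaves the connection locked and it
  # gets reset rather than silently reused.
  groups.each_key { |server| server.lock_connection }

  make_getkq_requests(groups)
  finish_queries(groups.keys)
end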

Open question

Note that reading the 2.7.x code, it seems that it could have had the same issue here. I am unsure if something else changed to make this failure case more likely - or if it was just a coincidence. The time window for a timeout/failure leading to this bug is narrow, so this would certainly be a rare event with randomness at play; nonetheless I observed the incorrect behavior not long after upgrading to 3.2.3 which is suspicious and I would like to be able to explain what, if anything, changed.

Demo/POC

In a Rails console, I used the following steps to reproduce this issue:

# code to produce a timeout on the 3.x Dalli gem
# simulates a process timeout occurring between make_getkq_requests and finish_queries
Dalli::PipelinedGetter.class_eval do
  def setup_requests(keys)
    groups = groups_for_keys(keys)
    make_getkq_requests(groups)

    # widen the window between writing the getkq requests and reading their
    # responses so the external Timeout below reliably fires in between
    sleep 10

    # TODO: How does this exit on a NetworkError
    finish_queries(groups.keys)
  end
end

c = ActiveSupport::Cache.lookup_store(:mem_cache_store)

# populate some cache entries by doing stuff like 
c.fetch("test10") { "test10result" }
c.fetch("test1") { "test1result1" }
c.fetch("test2") { "test1result2" }
c.fetch("test3") { "test1result3" }

# trigger a timeout during read_multi
Timeout.timeout(5) do
  c.read_multi("test1", "test2", "test3", "test4")
end

# this should now output the wrong result
c.fetch("test10")

=> "test1result1"

Thanks for submitting.

So I am somewhat skeptical that this is an actual issue, in part because the reproduction code provided uses Timeout to demonstrate the supposed problem. For a discussion of why Timeout is bad, see this post or this post. The first one is by the original author of dalli.

Wrapping I/O code in timeout blocks (as opposed to using built-in connection timeout settings) like that is an anti-pattern and should never be used in production code.

What's happening here is that you're treating the Timeout like a non-fatal error for the underlying connection (which it absolutely is not) and then reusing the connection. So of course the socket is corrupt.

Note that if this were an open, read, or write timeout generated at the connection level, it would get properly translated into Dalli::Error subclasses and the underlying connection would be discarded. That's the expected (and correct) behavior.
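
For reference, the built-in route looks roughly like this (socket_timeout is a documented Dalli::Client option; the value shown is arbitrary):

# Prefer connection-level timeouts over Timeout.timeout: a slow or dead socket
# then surfaces as a Dalli::Error subclass and the connection is discarded.
require "dalli"

client = Dalli::Client.new("localhost:11211", socket_timeout: 1.0) # seconds, per socket operation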

If you'd like to submit an example that manifests the issue without using Timeout, Thread.raise, or any other method of generating an arbitrary exception at a point in code that doesn't kill the Thread and/or process, I'd be happy to take a look. But I suspect this is largely an artifact of the use of Timeout.

Thanks for your response. I agree my examples are arbitrary - their purpose is to demonstrate that an inopportune timeout is sufficient to reproduce the issue, not that one should actually be wrapping individual cache calls in their own Timeouts. In a more realistic production scenario, it could be something like rack-timeout around the entire request.

I hear you that Timeout is fundamentally dicey; you are absolutely correct about its risks. My argument is that this is still worth addressing as defense-in-depth, because this specific failure mode is so severe. Even if we grant that Timeout is a total anti-pattern and nobody should use it, it is worth thinking about levels of risk. There are more and less catastrophic outcomes possible with timeouts. A caching system returning incorrect values is among the worst outcomes I can imagine, due to the potential for uncontrolled data exposure.

In practice, request timeouts are still used in complex systems because it is not possible to rely entirely on low-level I/O and library timeouts: the total request time is what matters, not any individual operation. And killing any process that takes too long has its own downsides. Given that request timeouts are commonly used, this is far from a purely theoretical risk; it is a failure mode that could - and has - occurred in production systems.

As with any defense-in-depth argument, it is also worth considering unknown-unknowns. Unicorn and Puma have a variety of signals, and it is hard to be certain none of them could ever result in interrupting a process partway through. It goes without saying that in complex/high-traffic systems, weird failure cases are guaranteed to happen eventually.

Further, the defense-in-depth solution I have proposed is far from radical: an opt-in argument to use an already existing method in the memcached spec, complemented by an approach which locks around the combined write and read components of get_multi. Are there serious downsides to exploring either? Otherwise the balance of risk seems highly favorable to addressing the issue, even though it is a tail risk. I am happy to refine the PRs and add tests as necessary.

So this is not a response to my objection, and it is a misunderstanding of the phrase "defense in depth".

Your "realistic" example represents a serious misunderstanding. With a proper use of rack-timeout this issue does not occur. The problem is not that a timeout occurs. The problem is that you are re-using something AFTER a fatal timeout occurs. It is, to be blunt, a coding error.

To cover the rack-timeout example, let's actually work that through:

  1. Rack starts web request
  2. Dalli::Client is checked out from connection pool (either inside the cache implementation or manually)
  3. Rack triggers a timeout when the Dalli::Client is in use (ergo checked out of the pool)
  4. Dalli::Client is not returned to the pool and IS NEVER USED AGAIN. Only by reusing the Dalli::Client could you possibly encounter the issue you claim.

So if the client is not reused, the error cannot occur. The error is that there's a misguided attempt to recover the Dalli::Client from a discarded request.

In practice, request timeouts are still used in complex systems because it is not possible to rely entirely on low-level I/O and library timeouts: the total request time is what matters, not any individual operation. And killing any process that takes too long has its own downsides. Given that request timeouts are commonly used, this is far from a purely theoretical risk; it is a failure mode that could - and has - occurred in production systems.

Production systems contain lots of bugs. Whoever wrote this one doesn't understand how a request lifecycle should work. If the application discards the request, then the application should discard the resources (database connections, API connections, socket connections) that are currently in use and allow its connection pools to regenerate them. If it doesn't do that, then it's simply built wrong.

Honestly, your response has just firmed up my original suspicion that this is a non-issue. It demonstrates a misunderstanding of how to use persistent resources in request-driven systems.

Software systems are not magic, they have rules, and guarding against supposed "unknown unknowns" is not only not feasible, but actively counterproductive.

Moreover, your examples are not "unknown unknowns" - they are active misuse. If a developer chooses to actively misuse a library, it's not the responsibility of the library to save them from themselves.

Closing.

I'm not sure I understand your point about connection pooling, because not all Dalli configurations involve a connection pool; in fact, the default one does not, per the Rails docs [edit - this is changing in Rails 7.1!]:

By default, :mem_cache_store and :redis_cache_store use a single connection per process.

For an unthreaded server like Unicorn, your claim that the connection will never be used again is absolutely not true. There is one connection per process, and the process survives the timeout, so the next request uses the same connection. That's what happens with the default Rails <> Unicorn <> Dalli configuration anyway, so it is still a real danger worth addressing.

I would very much be open to alternative approaches that could prevent a connection from being used after a timeout in a non-pooled configuration.

I am sorry if this conversation has frustrated you, but I do not understand why you are being so dismissive, or why you are resorting to unnuanced assertions that whoever built the system doesn't understand how things should work. You are arguing that only active developer misuse can cause this issue - I can demonstrate to you that a default configuration with rack-timeout, which many, many Rails apps use, is sufficient to cause it. If the default setups are "misuse" in your view, that is perhaps a valid problem in itself, but it doesn't make the scenario worth ignoring in order to preserve unsafe defaults.

Hi @petergoldstein, sorry that we got off on the wrong foot last week. I've come to better appreciate your points about timeouts and request/resource management. It's clear to me now why Timeout.timeout is unsafe, and why any request timeout which does not fully restart the process would be playing with fire, even if library/application code tries to make it safer. mperham/connection_pool#75 (comment) was another interesting example of this. So obviously you're right that the library isn't responsible for protecting against timeout misuse. The fundamental safety problem there is the timeouts themselves.

I'm carrying that lesson into some other Rails codebases I'm involved with, so I'm grateful you took the time to engage with me and explain it.

However, I'm now wondering if my misguided timeout examples were just a red herring, because they don't help to explain why I and others saw this issue occur right after upgrading. I found some other historical issues in this repo referencing the same or similar behavior, such as:

I spoke directly with an engineer at another company who had mentioned seeing the same behavior a few years ago, and who shared that it happened again recently:

This seems to happen whenever we upgrade the gem. Recently we saw the same behavior during a Rails version upgrade.

Unfortunately we did not have a solution but mitigated the issue by ensuring that the different versions of the app did not write to the same cache namespace.

I'm curious if you can think of any reason the version upgrade process itself might cause issues? E.g. subtle formatting differences between versions, such that interacting with entries set by the previous version might have unexpected behavior? Just a shot in the dark there. I'm following the advice above by putting the Dalli version in our cache namespace to effectively bust cache when upgrading, but nonetheless curious about what could be going on.

I also see that there have been issues in the past related to reading get_multi responses, which were addressed in #321 - but it seems similar enough to what I've seen that it's worth mentioning.

Would any of those additional reports influence your evaluation of whether it could be worth exploring some simple logic in Dalli's get implementation to mitigate the possibility of incorrect responses? As the comment on #123 says, debugging is near-impossible with this kind of issue; but I wonder if there might be a simple/general failsafe to pursue either way. Again, it's really unclear if there's a bug in Dalli at all, or how we could find it if there was, so I'm not saying you or the library have done anything wrong - just wondering if we might nonetheless be able to mitigate this rare failure mode for other users of the library.

Thanks again for your insights!

Honestly, none of the above makes much sense to me. So no.

Okay, that tracks.

For any readers of this thread who do share my concern about the multiple independent reports of Dalli returning incorrect values after upgrading - or for the next team who runs into it and comes looking for a solution: one possible approach is our fork's get implementation, which enforces that only values for the correct key will ever be returned: aha-app@f6da276
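
(For readers who only want the general shape of that mitigation, here is a simplified illustration - not the fork's actual code: because get_multi returns a key => value hash read off the wire, a caller can refuse to accept any value that the server did not attribute to the exact key requested.)

# Simplified illustration of the key-checking idea, not the aha-app fork's code.
require "dalli"

class CheckedGet
  def initialize(client)
    @client = client
  end

  # Only return a value the server attributed to this exact key; a response
  # that arrived out of sync will carry a different key and be dropped.
  def get(key)
    @client.get_multi(key)[key]
  end
end

checked = CheckedGet.new(Dalli::Client.new("localhost:11211"))
checked.get("test10")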

This seems like a valid issue to me. There could be many ways to corrupt a socket; even if Dalli is a memcached library where speed is critical, we still should value correctness over speed.

@petergoldstein If you'd like me to take this one and shepherd it through the PR process, I'd be happy to take it off your plate.

@mperham I'm sorry, but I wholly disagree that this is a valid issue.

There's no "there", here. No demonstrated issue of any actual corruption. Just a bizarre search through over a decade of past issues in search of something, anything, that might match the pattern.

The original supposed problem - which was basically a simple thread safety issue - isn't even in the mix anymore. There's an "agreement" here that the original problem is simply user error, but the submitter is still asserting that there's a need based on ... what exactly?

Not to mention the architecture issues with the submitted PR, which I noted in my comments (in a much more subdued form).

I closed this quite deliberately, and held my tongue a bit in my responses. But to be clear I firmly believe that merging this code would be an ethical breach inconsistent with my role as maintainer.

@petergoldstein Forgive me if I'm misunderstanding but his issue seems to be: a Dalli connection can be corrupted in such a way that a request for B can result in the response for A being returned. A timeout signal is one method of corruption and he gave sample code that shows it. I certainly try to avoid Timeout usage myself but I still try to make my gems as reliable as possible even in the face of its usage. Timeout is directly to blame for this grossness in connection_pool.

Socket buffer corruption is a common issue with many network client libraries, and that's the purpose of the opaque slot in the memcached request header: Dalli can fill it with a nonce when preparing the request and then verify that the response we get back matches that nonce; otherwise the client and server have gotten out of sync and the connection should be reset. WDYT?
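
To make that concrete, here is a rough standalone sketch against the raw memcached binary protocol (illustration only, not Dalli's code): stamp a random nonce into the 4-byte opaque field of the request header and reject any response whose header echoes back a different nonce.

# Raw binary-protocol GET with an opaque nonce check - sketch only.
require "socket"
require "securerandom"

HEADER_FMT = "CCnCCnNNQ>" # magic, opcode, key len, extras len, data type,
                          # vbucket/status, total body len, opaque, cas

def checked_binary_get(sock, key)
  opaque = SecureRandom.random_number(2**32) # per-request nonce
  header = [0x80, 0x00, key.bytesize, 0, 0, 0, key.bytesize, opaque, 0].pack(HEADER_FMT)
  sock.write(header + key)

  _magic, _opcode, key_len, extras_len, _type, status, body_len, resp_opaque, _cas =
    sock.read(24).unpack(HEADER_FMT)
  body = body_len.positive? ? sock.read(body_len) : ""

  # A mismatched opaque means the stream is out of sync with our requests.
  raise "connection out of sync, discard it" if resp_opaque != opaque

  return nil unless status.zero?  # e.g. key not found
  body[(extras_len + key_len)..]  # strip extras (flags) to get the value
end

sock = TCPSocket.new("localhost", 11211)
puts checked_binary_get(sock, "test10").inspect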

Sorry for chiming in - I know it's annoying for maintainers to get lots of responses on closed issues like this. I'd just like to propose an alternative solution to cleanly handle asynchronous interrupts.

What redis-rb and now redis-client do is simply keep a counter of pending reads, incrementing it before sending a query and decrementing it after successfully reading a response. This allows very cheap validation that the connection isn't in a corrupted state, e.g. redis-rb/redis-client@5f82254
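
A generic sketch of that idea in Ruby (not redis-client's or Dalli's actual code; the gets call stands in for real protocol parsing):

# Generic pending-reads counter around a socket - sketch only.
class TrackedConnection
  class StaleConnectionError < StandardError; end

  def initialize(socket)
    @socket = socket
    @pending_reads = 0
  end

  def call(request)
    # A non-zero counter means an earlier caller wrote a request but was
    # interrupted before consuming the reply, so the socket can't be trusted.
    raise StaleConnectionError, "unread responses on socket" unless @pending_reads.zero?

    @pending_reads += 1
    @socket.write(request)
    response = @socket.gets # stand-in for the protocol-specific response read
    @pending_reads -= 1
    response
  end
end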

Perhaps such an implementation would be more acceptable to you? I also wholeheartedly agree that Timeout.timeout shouldn't be used, but like Mike I believe that when a library can reasonably detect bad usage like this, it is preferable to handle it cleanly, whether by discarding the connection or by throwing an error telling the user they did something wrong.

@mperham No, I don't quite agree that's what's going on here.

What was initially reported was a case of pure user error. A non-thread-safe Dalli client was passed around after a Timeout. Unsurprisingly, it was corrupted. Feel free to read my comment above - I think it's a correct summary. I don't believe Dalli has any responsibility to be "safe" in that situation. And I don't believe that the situation is trivial to detect and/or error on. Not impossible, but it would require some reasonable work. Perhaps this is where we disagree.

A solution was proposed that adds another method to the Client interface and replaces use of memcached's get with getk and "checks" the key on return. I noted my general problems with the proposed implementation - I don't think it's coded well or architected well. I think breaking the abstractions in the gem is a really bad idea. It added a weird side method, implying that the standard method usage was somehow unsafe. And I don't think the tradeoffs make any sense - guarding against this supposed problem becomes very expensive, both in terms of performance and maintainability.

Despite an agreement that the original problem was user error, the proposed solution was nonetheless viewed as a solution to some set of supposed corruption issues dug out of the depths of Dalli issues (3 over a decade of issues). And then there's this one thing he heard from a guy he works with - that they always change the cache key when they upgrade Dalli versions (which would have nothing to do with socket corruption, but hey...) - so obviously it's all the same problem. And so the code should obviously be merged...

Well, no.

Now, on to your comments. It is certainly possible for a socket to become corrupt. It can happen for reasons other than user error, and it's nice to have tooling that mitigates that. It's even nicer if that mitigation helps in cases of user error. Assuming the tradeoffs aren't too onerous. In my view, given the infrequency of the issue, the tradeoffs have to be pretty small in terms of maintainability, API coherency, and performance.

That said, that doesn't mean a mitigating PR is not possible. It would just need to meet that tradeoff threshold.

As you note, in the memcached case the opaque field can be used for this purpose. So one approach is to use that. I'd absolutely consider a PR that:

  1. Adds the optional use of an opaque value across memcached get (and possibly other request types) without requiring new Client methods (I'd even be open to discussing signature changes, although that can be challenging with Rails dependencies)
  2. Works in a mostly consistent fashion across both the (now deprecated) binary and meta protocols.
  3. Integrates with the existing Dalli::Error scheme in a sensible way, and handles the underlying connection cleanly.

@casperisfine I'd also be open to something like you're proposing, although I think it might be more complex here because of memcached pipelining (a naive port of the Redis code seems unlikely to work). But if you want to take a swing at it, that'd be fine too.

although I think it might be more complex here because of memcached pipelining (a naive port of the Redis code seems unlikely to work).

So I had a quick look, and indeed it will have to work a bit differently, but it should be doable. Unlike Redis, with memcached pipelining you don't know how many responses to expect, but you do know when you are done.

So I think incrementing by one before sending the pipelined command, and then decrementing by one once the pipeline is done reading would work. But again, I just had a quick look.
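
In sketch form, that would be roughly (the helper names are hypothetical placeholders, not Dalli's internals):

# Pipelined variant of the counter idea - placeholders only.
def pipelined_get(keys)
  @pending_reads += 1        # one pending batch of responses
  write_getkq_batch(keys)    # quiet getk per key, followed by a terminating noop
  results = read_until_noop  # drain responses until the noop reply arrives
  @pending_reads -= 1        # only now is the socket known to be back in sync
  results
end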

But if you want to take a swing at it, that'd be fine too.

I'm at Ruby Kaigi for about a week, so @cornu-ammonis, if you feel like trying this, help yourself; otherwise I'll try to have a stab at it once I'm back.

Proposed patch #963

This is causing catastrophic corruption in our app. The solution is hugely appreciated, but can you kindly release a new version? :)

@rgaufman if it's really catastrophic, surely you can run it as a git gem right away?

@rgaufman if it's really catastrophic, surely you can run it as a git gem right away?

This is what I'm doing, but a release would be much appreciated as this seems like quite a critical issue that can cause data corruption, information leaks, etc.