petergoldstein/dalli

Loss of Connections after upgrading to dalli 3.2.3

phnx311 opened this issue · 3 comments

Hello team! Thank you for all the work you do maintaining this gem. Our service is experiencing sudden drops in connections to our memcached instance after upgrading from dalli 2.7 to 3.2.3 and switching from DalliStore to MemCacheStore. I've uploaded a screenshot of the CloudFront dashboard that shows these drops in active memcached connections. Sometimes the drops seem to happen at random; other times we observe them right after a deployment of new k8s pods. When it does happen it takes hours to recover, but it does eventually recover.
The service is on Rails 6.1, Ruby 2.7.6 (in the process of upgrading), and memcached 1.6.6, with the Puma server in cluster mode (2 workers). Not sure if this is relevant, but when we were using DalliStore we had a block in the Puma configuration (puma.rb) that explicitly reset the connections on worker boot, like so:

on_worker_boot do
  Rails.cache.reset
end

But that method is no longer available on MemCacheStore, so that block was commented out. When we began seeing the dropped-connection behavior we immediately attributed it to this commented-out code, so I tried creating a custom cache store that extends MemCacheStore and gives access to the underlying Dalli client, as DalliStore used to, by exposing @data through a dalli method. I then called the reset method directly on it like this:

on_worker_boot do
  Rails.cache.dalli.reset
end

but continued to see the drops. At this point I was just throwing spaghetti at the wall to see what sticks.
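
For reference, the custom store was roughly shaped like the sketch below (illustrative only, not our exact code; it assumes MemCacheStore still keeps its Dalli client in @data):

require "active_support/cache/mem_cache_store"

# Hypothetical subclass that exposes the underlying Dalli client, which
# ActiveSupport::Cache::MemCacheStore keeps in its @data instance variable.
class ResettableMemCacheStore < ActiveSupport::Cache::MemCacheStore
  def dalli
    @data
  end
end

# Then, illustratively, in config/environments/production.rb:
# config.cache_store = ResettableMemCacheStore.new("memcached-host:11211")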

Was just wondering if anyone else has seen something similar after upgrading. Apologies in advance if this is totally unrelated to the gem upgrade and inappropriate for this forum, but the behavior appeared right after upgrading. Hoping for some guidance here if anyone knows what may be happening.

[Screenshot 2023-03-09 at 9:31:13 AM: dashboard showing drops in active memcached connections]

Hi @phnx311

I'm not sure why this is a concern. I don't have any data on your application, but this looks like typical connection pool behavior.

When you do a deploy, the connections from the pool in the old pod disappear, and the active connection count drops to zero. Connections will start to be created based on load and contention on the pool. Over time, there may be more contention on the pool driving the creation of more connections.

Because connection_pool doesn't reclaim idle connections, these connections will stay open over time. That's why the total number of connections eventually levels out at a maximum. You can configure the maximum size of the pool with the size parameter if you're seeing performance problems.
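
For illustration, a hedged sketch (exact option names depend on your Rails version; Rails 6.1's MemCacheStore takes pool_size/pool_timeout, while newer versions use pool: { size:, timeout: }):

# config/environments/production.rb -- cap the pool at 5 connections per process
config.cache_store = :mem_cache_store, "memcached-host:11211",
  { pool_size: 5, pool_timeout: 5 }

# Under the hood this uses the connection_pool gem, which creates connections
# lazily on checkout and does not reclaim idle ones:
#   pool = ConnectionPool.new(size: 5, timeout: 5) { Dalli::Client.new("memcached-host:11211") }
#   pool.with { |client| client.get("some_key") }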

Random drops may occur if the connections in the pool become invalid (usually for a network related reason) and some connections are dropped from the pool.

Thanks for getting back, and for confirming that this is expected behavior. After I posted this I saw the same pattern in the charts for other services as well. For some more context, the application with the issue is currently seeing errors because, occasionally, a cache key that is set at the beginning of a controller action is nil when read toward the end of that same action. There is no logic that clears or deletes the cache between those two events.
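
Roughly, the pattern looks like this (names here are hypothetical, just to show the shape):

class OrdersController < ApplicationController
  def show
    summary = build_summary
    Rails.cache.write("orders/#{params[:id]}/summary", summary)

    # ... other work in the same action, with no deletes or cache clears ...

    cached = Rails.cache.read("orders/#{params[:id]}/summary")
    # cached is occasionally nil here, even though it was written just above
  end
end
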
Again, I appreciate the confirmation of connection pool behavior. I will close this issue unless of course you have some insight into the context I gave above.

Closing based on response.