openstreetmap/operations

There's still a need to bump the memcache size

jidanni opened this issue · 23 comments

Hello. In openstreetmap/openstreetmap-website#2457 I was told to open an issue here. But as it is getting a little over my head, I will just leave this here.

There is no evidence at all in the graphs that this is in fact an issue. I definitely see the issue that you are referring to, but I am unable to explain it, as all the evidence says it shouldn't be down to memcache.

Could these session disconnections be caused by server restarts, or does the server never restart?

I don't see any server restart in the stats, at least for the last 6 months: https://prometheus.openstreetmap.org/d/l4zgNUdMz/memcached?orgId=1&refresh=1m&from=now-6M&to=now

Also, the OP didn't provide any details on how frequently they have to log in again. There might be external factors, like cookies being removed by the browser or some browser extension, etc.

I thought everybody else also has to log in again at least once every three or four days.
Maybe it's because I use various browsers on various devices. But why, on the same device, do I need to log in again after three or four days?
Anyway, you're welcome to check the logs to see why user jidanni has to log in again so often.

Which stat do you use to check if the server restarted?
Aren't these sudden drops in memory usage symptoms of a server restart?
Note that the dates are in the format month/day.
[Screenshot 2024-07-07: memcached memory usage graph showing the sudden drops]

Ah, the link wasn't that helpful. There are about 11 memcached instances overall. However, for the 3 frontend servers, only 3 memcached instances (spike-06 ... spike-08) are relevant. Items in cache and memory usage are fairly stable for these three.

https://prometheus.openstreetmap.org/d/l4zgNUdMz/memcached?orgId=1&refresh=1m&from=now-6M&to=now&var-instance=spike-06&var-instance=spike-07&var-instance=spike-08

I think this should match the following config in chef: https://github.com/openstreetmap/chef/blob/45dc24b65b23a6c1dcc2f0ba2aa971563555c35e/roles/web.rb#L20

A restart would indeed lose all sessions, but as @mmd-osm says it's only those three machines that we're talking about here, and they last restarted in November last year:

[Screenshot: uptime of spike-06, spike-07 and spike-08, last restarted in November]

At that time it took nearly two months for the caches to fill up, which suggests that it should take about that long for things to get expired, unless there has been a significant increase in cache usage since.

The eviction rate has increased since November, but it hasn't consistently been more than double. Commands/second has remained the same.

I logged back in 5 days ago: 1 day later my session was still active, but today I'm logged out.
We can also see a dip today from ~100 million items in cache to ~66 million.

I suggest storing the sessions in the DB and using memcache only to speed up session checks for frequently used sessions.
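A rough sketch of what that could look like in Rails, assuming the activerecord-session_store gem (the memcached read-through layer would still need custom code, and the key name here is illustrative, not taken from the actual site):

# Gemfile
gem "activerecord-session_store"

# config/initializers/session_store.rb
# Sessions are persisted in the sessions table instead of living only in
# memcached, so a cache restart no longer logs everyone out.
Rails.application.config.session_store :active_record_store,
  key: "_osm_session"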

One of the machines was rebooted yesterday while fighting the DDoS, so 1/3 of the cache entries were lost.

I'm wondering how many of these entries originate from CGImap (the key prefix would be "cgimap:"). For some reason, these entries have the expiration value set to 0 (unlimited). That doesn't make a whole lot of sense for rate-limiting requests, where the exact timestamp at which these entries become irrelevant is known upfront.
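For illustration, such a counter could carry its known lifetime as the TTL, e.g. with the dalli gem (a sketch; the key, client address and window size are made up):

require "dalli"

cache = Dalli::Client.new("localhost:11211")
window = 3600 # seconds; the counter is irrelevant once the window has passed
# Increment the per-client counter, initialising it to 1 with the window
# as its TTL instead of 0 (never expires).
cache.incr("cgimap:ratelimit:203.0.113.5", 1, window, 1)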

At least when testing locally, I've noticed that every anonymous user creates a Rails session without expiry (that's the "0" in "1 0 73" below), whereas logged-in users get an entry with a 4-5 week expiry.

Anonymous user sessions:

/usr/share/memcached/scripts/memcached-tool localhost:11211 dump 
Dumping memcache contents
add rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab 1 0 73
{I"_csrf_token:EFI"096xa2ms9DVncEF7CBUeBJ0wP9VYJrKO6lzxqDomep74;F

Logged in user:

Expires at 1723288155 = Sat Aug 10 13:09:15 CEST 2024

add rails:session:2::2d28d018bdda81f05bae57ba42ee200a7a14af6df74134bb93ee82f99bf7baab 1 1723288155 200
{	I"_csrf_token:EFI"096xa2ms9DVncEF7CBUeBJ0wP9VYJrKO6lzxqDomep74;FI"	user;FiI"fingerprint;FI"E....
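(For reference: memcached-tool dump prints each entry as "add <key> <flags> <expiry> <bytes>", so the third field is the absolute unix expiry timestamp, with 0 meaning no expiry.)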

Expiry shouldn't really matter that much because anything that isn't used just moves down the LRU list and gets discarded eventually when we need space for a new entry.

Logged-in sessions (with "remember me" checked) do get an expiry of 28 days, which matches the cookie expiry, while other sessions (not logged in, or logged in without "remember me" checked) actually don't have an expiry but issue a session cookie that expires when the browser is closed.
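In Rails terms that boils down to something like this (a simplified sketch, not the actual openstreetmap-website code):

# Hypothetical login handling in a sessions controller:
def successful_login(user, remember_me)
  session[:user] = user.id
  if remember_me
    # Persistent cookie and memcached entry, both good for 28 days.
    request.session_options[:expire_after] = 28.days
  end
  # Without :expire_after the cookie is a browser-session cookie and the
  # memcached entry is written with expiry 0.
end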

First of all, I find it a bit difficult to reason about the logged-in sessions based on Prometheus stats, in particular, after how many days these entries actually get discarded.

memcached has an LRU crawler which reclaims expired entries even before they reach the end of the LRU list. With a non-zero TTL, we might get rid of many "non-logged-in user" entries early on, before they can evict "logged-in user" entries.
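As a hypothetical tweak (not current behaviour), anonymous sessions could be given a short TTL so the crawler can reclaim them early, e.g.:

# In an application-wide filter; 1.day is an arbitrary choice.
request.session_options[:expire_after] = 1.day unless session[:user]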

At the current growth rate, we will likely see some evictions in about 10 days (=21 days after last memcached restart).

@jidanni: did you notice any issues with lost login sessions in the last 8-9 days? If so, it can't be memcached-related…

It's not that simple because only one machine was reset I think? So only keys which hash to that machine are currently exempt from being evicted.
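For context, the cache client spreads keys across the frontends by consistent hashing, so a single restart only empties that machine's share (a sketch using the dalli gem, with the hostnames from the dashboard above):

require "dalli"

# Each key deterministically maps to exactly one of the three servers, so
# restarting e.g. spike-07 loses roughly a third of the sessions while the
# entries on spike-06 and spike-08 survive untouched.
cache = Dalli::Client.new(%w[spike-06:11211 spike-07:11211 spike-08:11211])
cache.set("rails:session:2::example", "payload")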

I think spike-06..08 were all restarted, the aggregated cached items count on Prometheus shows 0 entries about 10 days ago.

@mmd-osm rather than using my misty memory,
surely there must be some internal logs you can check regarding me (user: jidanni)
that can give you precise details.

We want to hear from you first hand, as you’ve also raised the issue. Misty memory is ok. If you say it hasn’t bothered you recently then that’s good enough for now.

What we see in the charts right now is that no entries are being removed. So chances are that your session is still around.

Okay. I will remember next time to report each and every incident right here to the thread.

Okay. Just had to log in again, as you can perhaps see in your logs.

Thank you for the feedback. This is not completely unexpected. Eviction of entries started again on August 1st, even a bit sooner than estimated.

On a laptop I hadn't used in five days:
Had to log in again to OSM.
But didn't need to log in again to GitHub to add this comment.