matrix-org/synapse

Presence is increasingly heavy

spantaleev opened this issue · 19 comments

I'm not sure if #3962 is completely different, but this is something that I've been noticing for a while.

It seems that sending out presence updates whenever my presence changes causes 100% CPU usage for a while on my small VPS.

An easy way to trigger it is to use the riot-ios app. Just opening the app and then sending it to the background causes it to hit /_matrix/client/r0/presence/{userId}/status with PUT requests (updating presence to either online or unavailable; after some more inactivity, a Synapse background task seems to set it to offline).

Doing that causes 100% CPU for a while. I imagine Synapse tries to notify many other servers about the presence change. While this (unimportant) work is going on, Synapse is slow to respond to other requests.

If I just keep alternating between backgrounding and foregrounding the riot-ios app, I can effectively keep my homeserver at 100% CPU.

Normally though, a few seconds after backgrounding the app (which sets my presence to unavailable), a subsequent foregrounding of the app or a /sync by another client of mine (on desktop, say) changes my presence back to online and causes the same thing once again.

Maybe a few things could be investigated:

  • whether riot-ios should try to set presence as unavailable at all, especially given that other clients may be syncing at the same time and telling Synapse I'm online

  • even if a given client says unavailable, whether the server should accept that, given that other clients (devices) may be syncing and setting another status at the same time

  • whether the server should be so quick to accept and propagate a presence status, when said status might change once again a couple of seconds later. /sync is usually called by clients with a long-polling timeout of 30 seconds, so something usually re-sets the presence status within 30 seconds at most. Do federated clients care about sub-30-second granularity? Perhaps presence changes can be debounced for 30-60 seconds before actually kicking in

  • whether propagating presence to other servers should be so heavy. Perhaps the code can be optimized or deprioritized, such that it won't disturb other server operations
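
The debouncing idea from the list above can be sketched roughly as follows. This is only an illustration, not how Synapse works: `PresenceDebouncer`, `submit`, `flush_due` and the 30-second window are all hypothetical names and values. A change is held for a settle window and only propagated if it is still in place when the window ends.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PresenceDebouncer:
    """Hold presence changes for `window` seconds before propagating them.

    Hypothetical sketch, not part of Synapse.
    """

    window: float = 30.0
    # Last state actually propagated, per user.
    sent: Dict[str, str] = field(default_factory=dict)
    # Pending (state, time it was first requested), per user.
    pending: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def submit(self, user_id: str, state: str, now: float) -> None:
        """Record a requested presence state at time `now` (in seconds)."""
        if self.sent.get(user_id) == state:
            # Flipped back to what other servers already know: nothing to send.
            self.pending.pop(user_id, None)
            return
        current = self.pending.get(user_id)
        if current is None or current[0] != state:
            # New or different pending state: (re)start the timer.
            self.pending[user_id] = (state, now)

    def flush_due(self, now: float) -> List[Tuple[str, str]]:
        """Return (user_id, state) pairs that have been stable for `window`."""
        due = [
            (user, state)
            for user, (state, since) in self.pending.items()
            if now - since >= self.window
        ]
        for user, state in due:
            self.sent[user] = state
            del self.pending[user]
        return due
```

With this shape, a background/foreground flip within the window (unavailable, then online again a few seconds later) results in no outgoing update at all, at the cost of every real change being delayed by up to `window` seconds.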


Setting use_presence: false in the homeserver config eliminates the problem, at the expense of presence not working.

I do like having presence, but I don't care about it being so accurate and so fast to propagate (at the expense of other operations).

Hi @spantaleev, thanks for the detailed issue. I think you're probably right that we are too keen to send out presence updates to other servers in the room.

I don't have an immediate fix, but one thing you might do is check your extremities per #1760. The problem is that we can end up doing state res for all your rooms to see who's in there, which is expensive if you have lots of extremities.

Thanks for the tip, @neilisfragile!

I have checked for forward extremities using:

select room_id, count(*) c from event_forward_extremities group by room_id order by c desc limit 20;

The top result has a count of 2, so I'm probably not affected by it.

Maybe the slowness comes from something else?
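
For anyone who wants to try the forward-extremities query above without touching a live database, here is a small self-contained sketch using Python's built-in sqlite3 module. The two-column table layout and the sample rows are made up for illustration (the real event_forward_extremities table may have more columns), and `top_extremities` and `demo_connection` are hypothetical helper names.

```python
import sqlite3

# Same query as in the comment above, only reformatted.
QUERY = """
SELECT room_id, COUNT(*) AS c
FROM event_forward_extremities
GROUP BY room_id
ORDER BY c DESC
LIMIT 20
"""


def top_extremities(conn: sqlite3.Connection):
    """Return (room_id, count) rows for the rooms with the most forward extremities."""
    return conn.execute(QUERY).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory table with made-up rows (simplified schema, an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE event_forward_extremities (event_id TEXT, room_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO event_forward_extremities VALUES (?, ?)",
        [
            ("$e1", "!a:example.org"),
            ("$e2", "!a:example.org"),
            ("$e3", "!b:example.org"),
        ],
    )
    return conn


if __name__ == "__main__":
    for room_id, count in top_extremities(demo_connection()):
        print(room_id, count)
```

A room showing up with a large count is the case #1760 is about; a top count of 2, as reported above, is not.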

Agreed, 2 doesn't sound like a lot.

I can see some conversation in #SynapseAdmins, though it's not clear to me that it solved your problem.

Practically speaking I can't say we'll look at presence in detail in the short term, though we certainly are looking at state resolution and perf more generally, which is likely to have a knock-on effect.

At this point the best I can suggest is that you prune the extremities anyway for the reasons that @richvdh highlighted, and see how that affects presence perf if at all.

Today's conversation in #SynapseAdmins also provoked me to look into #1760 once again.

Running the query at the bottom there generates a completely empty extrems_to_delete table for me.

Interesting. If there are no extremities to delete, it suggests I'm not affected by the extremities problem, and that presence is slow for me for other reasons.

Has this improved since #4942? I'd imagine that the root cause is the same as the bug fixed there (#3962).

I just gave it a try with Synapse v1.1.0.

Looks like it might be better. It's hard to say though, because I happen to be on a new and faster server now.


CPU-wise

When presence is disabled, my load15 average would be 0.1x throughout the day.

When presence is enabled, it seems like load15 average is around 0.4-0.5 with normal use.

CPU usage hits 100% for a significant amount of time when the presence status changes (foregrounding/backgrounding the app). By foregrounding and backgrounding the app periodically in quick succession, I could consistently keep my server's load average above 2.0 (reaching up to 4.0).


Memory-wise

Memory usage is definitely much higher when presence is enabled.

Testing method:

  • restarting Synapse
  • waiting for it to settle for a minute
  • triggering a few presence changes (foregrounding/backgrounding the riot-ios app)

This doesn't say how memory usage grows over a long period of time (hours, days). It definitely does grow even with presence disabled, but we ignore that here.

This is on a server with 4 GB of RAM.

When presence is disabled, after some 10 minutes of use, memory usage for Synapse stays stable at 5%.

When presence is enabled, memory usage for Synapse quickly jumps from 5% to ~35% and stays there.


In summary, I guess the situation might have improved, but:

  • it's still possible for a single user to easily DoS a server just by foregrounding/backgrounding the app
  • even without going to such extremes (doing rapid presence status changes), CPU usage seems to be about 4-5 times higher when presence is enabled and the server is used normally
  • memory usage is substantially higher (~7x increase) with presence enabled

I have also noticed a significant difference in load on the system when presence is enabled. Lots of load happens if I switch back and forth on the app, same as mentioned above. RAM usage I'm not sure about, but it seems to be higher as well.

One idea I had, because presence being so heavy is bugging me: could presence only be returned if a sync would otherwise succeed, would time out, or a certain time threshold has passed? It doesn't really make sense to always return just one presence update on /sync. In my opinion it should be fine to add up to 30 seconds of delay to a presence update, if that is the only update a user would get. If sync returns earlier, because a message was sent for example, it should be easy enough to also flush the presence updates out with it.

I know matrix.org doesn't use that feature, but I would really like to be able to use presence, without my server and client going up in flames.
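
The idea above of holding back presence-only /sync responses can be written down as a small decision function. This is a hedged sketch, not Synapse code: `should_respond` and its parameters are hypothetical, and the 30-second defaults are just the numbers mentioned in this thread.

```python
def should_respond(
    has_events: bool,
    has_presence: bool,
    waited: float,
    timeout: float = 30.0,
    max_presence_delay: float = 30.0,
) -> bool:
    """Decide whether a long-polling /sync request should return now.

    has_events:   something other than presence is ready (e.g. a message)
    has_presence: at least one presence update is queued
    waited:       seconds this request has been waiting so far
    """
    if has_events:
        # Real content is ready; queued presence is flushed along with it.
        return True
    if waited >= timeout:
        # Normal long-poll timeout: return whatever we have, possibly nothing.
        return True
    if has_presence and waited >= max_presence_delay:
        # Presence alone wakes the request only after the delay has passed.
        return True
    return False
```

For example, `should_respond(False, True, 5.0)` is False (a lone presence update waits), while `should_respond(True, True, 0.0)` is True (presence rides along with the message).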

From personal experience, disabling presence on my homeserver had a significant effect on my overall experience using Matrix. /sync requests on mobile went from taking 10+ seconds (sometimes timing out at 30 seconds) to completing instantaneously. Since presence seems to be the main source of slow Synapse performance, does it make sense to have it be opt-in instead of opt-out? I can't count the number of times I've helped people with their server performance by toggling that setting.

Yeah, disabling presence dramatically decreased the CPU load of our public ru-matrix.org server! Our server has about 10-20 active local users generating about 100 events (messages) per day, so it seems most of the load comes from outgoing presence traffic to external users who are members of large rooms like Matrix HQ. Here are the relevant charts after disabling presence:
[images: 5 charts, not preserved]
Hope they help with analysing the source of the high load.

As a workaround, maybe presence could be made configurable on a per-room basis? That would allow disabling presence in large rooms like Matrix HQ, Riot-*, etc. This would be much better than disabling presence globally!

Another possible workaround: add an option in Synapse to disable sending presence to all federated users, or a whitelist of servers to send presence to, and add rate limits (custom for incoming, default for outgoing) to protect the homeserver from being DDoSed by other homeservers.

I found that after enabling presence everything works well on the first day, but after one or two days the load slowly increases, and after 4-5 days Synapse becomes very slow! So I am going to disable it again.

Here are the relevant charts after enabling presence at 2020-06-11 21:10:00:
[images: 8 charts, not preserved]

> Other possible workaround - make option in Synapse to disable sending presence for all federated users, or whitelist of servers to which send presence

This is something I'd love to see, personally. I want presence on my homeservers and on my friends' but I don't want it at all on the big matrix.org rooms. I would also love to be able to do something like disable presence for any room with more than X members.
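
The two proposals above (a server whitelist, and skipping rooms with more than X members) could look roughly like this when choosing which remote servers get a presence update. This is a sketch under assumptions: `presence_destinations` and its parameters are hypothetical, no such options are known to exist in Synapse from this thread, and room membership is passed in as a plain dict rather than read from the database.

```python
from typing import Dict, Optional, Set


def presence_destinations(
    room_members: Dict[str, Set[str]],
    local_server: str,
    allowed_servers: Optional[Set[str]] = None,
    max_room_size: Optional[int] = None,
) -> Set[str]:
    """Pick the remote servers that should receive a user's presence update.

    room_members maps room ID -> user IDs in the rooms the user shares.
    allowed_servers, if given, is a whitelist of remote server names.
    max_room_size, if given, skips rooms with more members than this.
    """
    destinations: Set[str] = set()
    for members in room_members.values():
        if max_room_size is not None and len(members) > max_room_size:
            continue  # e.g. skip very large public rooms
        for user_id in members:
            # Matrix user IDs look like "@localpart:server.name".
            server = user_id.split(":", 1)[1]
            if server == local_server:
                continue
            if allowed_servers is not None and server not in allowed_servers:
                continue
            destinations.add(server)
    return destinations
```

With `max_room_size=2`, a direct chat with a friend on another server still gets presence, while a room the size of Matrix HQ contributes no destinations at all.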

Also, the presence functions don't guard against concurrent changes, which throws this error:

2020-09-24 13:09:52,385 - synapse.metrics.background_process_metrics - 210 - ERROR - persist_presence_changes-253 - Background process 'persist_presence_changes' threw an exception
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/metrics/background_process_metrics.py", line 205, in run
    result = await result
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/handlers/presence.py", line 350, in _persist_unpersisted_changes
    [self.user_to_current_state[user_id] for user_id in unpersisted]
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/databases/main/presence.py", line 35, in update_presence
    presence_states,
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/database.py", line 541, in runInteraction
    **kwargs
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/database.py", line 590, in runWithConnection
    self._db_pool.runWithConnection(inner_func, *args, **kwargs)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/python/threadpool.py", line 250, in inContext
    result = inContext.theWork()
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/python/threadpool.py", line 266, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 306, in _runWithConnection
    compat.reraise(excValue, excTraceback)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/python/compat.py", line 464, in reraise
    raise exception.with_traceback(traceback)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/twisted/enterprise/adbapi.py", line 297, in _runWithConnection
    result = func(conn, *args, **kw)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/database.py", line 587, in inner_func
    return func(conn, *args, **kwargs)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/database.py", line 429, in new_transaction
    r = func(cursor, *args, **kwargs)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/databases/main/presence.py", line 73, in _update_presence_txn
    txn.execute(sql + clause, [stream_id] + list(args))
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/database.py", line 212, in execute
    self._do_execute(self.txn.execute, sql, *args)
  File "/opt/venvs/matrix-synapse/lib/python3.6/site-packages/synapse/storage/database.py", line 238, in _do_execute
    return func(sql, *args)
psycopg2.errors.SerializationFailure: could not serialize access due to concurrent update
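
For context, PostgreSQL raises this error when a transaction running at a stricter isolation level conflicts with a concurrent update, and the usual remedy is for the application to retry the whole transaction. Below is a generic sketch of such a retry loop. It makes no claim about what Synapse does internally: `run_with_retry` is a hypothetical helper, and a local stand-in exception class is used so the example runs without psycopg2 installed.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SerializationFailure(Exception):
    """Stand-in for psycopg2.errors.SerializationFailure (keeps the sketch self-contained)."""


def run_with_retry(
    txn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `txn`, retrying with jittered exponential backoff on serialization failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return txn()
        except SerializationFailure:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            # Back off a little longer each time, with jitter to avoid lockstep retries.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")
```

The callable passed as `txn` has to redo the complete transaction (begin, statements, commit) on every attempt, since the failed one has been rolled back.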

Yep, the problem is bad on my small dedicated server. I love presence, but the impact is very high for me too, running the Synapse master branch on Python 3.9. I think presence that stays internal to the homeserver could solve some problems. Currently it is a global feature; why isn't it configurable separately for internal use and federation?

We've done a bunch of work in this area over the past few months, so hopefully it's improved a bit. I'm sure there is more work that we can do, but I'm going to close this for now.

@erikjohnston can you please describe those improvements a little, or post links to the PRs?

The changelog has a list of changes if you search for presence, but some PRs are:

#10398
#10163
#10165
#9910
#9916