phoenixframework/phoenix_pubsub_redis

Presence issues

Closed this issue · 10 comments

We discussed this briefly on IRC; I'm opening an issue so we can document the problems properly.

Presence on top of Phoenix.PubSub.Redis has a few issues that don't occur with Phoenix.PubSub.PG2:

  1. Join and leave events are sent N times, where N is the number of servers.
  2. Zombie presences occur (presences that remain after the user has disconnected).

Is it possible the first is causing the second?

To reproduce the duplicated join and leave events, start N=2 nodes running a simple chat app with Presence:

PORT=4000 elixir --sname n0 -S mix phoenix.server
PORT=4001 elixir --sname n1 -S mix phoenix.server

Open the app in a browser and trigger a join event. You should see 2 identical join messages, one per node.

I suspect this first issue may be causing others, so I'm hesitant to speculate about why zombie presences still happen (even with named nodes and homogeneous hardware).

For the zombie presences, here is the Presence tracker state:

%{broadcast_period: 1500, clock_sample_periods: 2, current_sample_count: 1,
  deltas: [%Phoenix.Tracker.State{cloud: #MapSet<[{{:channel@eced99b58fe5,
       1475712441601646}, 1}]>, context: %{}, delta: :unset, mode: :delta,
    pids: nil,
    range: {%{{:channel@eced99b58fe5, 1475712441601646} => 1},
     %{{:channel@a922f17b5f9c, 1475712442528618} => 0,
       {:channel@ac4b42b2277a, 1475712441525883} => 0,
       {:channel@eced99b58fe5, 1475712441601646} => 1,
       {:investigate@e47d34e9b784, 1475711012624954} => 0}},
    replica: {:investigate@e47d34e9b784, 1475711012624954}, replicas: %{},
    values: %{{{:channel@eced99b58fe5, 1475712441601646},
       1} => {#PID<15527.515.0>, "users:hack_demo", "Peter",
       %{online_at: "1475712812", phx_ref: "TiolORFe+vk="}}}},
   %Phoenix.Tracker.State{cloud: #MapSet<[{{:channel@eced99b58fe5,
       1475712441601646}, 1}]>, context: %{}, delta: :unset, mode: :delta,
    pids: nil,
    range: {%{{:channel@eced99b58fe5, 1475712441601646} => 1},
     %{{:channel@a922f17b5f9c, 1475712442528618} => 0,
       {:channel@ac4b42b2277a, 1475712441525883} => 0,
       {:channel@eced99b58fe5, 1475712441601646} => 1,
       {:investigate@e47d34e9b784, 1475711012624954} => 0}},
    replica: {:investigate@e47d34e9b784, 1475711012624954}, replicas: %{},
    values: %{{{:channel@eced99b58fe5, 1475712441601646},
       1} => {#PID<15527.515.0>, "users:hack_demo", "Peter",
       %{online_at: "1475712812", phx_ref: "TiolORFe+vk="}}}},
   %Phoenix.Tracker.State{cloud: #MapSet<[{{:channel@eced99b58fe5,
       1475712441601646}, 1}]>, context: %{}, delta: :unset, mode: :delta,
    pids: nil,
    range: {%{{:channel@eced99b58fe5, 1475712441601646} => 1},
     %{{:channel@a922f17b5f9c, 1475712442528618} => 0,
       {:channel@ac4b42b2277a, 1475712441525883} => 0,
       {:channel@eced99b58fe5, 1475712441601646} => 1,
       {:investigate@e47d34e9b784, 1475711012624954} => 0}},
    replica: {:investigate@e47d34e9b784, 1475711012624954}, replicas: %{},
    values: %{{{:channel@eced99b58fe5, 1475712441601646},
       1} => {#PID<15527.515.0>, "users:hack_demo", "Peter",
       %{online_at: "1475712812", phx_ref: "TiolORFe+vk="}}}}],
  down_period: 30000, log_level: false, max_delta_sizes: [100, 1000, 10000],
  max_silent_periods: 10,
  namespaced_topic: "phx_presence:Elixir.R101Channel.Presence",
  pending_clockset: [], permdown_period: 60000,
  presences: %Phoenix.Tracker.State{cloud: #MapSet<[{{:channel@f25f7d246ff5,
      1475710664658338}, 2187},
    {{:channel@90279305e724, 1475710676205113}, 842},
    {{:channel@f25f7d246ff5, 1475710664658338}, 7904},
    {{:channel@90279305e724, 1475710676205113}, 8847},
    {{:channel@f25f7d246ff5, 1475710664658338}, 5201},
    {{:channel@f25f7d246ff5, 1475710664658338}, 6383},
    {{:channel@f25f7d246ff5, 1475710664658338}, 2713},
    {{:channel@f25f7d246ff5, 1475710664658338}, 7019},
    {{:channel@f25f7d246ff5, 1475710664658338}, 3473},
    {{:channel@90279305e724, 1475710676205113}, 9132},
    {{:channel@90279305e724, 1475710676205113}, 9045},
    {{:channel@f25f7d246ff5, 1475710664658338}, 4593},
    {{:channel@90279305e724, 1475710676205113}, 929},
    {{:channel@f25f7d246ff5, 1475710664658338}, 4915},
    {{:channel@f25f7d246ff5, 1475710664658338}, 6333},
    {{:channel@f25f7d246ff5, 1475710664658338}, 7240},
    {{:channel@f25f7d246ff5, 1475710664658338}, 5938},
    {{:channel@f25f7d246ff5, 1475710664658338}, 3460},
    {{:channel@f25f7d246ff5, 1475710664658338}, 6564},
    {{:channel@f25f7d246ff5, 1475710664658338}, 4027},
    {{:channel@f25f7d246ff5, 1475710664658338}, 2329},
    {{:channel@90279305e724, 1475710676205113}, 940},
    {{:channel@90279305e724, 1475710676205113}, 9106},
    {{:channel@f25f7d246ff5, 1475710664658338}, 4561},
    {{:channel@f25f7d246ff5, 1475710664658338}, 4552},
    {{:channel@f25f7d246ff5, 1475710664658338}, 4161},
    {{:channel@f25f7d246ff5, 1475710664658338}, 2865},
    {{:channel@f25f7d246ff5, 1475710664658338}, 5098},
    {{:channel@f25f7d246ff5, 1475710664658338}, 3272},
    {{:channel@90279305e724, 1475710676205113}, 7900},
    {{:channel@f25f7d246ff5, 1475710664658338}, 7829},
    {{:channel@90279305e724, 1475710676205113}, 7987},
    {{:channel@90279305e724, 1475710676205113}, 700},
    {{:channel@f25f7d246ff5, 1475710664658338}, 5344},
    {{:channel@f25f7d246ff5, ...}, 6179}, {{...}, ...}, {...}, ...]>,
   context: %{{:channel@1230710ca876, 1475709456736068} => 636,
     {:channel@1d67fed32749, 1475709858391608} => 513,
     {:channel@2193f1e81223, 1475709458421283} => 749,
     {:channel@2193f1e81223, 1475709830534507} => 1269,
     {:channel@312cd1ab5871, 1475710199953019} => 211,
     {:channel@44a4265186d9, 1475710424384607} => 199,
     {:channel@4e1f755db374, 1475710339126267} => 198,
     {:channel@4f32a7167c98, 1475709456744647} => 1,
     {:channel@4f32a7167c98, 1475709793434868} => 1403,
     {:channel@4f32a7167c98, 1475710396881144} => 14,
     {:channel@7ea3df244f3a, 1475709876664304} => 582,
     {:channel@90279305e724, 1475711709297191} => 185,
     {:channel@ad8896bae006, 1475710374476862} => 218,
     {:channel@b3683ad2658f, 1475709856857074} => 445,
     {:channel@b3683ad2658f, 1475710179361681} => 3,
     {:channel@c4db3f690254, 1475710375161698} => 233,
     {:channel@c58a4fdba29c, 1475710508256317} => 153,
     {:channel@c60a125a16a0, 1475711674305976} => 20,
     {:channel@c60a125a16a0, 1475711704719175} => 913,
     {:channel@c7c21a1d21a0, 1475710473246353} => 440,
     {:channel@e947e1c3bbd3, 1475710210663183} => 162,
     {:channel@eced99b58fe5, 1475712441601646} => 1,
     {:channel@f9087a39a66c, 1475710258113774} => 153},
   delta: %Phoenix.Tracker.State{cloud: #MapSet<[]>, context: %{},
    delta: :unset, mode: :delta, pids: nil,
    range: {%{}, %{{:investigate@e47d34e9b784, 1475711012624954} => 0}},
    replica: {:investigate@e47d34e9b784, 1475711012624954}, replicas: %{},
    values: %{}}, mode: :normal, pids: 114735, range: {%{}, %{}},
   replica: {:investigate@e47d34e9b784, 1475711012624954},
   replicas: %{{:channel@25464af1e95d, 1475710592789017} => :down,
     {:channel@90279305e724, 1475710676205113} => :down,
     {:channel@90279305e724, 1475711709297191} => :down,
     {:channel@90279305e724, 1475711721205996} => :down,
     {:channel@a922f17b5f9c, 1475712442528618} => :up,
     {:channel@ac4b42b2277a, 1475712441525883} => :up,
     {:channel@c5eb6b7ddf3a, 1475710544610229} => :down,
     {:channel@c60a125a16a0, 1475710628965946} => :down,
     {:channel@c60a125a16a0, 1475711674305976} => :down,
     {:channel@c60a125a16a0, 1475711685946217} => :down,
     {:channel@c60a125a16a0, 1475711704719175} => :down,
     {:channel@c60a125a16a0, 1475711721272809} => :down,
     {:channel@eced99b58fe5, 1475712441601646} => :up,
     {:channel@f25f7d246ff5, 1475710664658338} => :down,
     {:channel@f25f7d246ff5, 1475711694314423} => :down,
     {:investigate@e47d34e9b784, 1475711012624954} => :up}, values: 110638},
  pubsub_server: R101Channel.PubSub,
  replica: %Phoenix.Tracker.Replica{last_heartbeat_at: nil,
   name: :investigate@e47d34e9b784, status: :up, vsn: 1475711012624954},
  replicas: %{channel@a922f17b5f9c: %Phoenix.Tracker.Replica{last_heartbeat_at: 1475712954519,
     name: :channel@a922f17b5f9c, status: :up, vsn: 1475712442528618},
    channel@ac4b42b2277a: %Phoenix.Tracker.Replica{last_heartbeat_at: 1475712953576,
     name: :channel@ac4b42b2277a, status: :up, vsn: 1475712441525883},
    channel@eced99b58fe5: %Phoenix.Tracker.Replica{last_heartbeat_at: 1475712961211,
     name: :channel@eced99b58fe5, status: :up, vsn: 1475712441601646}},
  server_name: R101Channel.Presence, silent_periods: 1,
  tracker: R101Channel.Presence,
  tracker_state: %{node_name: :investigate@e47d34e9b784,
    pubsub_server: R101Channel.PubSub,
    task_sup: R101Channel.Presence.TaskSupervisor}}


iex(investigate@e47d34e9b784)139> R101Channel.Presence.list("users:hack_demo") |> Map.get("5201")
%{metas: [%{online_at: "1475710419", phx_ref: "i4Hm4hinP0g="}]}

There are many presences from {:channel@f25f7d246ff5, 1475710664658338}, a replica that no longer exists (it is marked :down in the replicas map above). User "5201" has been disconnected, but its presence still exists.

The zombie presences seem to occur only when a process runs out of memory (OOM). I'm going to try to reproduce this with PG2.

Also, is there a way to clear out zombie presences manually? Stopping all hosts is less than ideal: all of our infrastructure tooling does rolling restarts, and a rolling restart just propagates the bad state.

@hamiltop a bit late on my part, but is this still an issue for you?

I haven't tried in about a year. I'll give it another shot and see.

@hamiltop that would be fantastic 🙂

Any changes here?

@KamilLelonek Are you seeing this issue?

I used to, back when I was using it. I'm now planning to use it again and wonder whether it will still happen.

#28 mentions that they didn't see any zombie presences, so it may be fixed.

Unfortunately I won't have time to try to reproduce this week, so I can't confirm whether the issue persists. If you are able to confirm it, I can take a closer look at a fix 🙂

Without a way to reproduce this, I’m going to close the issue.