bitwalker/libcluster

possible race condition when ensuring exported functions.


👋

While playing around with libcluster and partisan I keep getting the following error when I kill one of the nodes and then start it again:

21:15:36.313 [error] CRASH REPORT Process <0.311.0> with 0 neighbours crashed with reason: #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"Elixir.PC.list_nodes/0 is undefined!">>} in 'Elixir.Cluster.Strategy':'ensure_exported!'/3 line 156
21:15:36.314 [error] Supervisor 'Elixir.PC.Cluster.Supervisor' had child 'Elixir.Cluster.Strategy.Gossip' started with 'Elixir.Cluster.Strategy.Gossip':start_link([#{'__struct__' => 'Elixir.Cluster.Strategy.State',config => [],connect => {'Elixir.PC',connect_node,...},...}]) at <0.311.0> exit with reason #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"Elixir.PC.list_nodes/0 is undefined!">>} in 'Elixir.Cluster.Strategy':'ensure_exported!'/3 line 156 in context child_terminated
21:15:36.314 [error] Supervisor 'Elixir.PC.Cluster.Supervisor' had child 'Elixir.Cluster.Strategy.Gossip' started with 'Elixir.Cluster.Strategy.Gossip':start_link([#{'__struct__' => 'Elixir.Cluster.Strategy.State',config => [],connect => {'Elixir.PC',connect_node,...},...}]) at <0.311.0> exit with reason reached_max_restart_intensity in context shutdown
21:15:36.314 [error] Supervisor 'Elixir.PC.Supervisor' had child 'Elixir.Cluster.Supervisor' started with 'Elixir.Cluster.Supervisor':start_link([[{example,[{strategy,'Elixir.Cluster.Strategy.Gossip'},{connect,{'Elixir.PC',connect_node,[]}},...]}],...]) at <0.305.0> exit with reason shutdown in context child_terminated
21:15:36.316 [error] gen_server <0.313.0> terminated with reason: #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"Elixir.PC.list_nodes/0 is undefined!">>} in 'Elixir.Cluster.Strategy':'ensure_exported!'/3 line 156

It seems like :libcluster checks that PC.list_nodes/0 is exported before the PC module has been loaded?

#60 and #61 might be related

In my case I am using v3.0 and starting the cluster supervisor.

21:46:51.763 [info] Application partisan started on node bob@mac2

21:46:51.782 [info]  Child Cluster.Strategy.Gossip of Supervisor PC.Cluster.Supervisor started
Pid: #PID<0.314.0>
Start Call: Cluster.Strategy.Gossip.start_link([%Cluster.Strategy.State{config: [], connect: {PC, :connect_node, []}, disconnect: {PC, :disconnect_node, []}, list_nodes: {PC, :list_nodes, []}, meta: nil, topology: :example}])
Restart: :permanent
Shutdown: 5000
Type: :worker

21:46:51.782 [info]  Child Cluster.Supervisor of Supervisor PC.Supervisor started
Pid: #PID<0.313.0>
Start Call: Cluster.Supervisor.start_link([[example: [strategy: Cluster.Strategy.Gossip, connect: {PC, :connect_node, []}, disconnect: {PC, :disconnect_node, []}, list_nodes: {PC, :list_nodes, []}]], [name: PC.Cluster.Supervisor]])
Restart: :permanent
Shutdown: :infinity
Type: :supervisor

21:46:51.783 [info]  Application pc started at :bob@mac2
21:46:51.783 [info] Application pc started on node bob@mac2
Interactive Elixir (1.7.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(bob@mac2)1>
21:46:51.818 [error] GenServer #PID<0.314.0> terminating
** (RuntimeError) Elixir.PC.list_nodes/0 is undefined!
    (libcluster) lib/strategy/strategy.ex:156: Cluster.Strategy.ensure_exported!/3
    (libcluster) lib/strategy/strategy.ex:40: Cluster.Strategy.connect_nodes/4
    (libcluster) lib/strategy/gossip.ex:198: Cluster.Strategy.Gossip.handle_heartbeat/2
    (libcluster) lib/strategy/gossip.ex:140: Cluster.Strategy.Gossip.handle_info/2
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:udp, #Port<0.11>, {192, 168, 1, 38}, 45892, <<104, 101, 97, 114, 116, 98, 101, 97, 116, 58, 58, 131, 116, 0, 0, 0, 1, 100, 0, 4, 110, 111, 100, 101, 100, 0, 10, 97, 108, 105, 99, 101, 64, 109, 97, 99, 50>>}

This is what I get on bob once I start alice and bob and then kill and restart bob.

As mentioned in the PR, we don't want to hit the code server on every invocation of a callback. Instead, I would make sure that you call Code.ensure_loaded?/1 on your callback module in the code which starts the libcluster supervisor. That way it is called only once at boot (and it only ever needs to be called once per boot).
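For illustration, here is a minimal sketch of that suggestion. It assumes the reporter's callback module is PC and reuses the supervisor names and topology shown in the logs above; the PC.Application module and the ensure_loaded!/1 helper are hypothetical names, not part of libcluster. The idea is to load PC explicitly before the cluster supervisor starts, since in interactive mode modules are otherwise loaded lazily on first call, and function_exported?/3 does not load a module on its own.

```elixir
defmodule PC.Application do
  # Hypothetical application module. PC.Supervisor and PC.Cluster.Supervisor
  # are the names that appear in the logs above.
  use Application

  @impl true
  def start(_type, _args) do
    # Load the callback module once, at boot, before any strategy process
    # checks whether its functions are exported.
    :ok = ensure_loaded!(PC)

    # Same topology as in the Cluster.Supervisor start call in the logs.
    topologies = [
      example: [
        strategy: Cluster.Strategy.Gossip,
        connect: {PC, :connect_node, []},
        disconnect: {PC, :disconnect_node, []},
        list_nodes: {PC, :list_nodes, []}
      ]
    ]

    children = [
      {Cluster.Supervisor, [topologies, [name: PC.Cluster.Supervisor]]}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: PC.Supervisor)
  end

  # Hypothetical helper: fail fast at boot if the module cannot be loaded,
  # instead of crashing later inside the strategy process.
  def ensure_loaded!(mod) do
    unless Code.ensure_loaded?(mod) do
      raise ArgumentError, "could not load #{inspect(mod)}"
    end

    :ok
  end
end
```

Raising at boot is a design choice in this sketch: a missing or misnamed callback module then shows up as a startup failure rather than as repeated Gossip strategy restarts.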