bitwalker/libcluster

Q: How to run locally via Epmd strategy?

elvanja opened this issue · 4 comments

Hi,

I've been trying to run locally via Epmd strategy but can't get it to work.
This is how I start two local nodes:

➜  HTTP_PORT=4001 iex --cookie test --name test1@127.0.0.1 -S mix phx.server
➜  HTTP_PORT=4002 iex --cookie test --name test2@127.0.0.1 -S mix phx.server

Inspecting the epmd service, it is started and gives:

➜  epmd -names
epmd: up and running on port 4369 with data:
name test2 at port 52487
name test1 at port 52477

And this is the topology configuration:

config :libcluster,
  topologies: [
    local: [
      strategy: Cluster.Strategy.Epmd,
      config: [:hosts, [:"test1@127.0.0.1", :"test2@127.0.0.1"]]
    ]
  ]

The cluster supervisor is the first one in application child specification:

topologies = Confex.fetch_env!(:libcluster, :topologies)
children = [
  {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
  ...
]

Still, Node.list() returns an empty list on each node. Node.connect/1 works normally.
I also tried various iterations on how I start the nodes and put them in config, e.g.:

  • --name test1@localhost when starting
  • --sname test1 when starting
  • :test1 in config
  • :test1@localhost in config

But none of those worked. I also tried making the connect, disconnect and list_nodes configuration options explicit, copying them from the project's README, also without success.
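When trying name variants like the ones above, it can help to check which name each node actually registered under, since that is the name other nodes must use to reach it. The following is a minimal sketch using only standard library calls; the variable names are illustrative, and it is meant to be pasted into each IEx session.

```elixir
# Paste into each IEx session after starting it with --sname or --name.

# The full name this node registered under
# (:nonode@nohost if the node is not running in distributed mode).
this_node = node()
IO.inspect(this_node, label: "node()")

# Split "name@host" into its two parts. The host part is the piece that
# has to match the entries in the topology's hosts list.
[name, host] = this_node |> Atom.to_string() |> String.split("@", parts: 2)
IO.inspect({name, host}, label: "name and host")

# The hostname as reported by the Erlang runtime (returned as a charlist).
{:ok, short_host} = :inet.gethostname()
IO.inspect(to_string(short_host), label: "hostname")
```

Comparing the printed node name with the atoms in the hosts list shows quickly whether the two agree.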

Am I missing something here? Maybe you could share parts of your local setup (I see in #10 that you use that same strategy locally)?

Also, note that the gossip strategy works like a charm. My problem is that I'd like to simulate network split conditions, and with the gossip strategy nodes reconnect automatically, so I thought Epmd would be a better choice. Maybe that assumption is wrong and nodes would reconnect automatically with it too, but I couldn't get it working to verify. If you have an idea on how to simulate a network split, that would be cool too.

That's it, thank you!

P.S. I'm on latest release of libcluster, and using:

  • elixir 1.9.1-otp-22
  • erlang 22.0.3

Update: managed to accomplish the main goal of disconnecting/reconnecting nodes as desired via gossip protocol, like this:

# to disconnect and keep disconnected
Supervisor.terminate_child(Abbr.ClusterSupervisor, :local)
Cluster.Strategy.disconnect_nodes(:local, {:erlang, :disconnect_node, []}, {:erlang, :nodes, [:connected]}, Node.list())

# to reconnect
Supervisor.restart_child(Abbr.ClusterSupervisor, :local)

So as far as that goal goes, you can close this issue/question. That being said, I'd still like to know how to make the Epmd idea work locally.

What do you see in your logs with the EPMD strategy? If there are nodes in the node list when the strategy starts, I would expect to see some indication that a connection succeeded, was ignored, or failed. The EPMD strategy only attempts to connect to the nodes listed under hosts when it first starts; it doesn't keep trying. Still, the last node to start will ensure that all the other nodes get connected, since Erlang node connections form a mesh.

It would also help to know whether you are starting them at the same time, or starting one node and then the other. On the surface, though, I don't see anything that catches my eye as incorrect. The logs will help clarify the situation (make sure you have debug logging on).
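To illustrate what "connect once at startup, no retries" means in practice, here is a rough sketch using only Node.connect/1 from the standard library. It is not libcluster's implementation, only an approximation of the behaviour described above; the host names are the ones from the original question.

```elixir
# Illustration only: NOT libcluster's implementation.
hosts = [:"test1@127.0.0.1", :"test2@127.0.0.1"]

# Try each host exactly once and record the outcome.
# Node.connect/1 returns true on success, false on failure, and
# :ignored when the local node is not running in distributed mode.
results =
  hosts
  |> Enum.reject(&(&1 == node()))
  |> Map.new(fn host -> {host, Node.connect(host)} end)

IO.inspect(results, label: "one-shot connect results")

# Nothing here retries, so a host that was down at this moment stays
# disconnected until some node connects to it later.
IO.inspect(Node.list(), label: "connected nodes")
```

With this pattern, whichever node starts last is the one whose attempts can succeed, which is why start order matters when reading the logs.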

With debugging turned on via config:

config :libcluster,
  debug: true,
  topologies: [...

There's no trace of any logs from libcluster. I tried both development and production modes for the app (and didn't forget to set the log level to debug for the production run). I was careful to start one instance, and then the other only after the first had started completely. epmd -names always sees both instances, but they don't connect automatically. If I call Node.connect/1 from the REPL, it works as expected. In both dev and production modes the start params included --cookie test --name test1@127.0.0.1. If it helps, I'm also using Phoenix.

Update: managed to get it to work. In the end it was all about the instance name and its representation in the topology configuration. Instead of config: [:hosts, [:"test1@127.0.0.1", :"test2@127.0.0.1"]], it needed to be config: [hosts: [:test1@my_computer_name, :test2@my_computer_name]]. Then it all worked as expected. 🤦‍♂ Anyway, thank you for your help!
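Note that the two configurations in the update differ in shape as well as in host names: [:hosts, [...]] is a plain two-element list, while [hosts: [...]] is a keyword list. The sketch below shows how a keyword lookup treats each shape. It assumes the strategy reads its hosts with something like Keyword.get/3, which is an assumption about libcluster's internals and is not confirmed in this thread.

```elixir
hosts = [:"test1@127.0.0.1", :"test2@127.0.0.1"]

# Shape from the original configuration: a plain two-element list
# (an atom followed by a list), not a keyword list.
plain_list = [:hosts, hosts]

# Shape from the working configuration: a keyword list, which is
# shorthand for [{:hosts, hosts}].
keyword_list = [hosts: hosts]

IO.inspect(Keyword.keyword?(plain_list), label: "plain list is a keyword list?")
IO.inspect(Keyword.keyword?(keyword_list), label: "keyword list is a keyword list?")

# A keyword lookup finds no :hosts entry in the plain list and falls
# back to the default, while the keyword list returns the hosts.
IO.inspect(Keyword.get(plain_list, :hosts, []), label: "hosts from plain list")
IO.inspect(Keyword.get(keyword_list, :hosts, []), label: "hosts from keyword list")
```

If the lookup does work this way, the original configuration would have presented an empty host list to the strategy, which would be consistent with the absence of any connection logs.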