Supervisor Strategies in Elixir
One of the things that makes OTP and Elixir unique is the model of supervisor behaviour that applications can take with different processes they start. In this post we will examine each of the three available in Elixir by making a supervised app.
To start, we make a supervised application:
mix new counter --sup
cd counter
Now that we have an app, we are going to create 3 modules.
They will all be GenServers that are started with the application who send themselves a message every second to increment their state by one.
One will always work, one will fail every 6 messages, and one will fail every 20 messages.
To start, it will have the default supervisory strategy of one_for_one
in application.ex
.
This strategy says that if one process dies, its siblings should stay working unaffected.
Note: No matter your supervisory strategy, if the children in your app do not succeed on start_link
and return an {:ok, pid}
tuple, the application as a whole will not start and your supervisory strategy does not matter at all.
We will stick with that in the beginning.
Let's start with the first module in lib/counter/one.ex
.
It will fail if its state is 22.
defmodule Counter.One do
use GenServer
def start_link(_state \\ 0) do
IO.inspect("starting", label: "Counter.One")
success = GenServer.start_link(__MODULE__, 0)
IO.inspect("started", label: "Counter.One")
success
end
@impl true
def init(state) do
work(state)
# Schedule work to be performed on start
schedule_work()
{:ok, state}
end
@impl true
def handle_info(:work, state) do
work(state)
# Reschedule once more
schedule_work()
{:noreply, state + 1}
end
defp schedule_work() do
Process.send_after(self(), :work, 1000)
end
def work(state) do
case state do
22 -> raise "I'm Counter.One and I'm gonna error now"
_ -> IO.inspect("working and my state is #{state}", label: "Counter.One")
end
end
end
Note:
This is slight modification of a great example from the GenServer docs.
Also see this past Elixir School blog post for more on Process.send_after/3
.
Now, if we open lib/counter/application.ex
and add it to children, we can get it to start with our app:
defmodule Counter.Application do
# See https://hexdocs.pm/elixir/Application.html
# for more information on OTP Applications
@moduledoc false
use Application
def start(_type, _args) do
# List all child processes to be supervised
children = [
Counter.One
]
# See https://hexdocs.pm/elixir/Supervisor.html
# for other strategies and supported options
opts = [strategy: :one_for_one, name: Counter.Supervisor]
Supervisor.start_link(children, opts)
end
end
Now if we start the app, we will see it begin to work and fail at 22:
Counter.One: "working and my state is 18"
Counter.One: "working and my state is 19"
Counter.One: "working and my state is 20"
Counter.One: "working and my state is 21"
Counter.One: "starting"
Counter.One: "working and my state is 0"
Counter.One: "started"
18:27:42.566 [error] GenServer #PID<0.119.0> terminating
** (RuntimeError) I'm Counter.One and I'm gonna error now
(one) lib/counter/one.ex:33: Counter.One.work/1
(one) lib/counter/one.ex:21: Counter.One.handle_info/2
(stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:686: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :work
State: 22
Counter.One: "working and my state is 0"
Counter.One: "working and my state is 1"
This fails because we made a specific clause to coerce failure by raising an error when the state reached 22
in our counter.
It is restarted with a state 0 (the default) after this failure.
Now lets make another module that will never fail:
defmodule Counter.Two do
use GenServer
def start_link(_state \\ 0) do
IO.inspect("starting", label: "Counter.Two")
success = GenServer.start_link(__MODULE__, 0)
IO.inspect("started", label: "Counter.Two")
success
end
@impl true
def init(state) do
work(state)
# Schedule work to be performed on start
schedule_work()
{:ok, state}
end
@impl true
def handle_info(:work, state) do
work(state)
# Reschedule once more
schedule_work()
{:noreply, state + 1}
end
defp schedule_work() do
Process.send_after(self(), :work, 1000)
end
def work(state) do
IO.inspect("working and my state is #{state}", label: "Counter.Two")
end
end
We can add it to lib/counter/application.ex
as well.
# ...
def start(_type, _args) do
# List all child processes to be supervised
children = [
Counter.One,
Counter.Two
]
end
# ...
Now, for our third and final module that will fail if state is 5
.
defmodule Counter.Three do
use GenServer
def start_link(_state \\ 0) do
IO.inspect("starting", label: "Counter.Three")
success = GenServer.start_link(__MODULE__, 0)
IO.inspect("started", label: "Counter.Three")
success
end
@impl true
def init(state) do
work(state)
# Schedule work to be performed on start
schedule_work()
{:ok, state}
end
@impl true
def handle_info(:work, state) do
work(state)
# Reschedule once more
schedule_work()
{:noreply, state + 1}
end
defp schedule_work() do
Process.send_after(self(), :work, 1000)
end
def work(state) do
case state do
5 -> raise "I'm Counter.Three and I'm gonna error now"
_ -> IO.inspect("working and my state is #{state}", label: "Counter.Three")
end
end
end
We can add it to lib/counter/application.ex
in the list of children, last after the other two:
# ...
def start(_type, _args) do
# List all child processes to be supervised
children = [
Counter.One,
Counter.Two,
Counter.Three
]
end
# ...
One For One
Now, let's start our application and see the failure behaviour and state for each GenServer. These logs are truncated to just the interesting parts.
Counter.One: "working and my state is 4"
Counter.Two: "working and my state is 4"
Counter.Three: "working and my state is 4"
Counter.Two: "working and my state is 5"
Counter.Three: "working and my state is 5"
Counter.One: "starting"
Counter.One: "working and my state is 0"
Counter.One: "started"
18:11:37.495 [error] GenServer #PID<0.130.0> terminating
** (RuntimeError) I'm Counter.One and I'm gonna error now
(counter) lib/counter/one.ex:33: Counter.One.work/1
(counter) lib/counter/one.ex:21: Counter.One.handle_info/2
(stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:686: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :work
State: 5
Counter.Three: "working and my state is 6"
Counter.Two: "working and my state is 6"
Counter.One: "working and my state is 0"
Counter.Three: "working and my state is 7"
Counter.Two: "working and my state is 7"
Counter.One: "working and my state is 1"
So, we can see our first crash.
The process for Counter.One
failed with our raised error, and was restarted.
Because our default strategy in Elixir is one_for_one
, this is expected.
In the default configuration, we dont want one child processes failure to effect any others.
If we let it continue to 22 with Counter.One
, we would see the same behaviour (allow a crash without impacting any siblings, as its one for one).
Rest For One
Now let's try it with rest_for_one
.
Rest for one as a strategy starts the children in sequence, and if a later child fails the ones before it do too.
We want to change our line assigning opts
in lib/counter/application.ex
to state that.
# ...
children = [
Counter.One,
Counter.Two,
Counter.Three
]
opts = [strategy: :rest_for_one, name: Counter.Supervisor]
# ...
Now, let's start up again. These logs are also truncated to the interesting part
Counter.One: "working and my state is 3"
Counter.Two: "working and my state is 3"
Counter.Three: "working and my state is 3"
Counter.One: "working and my state is 4"
Counter.Two: "working and my state is 4"
Counter.Three: "working and my state is 4"
Counter.One: "working and my state is 5"
Counter.Two: "working and my state is 5"
Counter.Three: "starting"
18:30:56.925 [error] GenServer #PID<0.134.0> terminating
** (RuntimeError) I'm Counter.Three and I'm gonna error now
(counter) lib/counter/three.ex:33: Counter.Three.work/1
(counter) lib/counter/three.ex:21: Counter.Three.handle_info/2
(stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:686: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :work
State: 5
Counter.Three: "working and my state is 0"
Counter.Three: "started"
Counter.One: "working and my state is 6"
Counter.Two: "working and my state is 6"
Counter.Three: "working and my state is 0"
Counter.One: "working and my state is 7"
Counter.Two: "working and my state is 7"
The key takeaway here is order matters.
Because Counter.Three
doesn't fail until 22
is its state, and Counter.One
fails with a state of 5
, Counter.Three
will force a restart of all 3 children since its last, but Counter.One
's failures have no effect on its siblings.
One For All
Now let's enable it with one_for_all
.
In this supervisory model if one child fails, all must be restarted.
To do this lets change lib/counter/application.ex
again.
opts = [strategy: :one_for_all, name: Counter.Supervisor]
If we start our app again with iex -S mix
we can see the behaviour as soon as Counter.Three
reaches a state of 5, but it again will confirm that it works the same again when we reach 22.
Counter.Two: "working and my state is 4"
Counter.One: "working and my state is 5"
Counter.Two: "working and my state is 5"
Counter.One: "starting"
Counter.One: "working and my state is 0"
Counter.One: "started"
Counter.Two: "starting"
Counter.Two: "working and my state is 0"
Counter.Two: "started"
Counter.Three: "starting"
Counter.Three: "working and my state is 0"
Counter.Three: "started"
18:34:56.122 [error] GenServer #PID<0.121.0> terminating
** (RuntimeError) I'm Counter.Three and I'm gonna error now
(counter) lib/counter/three.ex:33: Counter.Three.work/1
(counter) lib/counter/three.ex:21: Counter.Three.handle_info/2
(stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:686: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: :work
State: 5
Counter.One: "working and my state is 0"
Counter.Two: "working and my state is 0"
Counter.Three: "working and my state is 0"
We could also change the order in the children
variable match, and the same thing would happen.
That was a lot to take in, but hopefully the supervisory strategies of Elixir applications are a bit clearer now!