Let It Crash: Creating an Example Supervisor in Elixir Using OTP
This is a post about what supervisors do (part 1), and how to build your own example supervisor that can restart dying processes (part 2).
Part 0: New Mix Project
To follow along and run the examples in iex, create a new project to hold the code for our worker and supervisor:
mix new my_supervisor
Part 1: What do supervisors do?
To help understand what supervisors do, and why you might want to use them, let’s take a look at some silly, unsupervised Elixir code that receives messages and parses those messages as integers:
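A minimal sketch of what such a Parser could look like. The function name loop/0 and the printed text are assumptions made for illustration; the only things taken from the post are the module name Parser, the start_link function, and the fact that it returns a pid:

```elixir
defmodule Parser do
  # Spawn the parsing loop in a new process linked to the caller
  # and return its pid.
  def start_link do
    spawn_link(__MODULE__, :loop, [])
  end

  # Wait for a message, parse it as an integer, print it, and repeat.
  # String.to_integer/1 raises on input such as "abc", which crashes
  # this process (and, through the link, signals the caller).
  def loop do
    receive do
      value ->
        IO.puts("Parsed: #{String.to_integer(value)}")
    end

    loop()
  end
end
```

Nothing here rescues the error on purpose: the point of the example is to have a process that dies easily.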
Note: we call our function start_link because that is the function name OTP supervisors call by default to start a child.
Let’s try out our Parser in iex by running iex -S mix to load our code into the console:
That’s no good: not only did that message kill our Parser process, it also killed our iex process, because the two processes were linked together!
Let’s try that again. This time we will prove that the two processes were linked together by using Process.info/2 to examine the :links of each process. We will also use Process.flag/2 to turn our normal iex process into a system process that traps exits, so that exit signals arrive as messages it can handle:
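The same steps can be sketched as a standalone script. This is an illustration under a few assumptions: an anonymous function stands in for the Parser so the snippet runs on its own, and a receive block takes the place of flush/0, which only exists in iex:

```elixir
# Trap exits: exit signals from linked processes now arrive as
# ordinary messages instead of killing the current process.
Process.flag(:trap_exit, true)

# Stand-in for the Parser: a linked process that crashes on
# non-numeric input.
pid =
  spawn_link(fn ->
    receive do
      value -> String.to_integer(value)
    end
  end)

# Each side of the link shows up in the other's :links list.
{:links, my_links} = Process.info(self(), :links)
{:links, parser_links} = Process.info(pid, :links)
IO.inspect(pid in my_links, label: "we are linked to the parser")
IO.inspect(self() in parser_links, label: "the parser is linked to us")

# Crash the linked process.
send(pid, "not a number")

# Scripted equivalent of flush/0: take the exit message out of the mailbox.
exit_message =
  receive do
    {:EXIT, ^pid, reason} -> {:exited, reason}
  after
    1_000 -> :timeout
  end

IO.inspect(exit_message, label: "exit message")
```

Because the current process traps exits, it survives the crash and can inspect the reason the linked process died.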
Here is a screenshot of my terminal with the same thing (it is nicer to look at with color):
Since we called Process.flag/2 and turned our iex process into a system process that can handle :EXIT messages, we were able to call flush/0 and inspect the message that the Parser process sent back to us! The message contains information about the process and why it exited… Wouldn’t it be nice if there was another process that could take action on these kinds of exit messages? 🤔🤔🤔
Part 2: Building Our Own Supervisor
In Part 1 we saw that our Parser process can be killed by trying to parse an integer from an invalid string, but processes can also die for all sorts of other reasons (database connections, http timeouts, etc). This is why supervising processes is so important: it’s really tough to handle all of the things that can go wrong, which is why in Elixir you often hear the phrase:
Let it crash!
People use this phrase because they have supervisors to handle exit messages from their dying processes. The supervisors can then decide what action to take to bring the system back to a stable state. The action that our supervisor will take will be simply to restart any process that fails. This is known as the :one_for_one restart strategy.
We will need some processes to test our supervisor with. If you skipped Part 1, go back and copy the Parser into a file called lib/Parser.ex; this will be the worker that our supervisor will be looking after.
We won’t be making a fully featured supervisor here; ours will just start processes, and restart those same processes if they exit. First we will give the supervisor the functionality to start a list of processes:
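A minimal sketch of this first version, built on GenServer. The shape of a child specification is an assumption here: a {module, function, args} tuple whose function returns the pid of a process it has already linked to the caller (as a start_link built on spawn_link does):

```elixir
defmodule MySupervisor do
  use GenServer

  # Client API

  # child_specs is a list of {module, function, args} tuples (assumed shape).
  def start_link(child_specs) do
    GenServer.start_link(__MODULE__, child_specs)
  end

  # Returns the supervisor state: a map of pid => child specification.
  def list_processes(supervisor) do
    GenServer.call(supervisor, :list)
  end

  # Server callbacks

  @impl true
  def init(child_specs) do
    # Trap exits so that a dying child sends us a message
    # instead of taking the supervisor down with it.
    Process.flag(:trap_exit, true)

    state =
      child_specs
      |> Enum.map(&start_child/1)
      |> Map.new()

    {:ok, state}
  end

  @impl true
  def handle_call(:list, _from, state) do
    {:reply, state, state}
  end

  # Runs inside the supervisor process, so a start function that uses
  # spawn_link links the child to the supervisor. No error handling.
  defp start_child({mod, fun, args} = spec) do
    pid = apply(mod, fun, args)
    {pid, spec}
  end
end
```

With the Parser from Part 1, a specification would look like {Parser, :start_link, []}.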
The supervisor will be started by calling its start_link/1 function, which will take a list of child specifications and hand them to init/1. The init/1 function starts by setting the :trap_exit flag to true, which, as we saw in Part 1, will allow it to handle dying child processes without also crashing. The call to start_child/1 on each of the child specifications starts the child process and links it to the supervisor process.
Note: We are omitting error handling to keep the example short. In reality we would want to check that the pid returned from apply in start_child/1 is actually a process ID by calling the is_pid/1 function.
The state variable initialized by init/1 is just a Map where the keys are process IDs and the values are the specifications (we are storing those for when we need to restart the process). To gain access to them, we added list_processes/1, which makes a call for the state.
Let’s test what we have so far:
Cool, our supervisor started the processes and linked to them! The third process returned from Process.info/2 is the iex process that started the supervisor (prove this to yourself by running self() and comparing the pid).
When we call MySupervisor.start_link/1, why do we need to pass specifications? Why can’t we just pass in a list of pids for the supervisor to look after? Because the supervisor needs to know how to restart a process when it fails. That is why our state is a map where the keys are pids and the values are instructions for restarting that process. We will use that later on.
Let’s try sending some messages:
Since our supervisor called Process.flag(:trap_exit, true) in init/1, it was not brought down when the Parser process crashed (which is good), but the supervisor didn't do anything to correct the situation (which is bad).
As we saw in Part 1, the Parser process sent a message to its parent when it exited. The message looked something like this (but the reason value was much bigger):
{:EXIT, #PID<0.124.0>, reason}
That parent used to be the iex process, and we were able to view the message by calling flush/0, so where did the message end up this time?
In GenServers, any message that isn’t sent via GenServer.call or GenServer.cast (and handled by handle_call/3 or handle_cast/2 respectively) ends up being handled by a different GenServer callback: handle_info/2.
Let’s add a handle_info/2 that can handle these exit messages (from our research in Part 1, we already know what the messages look like and how to match them):
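One way this could look, as a sketch rather than a definitive implementation. The rest of the module is included so the example runs on its own, and it keeps the assumption that a child specification is a {module, function, args} tuple whose function returns a linked pid:

```elixir
defmodule MySupervisor do
  use GenServer

  def start_link(child_specs) do
    GenServer.start_link(__MODULE__, child_specs)
  end

  def list_processes(supervisor) do
    GenServer.call(supervisor, :list)
  end

  @impl true
  def init(child_specs) do
    Process.flag(:trap_exit, true)
    {:ok, child_specs |> Enum.map(&start_child/1) |> Map.new()}
  end

  @impl true
  def handle_call(:list, _from, state) do
    {:reply, state, state}
  end

  # Exit messages from linked processes are plain messages,
  # so they are delivered to handle_info/2.
  @impl true
  def handle_info({:EXIT, dead_pid, _reason}, state) do
    case Map.pop(state, dead_pid) do
      {nil, state} ->
        # Not one of our children; nothing to restart.
        {:noreply, state}

      {spec, state} ->
        # Forget the dead pid and start a replacement from the same spec.
        {new_pid, spec} = start_child(spec)
        {:noreply, Map.put(state, new_pid, spec)}
    end
  end

  defp start_child({mod, fun, args} = spec) do
    pid = apply(mod, fun, args)
    {pid, spec}
  end
end
```

Note that this sketch restarts a child no matter why it exited and never gives up; it has no notion of restart limits.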
Time to test our supervisor:
It works 😀! We can see from iex(2) that we started two processes with pids #PID<0.113.0> and #PID<0.114.0>. We then sent an invalid message to the first one in iex(4), which killed that Parser process. When we re-asked the supervisor for its list of processes in iex(5), #PID<0.113.0> was no longer in the list and had been replaced with #PID<0.118.0>. We then sent a valid message to the new process to verify that it was working. Nice!
Conclusion
What we made is a very simple supervisor that uses the equivalent of the Supervisor :one_for_one restart strategy: when a process exits, we start a new process to take its place.
Question: Should I make my own supervisor for production systems?
Answer: No! The built-in Supervisor is more robust than anything we can build in a blog post. This is just a simple example to learn about how supervisors work.
Examples and documentation for the real Elixir Supervisor can be found here
Github project that contains all of the code snippets can be found here