There are a number of corner cases to consider when dealing with Docker, multiple processes, and signals. Probably the most famous post on this matter is from the Phusion blog. Here, we’ll walk through examples that demonstrate these problems first hand, and one way to work around them: fpco/pid1.
The Phusion blog post recommends using their baseimage-docker.
This image provides a my_init entrypoint which handles
the problems described here, as well as introducing some extra OS
features, such as syslog handling. Unfortunately, we ran into
problems with Phusion’s usage of syslog-ng, in particular with it
creating unkillable processes pegged at 100% CPU usage. We’re still
investigating the root cause, but in practice we have found that
the syslog usage is a far less motivating case than simply a good
init process, which is why we’ve created the pid1 Haskell package
together with a simple fpco/pid1 Docker
image.
This blog post is intended to be interactive: you’ll get the
most bang for your buck by opening up your terminal and running
commands along with reading the text. It will be far more
motivating to see your Ctrl-C completely fail to kill
a process.
NOTE The primary reason we wrote our own implementation
in Haskell was to be able to embed it within the Stack build tool. There are other
lightweight init processes already available, such as 
dumb-init. I’ve also blogged about using
dumb-init. While this post uses pid1, there’s
nothing specific to it versus other init processes.
Docker has a concept of entrypoints, which provides a default
wrapping command for commands you provide to docker
run. For example, consider this interaction with Docker:
$ docker run --entrypoint /usr/bin/env ubuntu:16.04 FOO=BAR bash -c 'echo $FOO'
BAR
This works because the above is equivalent to:
$ docker run ubuntu:16.04 /usr/bin/env FOO=BAR bash -c 'echo $FOO'
Entrypoints can be overridden on the command line (as we just
did), but can also be specified in the Dockerfile (which we’ll do
later). The default entrypoint for the ubuntu Docker image is a
null entrypoint, meaning that the provided command will be run
directly without any wrapping. We’re going to simulate that
experience by using /usr/bin/env as an entrypoint,
since switching
entrypoint back to null isn’t yet supported in released Docker.
When you run /usr/bin/env foo bar baz, the
env process will exec the
foo command, making foo the new PID 1,
which for our purposes gives it the same behavior as a null
entrypoint.
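The exec behavior of env can be checked outside Docker as well. The following is a minimal sketch, assuming a Unix system with /usr/bin/env and sh available; the function name is made up for illustration. It starts env, lets it run sh, and compares the PID of the process we spawned with the PID that sh reports for itself.

```python
import subprocess


def pid_seen_by_command():
    """Run `sh -c 'echo $$'` through /usr/bin/env and compare PIDs.

    Returns (pid_of_spawned_env, pid_reported_by_sh).
    """
    proc = subprocess.Popen(
        ["/usr/bin/env", "sh", "-c", "echo $$"],
        stdout=subprocess.PIPE,
        text=True,
    )
    output, _ = proc.communicate()
    return proc.pid, int(output.strip())


if __name__ == "__main__":
    env_pid, sh_pid = pid_seen_by_command()
    # The values match because env replaces itself with sh via exec
    # instead of starting sh as a separate child process.
    print(env_pid, sh_pid, env_pid == sh_pid)
```

Inside a container the same mechanism is what lets the wrapped command inherit PID 1 from env.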
Both the fpco/pid1 and
snoyberg/docker-testing images we’ll use below set
/sbin/pid1 as the default entrypoint. In the example
commands, we’re explicitly including --entrypoint
/sbin/pid1. This is just to be clear on which entrypoint is
being used; if you exclude that option, the same behavior will
persist.
We’ll start with our 
sigterm.hs program, which runs ps (we’ll see why
soon), then sends itself a SIGTERM and then loops
forever. On a Unix system, the default process behavior when
receiving a SIGTERM is to exit. Therefore, we’d expect
that our process will just exit when run. Let’s see:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing sigterm
  PID TTY          TIME CMD
    1 ?        00:00:00 sigterm
    9 ?        00:00:00 ps
Still alive!
Still alive!
Still alive!
^C
$
The process ignored the SIGTERM and kept running,
until I hit Ctrl-C (we’ll see what that does later). Another
feature in the sigterm code base, though, is that if you give it
the command line argument install-handler, it will
explicitly install a SIGTERM handler which will kill the process.
Perhaps surprisingly, this has a significant impact on our
application:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing sigterm install-handler
  PID TTY          TIME CMD
    1 ?        00:00:00 sigterm
    8 ?        00:00:00 ps
Still alive!
$
The reason for this is some Linux kernel magic: the kernel
treats a process with PID 1 specially, and does not, by
default, kill the process when receiving the SIGTERM
or SIGINT signals. This can be very surprising
behavior. For a simpler example, try running the following commands
in two different terminals:
$ docker run --rm --name sleeper ubuntu:16.04 sleep 100
$ docker kill -s TERM sleeper
Notice how the docker run command does not exit,
and if you check your ps aux output, you’ll see that
the process is still running. That’s because the sleep
process was not designed to be PID 1, and does not install a
special signal handler. To work around this problem, you’ve got two
choices:
1. Ensure that every command you run with docker run has
explicit handling of SIGTERM.
2. Make sure the command you run is not PID 1, and instead use a PID 1 process that handles SIGTERM
correctly.
Let’s see how the sigterm program works with our
/sbin/pid1 entrypoint:
$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing sigterm
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
    8 ?        00:00:00 sigterm
   12 ?        00:00:00 ps
The program exits immediately, as we’d like. But look at the
ps output: our first process is now pid1
instead of sigterm. Since sigterm is being
launched as a different PID (8 in this case), the special casing
from the Linux kernel does not come into play, and default
SIGTERM handling is active. To step through exactly
what happens in our case:
1. The container is created, and the command /sbin/pid1
sigterm is run inside of it.
2. pid1 starts as PID 1, does its business, and then
fork/execs the sigterm
executable.
3. sigterm raises the SIGTERM signal to
itself, causing it to die.
4. pid1 sees that its child died from SIGTERM (==
signal 15) and exits with exit code 143 (== 128 + 15).
This isn’t just some magic with sigterm; you can do
the same thing with sleep:
$ docker run --rm --name sleeper fpco/pid1 sleep 100
$ docker kill -s TERM sleeper
Unlike with the ubuntu image, this will kill the
container immediately, due to the /sbin/pid1
entrypoint used by fpco/pid1.
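The exit code arithmetic from the walkthrough above (128 plus the signal number) can be reproduced without Docker. This is a rough sketch, not pid1’s actual code: the child below is a hypothetical stand-in for sigterm that sends itself SIGTERM, and the helper names are invented for illustration.

```python
import signal
import subprocess
import sys


def exit_code_for(returncode):
    """Map a subprocess return code to a shell-style exit code.

    subprocess reports death by signal as a negative number (-15 for
    SIGTERM); the convention used above is 128 + signal number.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def run_self_terminating_child():
    """Start a child that sends SIGTERM to itself and return its return code."""
    # Hypothetical stand-in for the sigterm executable.
    child_code = (
        "import os, signal, time; "
        "os.kill(os.getpid(), signal.SIGTERM); "
        "time.sleep(5)"
    )
    return subprocess.run([sys.executable, "-c", child_code]).returncode


if __name__ == "__main__":
    print(exit_code_for(run_self_terminating_child()))
```

Because the child is not PID 1, the default SIGTERM behavior applies and it dies from the signal, which the parent then translates into an exit code.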
NOTE In the case of sigterm, which sends the
TERM signal to itself, it turns out you don’t need a special PID1
process with signal handling, anything will do. For example, try
docker run --rm --entrypoint /usr/bin/env
snoyberg/docker-testing /bin/bash -c "sigterm;echo bye". But
playing with sleep will demonstrate the need for a
real signal-aware PID1 process.
There’s a slight difference between sigterm and
sleep when it comes to the behavior of hitting
Ctrl-C. When you use Ctrl-C, it sends a
SIGINT to the docker run process, which
proxies that signal to the process inside the container.
sleep will ignore it, just as it ignores
SIGTERM, due to the default signal handlers for PID1
in the Linux kernel. However, the sigterm executable
is written in Haskell, and the Haskell runtime itself
installs a signal handler that converts SIGINT into a
user interrupt exception, overriding the PID1 default behavior. For
more on signal proxying, see the docker
attach documentation.
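The difference comes down to whether the process has installed its own handler: the kernel only drops signals for PID 1 when they are still at their default disposition. As a rough illustration in Python rather than Haskell (the function names are hypothetical), the sketch below installs an explicit SIGTERM handler and then delivers SIGTERM to the current process, which survives and records the signal.

```python
import os
import signal
import time


def install_term_handler():
    """Install a SIGTERM handler and return the list it records into."""
    received = []

    def on_term(signum, frame):
        # A real service would begin a clean shutdown here.
        received.append(signum)

    signal.signal(signal.SIGTERM, on_term)
    return received


def send_self_sigterm(received, timeout=1.0):
    """Send SIGTERM to this process and wait briefly for the handler to run."""
    os.kill(os.getpid(), signal.SIGTERM)
    deadline = time.monotonic() + timeout
    while not received and time.monotonic() < deadline:
        time.sleep(0.01)
    return list(received)


if __name__ == "__main__":
    log = install_term_handler()
    print(send_self_sigterm(log))
```

A program that does this can run as PID 1 and still react to docker kill -s TERM, at the cost of every such program having to carry its own handler.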
Suppose you have process A, which
fork/execs process B. When process B
dies, process A must call waitpid to get its exit
status from the kernel, and until it does so, process B will be
dead but with an entry in the system process table. This is known
as being a zombie.
But what happens if process B outlives process A? In this case, process B is known as an orphan, and needs to be adopted by the init process, aka PID1. It is the init process’s job to reap orphans so they do not remain as zombies.
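The zombie state described above can be observed directly. The following is a minimal, Linux-only sketch (it reads /proc, and the helper names are made up for illustration): the parent forks a child that exits at once, checks the child’s state before calling waitpid, and checks again afterwards.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The state letter follows the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def make_and_reap_zombie(timeout=2.0):
    """Fork a child that exits at once; report its state around waitpid."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately
    # Parent: poll until the exited child shows up as a zombie ("Z").
    deadline = time.monotonic() + timeout
    while process_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = process_state(pid)
    os.waitpid(pid, 0)  # reap: removes the process table entry
    after = process_state(pid)
    return before, after


if __name__ == "__main__":
    print(make_and_reap_zombie())
```

The "Z" state is what ps displays as <defunct>; once waitpid has run, the entry is gone.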
The orphans.hs program will:
- Spawn a child process, and then loop forever calling ps
- In the child process: run the echo command a few
times, without calling waitpid, and then exit
As you can see, none of the processes involved will reap the
zombie echo processes. The output from the process
confirms that we have, in fact, created zombies:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing orphans
1
2
3
4
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 orphans
    8 ?        00:00:00 orphans
   13 ?        00:00:00 echo <defunct>
   14 ?        00:00:00 echo <defunct>
   15 ?        00:00:00 echo <defunct>
   16 ?        00:00:00 echo <defunct>
   17 ?        00:00:00 ps
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 orphans
   13 ?        00:00:00 echo <defunct>
   14 ?        00:00:00 echo <defunct>
   15 ?        00:00:00 echo <defunct>
   16 ?        00:00:00 echo <defunct>
   18 ?        00:00:00 ps
Still alive!
And so on until we kill the container. That
<defunct> indicates a zombie process. The issue
is that our PID 1, orphans, doesn’t do reaping. As you probably
guessed, we can solve this by just using the
/sbin/pid1 entrypoint:
$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing orphans
1
2
3
4
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
   10 ?        00:00:00 orphans
   14 ?        00:00:00 orphans
   19 ?        00:00:00 echo <defunct>
   20 ?        00:00:00 echo <defunct>
   21 ?        00:00:00 echo <defunct>
   22 ?        00:00:00 echo <defunct>
   23 ?        00:00:00 ps
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
   10 ?        00:00:00 orphans
   24 ?        00:00:00 ps
Still alive!
pid1 now adopts the echo processes
when the child orphans process dies, and reaps
accordingly.
Let’s try out something else: process A is the primary command for the Docker container, and it spawns process B. Process A then exits while process B is still running, causing the Docker container to exit. In this case, the running process B will be forcibly closed by the kernel (see this Stack Overflow question for details). We can see this with our surviving.hs program:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing surviving
Parent sleeping
Child: 1
Child: 2
Child: 4
Child: 3
Child: 1
Child: 2
Child: 3
Child: 4
Parent exiting
Unfortunately this doesn’t give our child processes a chance to
do any cleanup. Instead, we would rather send them a
SIGTERM, and after a grace period send them a
SIGKILL. This is exactly what pid1
does:
$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing surviving
Parent sleeping
Child: 2
Child: 3
Child: 1
Child: 4
Child: 2
Child: 1
Child: 4
Child: 3
Parent exiting
Got a TERM
Got a TERM
Got a TERM
Got a TERM
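The "SIGTERM first, SIGKILL after a grace period" pattern can be sketched for a single child process as follows. This is an illustration of the general idea under stated assumptions, not pid1’s implementation: the two-second default grace period, the function names, and the test child that optionally ignores SIGTERM are all hypothetical.

```python
import signal
import subprocess
import sys


def stop_with_grace(proc, grace_seconds=2.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the name of the signal that ended the process, or its exit code.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name
    return f"exit {proc.returncode}"


def spawn(ignore_term):
    """Start a hypothetical child that optionally ignores SIGTERM."""
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN);" if ignore_term else ""
    code = f"import signal, time; {setup} print('ready', flush=True); time.sleep(30)"
    proc = subprocess.Popen(
        [sys.executable, "-c", code], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has finished its setup
    return proc


if __name__ == "__main__":
    print(stop_with_grace(spawn(ignore_term=False)))
    print(stop_with_grace(spawn(ignore_term=True), grace_seconds=0.5))
```

A well-behaved child gets the chance to clean up on SIGTERM, while one that ignores the request is still guaranteed to stop once the grace period runs out.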
docker run vs PID1
When you run sleep 60 and then hit Ctrl-C, the
sleep process itself receives a SIGINT.
When you instead run docker run --rm fpco/pid1 sleep
60 and hit Ctrl-C, you may think that the same thing is
happening. However, in reality, it’s not at all the same. Your
docker run call creates a docker run
process, which sends a command to the Docker daemon on your
machine, and that daemon creates the actual sleep
process (inside a container). When you hit Ctrl-C on your terminal,
you’re sending SIGINT to docker run,
which is in fact sending a command to the Docker daemon, which in
turn sends a SIGINT to your sleep
process.
Want proof? Try out the following:
$ docker run --rm fpco/pid1 sleep 60&
[1] 417
$ kill -KILL $!
$ docker ps
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS              PORTS               NAMES
69fbc70e95e2        fpco/pid1                   "/sbin/pid1 sleep 60"    11 seconds ago      Up 11 seconds                           hopeful_mayer
[1]+  Killed                  docker run --rm fpco/pid1 sleep 60
In this case, we sent a SIGKILL to the docker
run command. Unlike SIGINT or
SIGTERM, SIGKILL cannot be handled,
and therefore docker run is unable to delegate signal
handling to a different process. As a result, the docker
run command itself dies, but the sleep process
(and its container) continue running.
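That SIGKILL cannot be handled is enforced by the kernel, and any process can confirm it. The short sketch below (the helper name is made up for illustration) tries to install a do-nothing handler for a given signal and reports whether the operating system accepted it.

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except OSError:
        # The kernel rejects handlers for SIGKILL (and SIGSTOP).
        return False
    # Restore whatever was in place before the probe.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        print(sig.name, can_install_handler(sig))
```

This is why docker run has no opportunity to forward a SIGKILL: it is terminated before any of its own code can react.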
Some takeaways from this:
- Use an init process such as pid1 so that your
SIGINT or SIGTERM to the docker
run process actually gets your container to reliably shut
down
- If you need to send a SIGKILL to your process,
use the docker kill command instead
We’ve used --entrypoint /sbin/pid1 a lot here. In
fact, each usage of that has been superfluous, since the
fpco/pid1 and snoyberg/docker-testing
images both use /sbin/pid1 as their default entrypoint
anyway. I included it for explicitness. To prove it to you:
$ docker run --rm fpco/pid1 sleep 60
^C$
But if you don’t want to muck with entrypoints, you can always
just include /sbin/pid1 at the beginning of your
command, e.g.:
$ docker run --rm --entrypoint /usr/bin/env fpco/pid1 /sbin/pid1 sleep 60
^C$
And if you have your own Docker image and you’d just like to
include the pid1 executable, you can download it from
the GitHub releases
page.
You may be tempted to put something like ENTRYPOINT
/sbin/pid1 in your Dockerfile. Let’s see why that won’t
work:
$ cat Dockerfile
FROM fpco/pid1
ENTRYPOINT /sbin/pid1
$ docker build --tag test .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM fpco/pid1
 ---> aef1f7b702b9
Step 2 : ENTRYPOINT /sbin/pid1
 ---> Using cache
 ---> f875b43a9e40
Successfully built f875b43a9e40
$ docker run --rm test ps
pid1: No arguments provided
The issue here is that we specified /sbin/pid1 in what Docker
calls shell form. This is just a raw string which is
interpreted by the shell. It cannot be passed an additional
command (like ps), and therefore pid1
itself complains that it hasn’t been told what to run. The correct
way to specify your entrypoint is the exec form, ENTRYPOINT
["/sbin/pid1"], e.g.:
$ cat Dockerfile
FROM fpco/pid1
ENTRYPOINT ["/sbin/pid1"]
$ docker build --tag test .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM fpco/pid1
 ---> aef1f7b702b9
Step 2 : ENTRYPOINT /sbin/pid1
 ---> Running in ba0fa8c5bd41
 ---> 4835dec4aae6
Removing intermediate container ba0fa8c5bd41
Successfully built 4835dec4aae6
$ docker run --rm test ps
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
    8 ?        00:00:00 ps
Generally speaking, you should stick with exec form (the JSON array syntax) in your Dockerfiles at all times. It is explicit about whitespace handling, and avoids the need to use a shell as an interpreter.
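Why the extra command gets lost can be reproduced without Docker. Docker’s documentation describes a string-style (shell form) entrypoint as being run via /bin/sh -c, while the JSON array (exec form) is run directly. The sketch below mimics both, using echo as a hypothetical stand-in for /sbin/pid1 and appending ps the way docker run appends its command; the names are invented for illustration.

```python
import subprocess


def run_entrypoint(entrypoint, extra_args):
    """Run entrypoint + extra_args as one argv and return its stdout."""
    result = subprocess.run(
        entrypoint + extra_args, capture_output=True, text=True
    )
    return result.stdout


# Exec-form analogue: ENTRYPOINT ["echo"], extra arguments are appended.
EXEC_FORM = ["echo"]
# Shell-form analogue: ENTRYPOINT echo, wrapped as /bin/sh -c "echo".
SHELL_FORM = ["/bin/sh", "-c", "echo"]

if __name__ == "__main__":
    # echo receives "ps" as its argument and prints it.
    print(repr(run_entrypoint(EXEC_FORM, ["ps"])))
    # "ps" becomes $0 of the shell; echo itself receives nothing.
    print(repr(run_entrypoint(SHELL_FORM, ["ps"])))
```

In the second case the appended argument is consumed by sh as a positional parameter, which matches pid1’s complaint that it was given no arguments.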
The main takeaway here is: unless you have a good reason to do
otherwise, you should use a minimal init process like
pid1. The Phusion/my_init approach works, but may be
too heavyweight for some. If you don’t need syslog and other
add-on features of Phusion, you’re probably best with a minimal
init instead.
As a separate but somewhat related comment: we’re going to have
a follow up post on this blog in the coming days explaining how we
compiled the pid1 executable as a static executable to
make it compatible with all various Linux flavors, and how you can
do the same for your Haskell executables. Stay tuned!
  