This is a story about how some bad API design on my part caused some ugly race conditions that were very tricky to break down. I’m writing this story as a word of warning to others! The code itself was written in Haskell, but the lessons apply to anyone working with Unix-style processes.
typed-process

I maintain both the process library in Haskell,
which is the standard way of launching child processes, as well as
the typed-process library, which explores some
refinements to that API for more user friendliness. The API has two
main types: ProcessConfig defines settings for
launching a process (command name, environment variables, etc), and
Process represents a running child process that can be
interacted with. With that, we have some basic API usage that looks
like this:
let processConfig = proc "some-executable" ["--flag1", "--flag2"]
process <- startProcess processConfig
hPut (getStdin process) "Input to the process"
output <- hGet (getStdout process)
helperFunction output
hPut (getStdin process) "quit" -- tell the process to quit
exitCode <- waitExitCode process
logInfo $ "Process exited with code " <> displayShow exitCode
This isn’t quite working code, but it gets the idea across pretty nicely.
There’s a problem with the code above: it’s not exception-safe.
Let’s say that the helperFunction call fails with a
runtime exception. The child process will never receive the
"quit" input, we’ll never wait for the child process
to end, and ultimately we’ll end up with a process that’s sitting
around, twiddling its thumbs, unable to ever exit. (You may think
this is a zombie process, but zombie has a specific and different
meaning in the Unix world.)
The Haskell ecosystem, like many others, has a method for
providing exception safety. We call it the bracket
pattern. You combine together resource allocation and cleanup
actions using the helper function bracket, and are
guaranteed that the cleanup action is called when your block
finishes, regardless of how it finishes.
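As a minimal sketch of that guarantee (this is not typed-process code; runWithLog and the "acquire"/"release" events are hypothetical stand-ins for startProcess and stopProcess), the following base-only program records the order of events for a block that finishes normally and for one that throws:

```haskell
import Control.Exception (ErrorCall (..), bracket, throwIO, try)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Run a block under 'bracket' and return the recorded events in order.
-- "acquire" and "release" are hypothetical stand-ins for startProcess
-- and stopProcess.
runWithLog :: IO () -> IO [String]
runWithLog block = do
  eventsRef <- newIORef []
  let record msg = modifyIORef' eventsRef (msg :)
  result <- try $ bracket
    (record "acquire")
    (\() -> record "release")
    (\() -> block)
  case result of
    Left (ErrorCall msg) -> record ("caught: " ++ msg)
    Right () -> record "finished normally"
  reverse <$> readIORef eventsRef

main :: IO ()
main = do
  -- The block finishes normally: cleanup still runs.
  runWithLog (pure ()) >>= print
  -- The block throws: cleanup runs before the exception propagates.
  runWithLog (throwIO (ErrorCall "helperFunction failed")) >>= print
```

In both runs "release" is recorded right after "acquire", before the caller learns how the block ended.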
To make this work, we need a stopProcess function.
This function is intelligent: if the process has already exited,
stopProcess doesn’t do anything. However, if the
process is still running, stopProcess sends it a
SIGTERM signal, which for most well-behaved programs
will cause it to exit. (Unix processes can actually handle
SIGTERM and continue running, but for our cases we’ll
pretend like it’s a process death sentence.)
So let’s rewrite the code above with bracket:
let processConfig = proc "some-executable" ["--flag1", "--flag2"]
bracket (startProcess processConfig) stopProcess $ \process -> do
  hPut (getStdin process) "Input to the process"
  output <- hGet (getStdout process)
  helperFunction output
  hPut (getStdin process) "quit" -- tell the process to quit
  exitCode <- waitExitCode process
  logInfo ("Process exited with code " <> displayShow exitCode)
And just like that, we have exception safety, and avoid runaway processes. Neato!
Let’s walk through the cases above. If any of the
actions in the block throw a runtime exception,
bracket will trigger stopProcess,
resulting in a SIGTERM being sent to the child. If, on
the other hand, no exception occurs, we know that the child process
has already exited thanks to the waitExitCode call,
and therefore stopProcess will be a no-op. That’s
exactly the behavior we want.
Following Haskell best practices, we can capture this
bracket call into a helper function called
withProcess:
withProcess config = bracket (startProcess config) stopProcess

let processConfig = proc "some-executable" ["--flag1", "--flag2"]
withProcess processConfig $ \process -> do
  hPut (getStdin process) "Input to the process"
  output <- hGet (getStdout process)
  helperFunction output
  hPut (getStdin process) "quit" -- tell the process to quit
  exitCode <- waitExitCode process
  logInfo ("Process exited with code " <> displayShow exitCode)
And exception safety has been achieved!
Finally, one more addition. A common pattern in working with
child processes is checking that the exit code is a success, and
throwing an exception if it’s anything else. We have a helper
function withProcess_ that performs that exit code
checking too. This essentially looks like:
withProcess_ config = bracket
  (startProcess config)
  (\process -> do
    stopProcess process
    checkExitCode process)
We’re going to perform a cardinal Unix sin: use the
cat executable when we’re not actually combining
together two different files. Please forgive me, it’s for a good
reason.
Below is a fully runnable Haskell script. You can install Stack,
copy the code into Main.hs, and run stack
Main.hs to run it. The program does the following:

- Launches cat with no
  arguments
- Runs it with withProcess_
- Sends Hello World!\n to
  the child over standard input and then closes the pipe
- Reads everything from the child’s standard output and prints it

#!/usr/bin/env stack
-- stack --resolver lts-13.26 script
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.Async (concurrently)
import qualified Data.ByteString as B
import System.IO (hClose, stdout)
import System.Process.Typed

main :: IO ()
main = do
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  -- Feed the child's stdin and drain its stdout concurrently
  ((), output) <- withProcess_ config $ \process -> concurrently
    (do B.hPut (getStdin process) "Hello World!\n"
        hClose (getStdin process))
    (do B.hGetContents (getStdout process))
  B.hPut stdout output
When I run this on OS X, I fairly reliably get the expected output:
$ stack Main.hs
Hello World!
However, when I run this on Linux, I will often get the following instead:
$ stack Main.hs
Main.hs: Received ExitFailure (-15) when running
Raw command: cat
Granted, not always, but often enough. So now we have a weird exit failure and some non-determinism, in what appears to be a really simple program. What gives?!?
The first thing to identify is what this negative exit code is.
Haskell—like a few other ecosystems—uses a negative exit code to
indicate that the process exited due to a signal. In this case,
that means the child process (cat) died with signal
number 15, which is SIGTERM. That’s certainly
interesting… where have we seen a SIGTERM come up
before? Right, in stopProcess.
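To make that convention concrete, here is a small base-only sketch. The Outcome type and classify function are hypothetical names for illustration, not part of process or typed-process; the only assumption is the convention described above, that a negative exit code is the negated signal number.

```haskell
import System.Exit (ExitCode (..))

-- Hypothetical classification of an exit code, for illustration only.
data Outcome
  = Succeeded
  | FailedWith Int      -- ordinary non-zero exit status
  | KilledBySignal Int  -- signal number, e.g. 15 for SIGTERM
  deriving (Eq, Show)

-- Assumes the convention that a negative exit code is the negated
-- number of the signal that ended the process.
classify :: ExitCode -> Outcome
classify ExitSuccess = Succeeded
classify (ExitFailure n)
  | n < 0 = KilledBySignal (negate n)
  | otherwise = FailedWith n

main :: IO ()
main = mapM_ (print . classify)
  [ExitSuccess, ExitFailure 1, ExitFailure (-15)]
```

Under this reading, the ExitFailure (-15) from the error message classifies as KilledBySignal 15.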
But it doesn’t quite make sense that stopProcess
would send the signal, since it only does so once the standard
output pipe from the child process has been closed. And we know
that cat exits at exactly the same time as it
closes its standard output pipe… right?
Hopefully my scare italics above helped a bit. No, as it turns
out, the pipe’s closure and the child’s exit are not
simultaneous. In fact, our cat process will end up
doing something like the following:
1. read from stdin
2. If data was returned, write to stdout and return
   to step 1
3. Close stdin
4. Close stdout
5. Exit the process

The parent process, meanwhile, will repeatedly call
read on the read end of the child’s
stdout pipe, and as soon as that read
indicates end of file (EOF), the block will exit, and
withProcess_ will do two things:

1. Call stopProcess
2. Call checkExitCode to make sure the process exited
   successfully

There are multiple interleavings of events that can occur. The success case looks like this:

1. Child closes stdout
2. Child exits
3. Parent’s read sees EOF
4. Parent calls stopProcess, which is a no-op (child
   has already exited)
5. checkExitCode gets exit code 0 and is happy

However, it’s also possible with a different process timing to get:

1. Child closes stdout
2. Parent’s read sees EOF
3. Parent calls stopProcess, which sends a
   SIGTERM to the child
4. checkExitCode sees that the child exited due to a
   SIGTERM and throws an exception

This may seem like a corner case, but it’s already bitten me twice: first in a test suite, and secondly as a major annoyance in the new Stack release.
Well, as usual, the person to blame is myself.

Usage of the Unix process API can be tricky to get right, but
it’s clearly documented and well executed. And I’d argue that my
usage of withProcess_ is the right kind of
abstraction. No, the problem is the implementation of
withProcess_. Let’s step through it again:
1. Start the child process
2. Run the provided block
3. When the block finishes, call stopProcess and then ensure there’s a success exit
   code

In our first usage above, we called waitExitCode in
the block, which guaranteed that in the success case
stopProcess would always end up as a no-op. Everything
was fine. The problem was that I assumed
cat’s pipes closing was the same as the child process
exiting. We know that’s not true. However, given that this bug hit
me twice, it’s fair to say I’ve created an API which encourages
misuse.
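Instead, here’s what I think is the better implementation for
withProcess_:

1. Start the child process
2. Run the provided block
3. If the block completes normally, wait for the child to exit and check its exit code
4. If the block throws an exception, call stopProcess

With this tweak to behavior, the code calling cat
above is safe, and I can sleep better at night.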
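The difference between the two orderings can be modelled without launching any real processes. The sketch below is a toy model, not the typed-process implementation: State, stop, check, termThenCheck and waitThenCheck are all hypothetical names, the "child" is just an IORef, and check assumes a still-running child will finish normally if we wait for it.

```haskell
import Control.Exception (ErrorCall (..), bracket, throwIO, try)
import Control.Monad (when)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- Toy model of a child process; no real process is launched.
data State = Running | ExitedNormally | Terminated
  deriving (Eq, Show)

-- Stand-in for stopProcess: terminate only if still running.
stop :: IORef State -> IO ()
stop ref = modifyIORef' ref $ \s -> if s == Running then Terminated else s

-- Stand-in for checkExitCode: wait for the child (assumed to finish
-- normally if left alone), then fail if it was terminated.
check :: IORef State -> IO ()
check ref = do
  modifyIORef' ref $ \s -> if s == Running then ExitedNormally else s
  s <- readIORef ref
  when (s == Terminated) $ throwIO (ErrorCall "exited due to SIGTERM")

-- Old ordering: always stop, then check.
termThenCheck :: (IORef State -> IO a) -> IO a
termThenCheck = bracket (newIORef Running) (\p -> stop p >> check p)

-- New ordering: on success wait and check; stop is only a fallback.
waitThenCheck :: (IORef State -> IO a) -> IO a
waitThenCheck inner =
  bracket (newIORef Running) stop (\p -> inner p <* check p)

-- Run a block that returns while the child is still "running".
outcome :: ((IORef State -> IO ()) -> IO ()) -> IO String
outcome strategy = do
  r <- try (strategy (\_ -> pure ()))
  pure $ case r of
    Left (ErrorCall msg) -> "exception: " ++ msg
    Right () -> "ok"

main :: IO ()
main = do
  outcome termThenCheck >>= putStrLn
  outcome waitThenCheck >>= putStrLn
```

In this model the old ordering reports an exception whenever the block returns before the child has exited, while the new ordering waits first and so only falls back to terminating the child when the block itself fails.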
Rolling out a new set of behavior which silently (meaning: no
compile-time change) modifies behavior at runtime is dangerous.
People using withProcess_ may be relying on exactly
its current behavior. Therefore, instead of replacing the current
withProcess_ behavior, the roll-out strategy is:
1. Add a new function, withProcessTerm_, which
   has the same behavior as withProcess_ today
2. Add a new function, withProcessWait_, which
   has the new behavior I just described above
3. Deprecate withProcess_ with a message indicating
   that the caller should use one of the replacement functions
   instead

This will encourage users of typed-process to
analyze their usages of withProcess_, see if they are
susceptible to the bug described here, and choose the appropriate
replacement.
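For the cat example from earlier, the migration might look like the sketch below. It assumes a typed-process release that already exports withProcessWait_, plus the async and bytestring packages, so the package set you run it with needs to be new enough. The runCat helper is a hypothetical name introduced here so the result can be inspected.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.Async (concurrently)
import qualified Data.ByteString as B
import System.IO (hClose, stdout)
import System.Process.Typed

-- Hypothetical helper: run cat, feed it one line, return its output.
-- Assumes a typed-process version that exports withProcessWait_.
runCat :: IO B.ByteString
runCat = do
  let config = setStdin createPipe
             $ setStdout createPipe
             $ proc "cat" []
  ((), output) <- withProcessWait_ config $ \process -> concurrently
    (do B.hPut (getStdin process) "Hello World!\n"
        hClose (getStdin process))
    (B.hGetContents (getStdout process))
  pure output

main :: IO ()
main = runCat >>= B.hPut stdout
```

The only change from the original script is the function name: because the wait variant lets the child exit on its own before checking the exit code, the window in which a SIGTERM could be sent after EOF should no longer exist on the success path.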