For a change of pace, I wanted to cover a simple topic:
asynchronous exceptions, and how they interact with Software
Transactional Memory and MVar
s, as well as the GHC
runtime system’s deadlock detection. As fate would have it, the
topics in question occurred twice in the past month, once in a
bug report
against the yaml package, and once in a major refactoring of FP
Complete’s distributed computation platform. I’ll focus on the
first since it’s an easier case to explain, but I’ll try to explain
the second (and jucier) one too.
To get this started off nicely, I’ll share an interesting piece of code and its output. If you’re playing along at home, try guessing what the output of this program will be, and if your guess is different from the actual output, try to predict why. Hopefully, by the end of this post, it will be clear (although still perhaps surprising).
#!/usr/bin/env stack -- stack --resolver ghc-7.10.3 runghc import Control.Concurrent import Control.Exception mayThrow1, mayThrow2, mayThrow3 :: IO Int mayThrow1 = return 5 -- nope, don't throw mayThrow2 = error "throw a normal exception" mayThrow3 = do var <- newEmptyMVar takeMVar var -- BlockedIndefinitelyOnMVar, always! -- implements the idea from: -- https://www.schoolofhaskell.com/user/snoyberg/general-haskell/exceptions/catching-all-exceptions -- -- Don't use the async package, which already works around the bug -- I'm demonstrating! tryAny :: IO a -> IO (Either SomeException a) tryAny action = do var <- newEmptyMVar uninterruptibleMask_ $ forkIOWithUnmask $ restore -> do res <- try $ restore action putMVar var res takeMVar var main :: IO () main = do tryAny mayThrow1 >>= print tryAny mayThrow2 >>= print tryAny mayThrow3 >>= print putStrLn "done!"
And actual output:
Right 5
Left throw a normal exception
example1.hs: thread blocked indefinitely in an MVar operation
To clarify, that last line of output means that the program
itself exited because a BlockedIndefinitelyOnMVar
exception was thrown to the main thread and was not caught. Also,
this is possibly the first time I’ve written an interesting code
example for a blog post that uses only the base package 🙂
There are two* ways an exception can be thrown in Haskell:
Synchronous exceptions are generated by the current thread. What’s important about these is that we generally want to be able to recover from them. For example, if you try to read from a file, and the file doesn’t exist, you may wish to use some default value instead of having your program exit, or perhaps prompt the user for a different file location.
Asynchronous exceptions are thrown by either a different
user thread, or by the runtime system itself. For example, in the
async
package, race
will kill the
longer-running thread with an asynchronous exception. Similarly,
the timeout
function will kill an action which has run
for too long. And epic foreshadowment the runtime system
will kill threads which appear to be deadlocked on
MVar
s or STM
actions.
In contrast to synchronous exceptions, we almost never want to recover from asynchronous exceptions. In fact, this is a common mistake in Haskell code, and from what I’ve seen has been the largest source of confusion and concern amongst users when it comes to Haskell’s runtime exception system.
Alright, got that? Catch the synchronous ones and recover, never recover from asynchronous exceptions**.
* You could argue that creating an exception in a pure value,
via functions like error
and undefined
,
is a third category. We’re going to ignore that for now.
** You may be wondering “why make it possible to catch async exceptions if we can’t recover from them?” The answer is that we still want to be able to perform cleanup actions, like closing file handles. This is the topic of the aforementioned upcoming blog post.
Quite a few years ago, I wrote a School of Haskell article on how to catch all exceptions. I’m not going to rehash that article here, since (1) the content is still current, and (2) I’ll likely be sharing a new blog post in a few days with a newer approach. But the gist of it is that if we want to catch all synchronous exceptions in an action without catching asynchronous ones, we run that action in a separate thread, and then grab the result from that thread using a mutable variable. The SoH article uses the async package, which hides away some of the details. However, the core idea is easy enough to express as I did above:
tryAny :: IO a -> IO (Either SomeException a) tryAny action = do var <- newEmptyMVar uninterruptibleMask_ $ forkIOWithUnmask $ restore -> do res <- try $ restore action putMVar var res takeMVar var
Let’s address some important points in the design of this function, since it has to step quite carefully around asynchronous exceptions:
forkIOWithUnmask
to ensure that the newly
created thread has asynchronous exceptions masked. This means that
we will receive no async exceptions in the new thread, until we
wrap an exception with the restore
function.action
, meaning async
exceptions can be received there. But we wrap that restore
action
with a try
, catching all synchronous and
asynchronous exceptions.try
we still have all async
exceptions masked, there’s nothing preventing the
putMVar
call in the child thread from being called.
This means that, in all cases of the action
terminating, the putMVar
will ensure that the
var
is filled up.takeMVar
in the parent thread will block until
the child thread completes. And since we’re guaranteed that the
var
will always be filled once the action
terminates, we seem (once again, epic foreshadowment) to be
guaranteed that the takeMVar
call will always wait for
action
to somehow terminate, and once it’s terminated
will return either the exception is died with or the success
result.Alright, I’ve foreshadowed enough already, I’m guessing
everyone’s figured out where the problem is going to come. Sure
enough, if we look at the program I started this post with, we see
that tryAny mayThrow3 >>= print
does not catch
the exception generated by mayThrow3
, but instead
itself dies from GHC’s deadlock protection
(BlockedIndefinitelyInMVar
). Simon Marlow explained
the reason for this to me twice, once in his (really
really good) book, and once when I raised issue #14 on
async because I hadn’t been smart enough to connect the dots
from the book to my actual issue.
As long as I’m linking so much to Github, you may also like seeing
my pull request
to work around issue #14.
Don’t worry, I’m not going to make you read all of that. The
summary is pretty simple: GHC’s run time system has very
straightforward deadlock detection. When it finds a group of
threads that are all blocked on blocking mutable variables
(MVar
s or STM
variables), and finds that
no other threads have references to the variables it question, it
determines that all of the threads are deadlocked, and sends all of
them asynchronous exceptions (either
BlockedIndefinitelyOnMVar
or
BlockedIndefinitelyOnSTM
).
So let’s analyze tryAny mayThrow3
in excruciating
detail. The series of events here is:
MVar
to allow the child thread
to pass back the result of mayThrow3
to the parent
thread.MVar
to get filled.try
to catch all
exceptions, and then run the user’s action (with async exceptions
unmasked).mayThrow3
to
deadlock on an MVar
. So now the child thread can’t
make progress, and is fully deadlocked.OK, the stage is set: the main thread is blocked on an
MVar
for the result. As I proved above, we know for a
fact that this MVar
will ultimately be filled with the
result of the user action, once the user action completes or dies
with an exception. However, GHC doesn’t know this, it just
sees the main thread blocked on an MVar
action, and
that only the main thread and child thread have references to that
variable.
In the child thread, we’ve created a new MVar
deadlock. This is a true and legitimate deadlock (if you don’t
believe me, go back and look at the implementation of
mayThrow3
). So GHC decides that both the main thread
and the child thread are currently blocked on MVar
actions, and there is no other thread that can unblock either of
them. And GHC is correct. However, the next part is the unfortunate
part:
GHC kills both threads with an asynchronous exceptions
Why is this unfortunate? Because we know that if the child
thread receives the async exception, it will exit, the result
MVar
will get filled, and the main thread can complete
successfully. That’s exactly the behavior we want, but not the
behavior GHC is able to deliver to us.
The workaround I’m about to demonstrate is not elegant, and it
leaves a bad taste in my mouth. If anyone has a better idea, let me
know. The idea is this: we know that there’s no legitimate reason
for the takeMVar
inside tryAny
to be
killed with a BlockedIndefinitelyOnMVar
, even if GHC
doesn’t know that. So we outsmart GHC by catching and ignoring that
specific exception. And just in case our logic is wrong and there’s
a flawed implementation here, we only catch-and-ignore a fixed
number of times (arbitrary choice: 10) to avoid creating a real
deadlock.
I’ve added a commit for enclosed-exceptions that implements this behavior, but for the lazy, here’s a simplified version of it:
tryAny :: IO a -> IO (Either SomeException a) tryAny action = do var <- newEmptyMVar uninterruptibleMask_ $ forkIOWithUnmask $ restore -> do res <- try $ restore action putMVar var res retryCount 10 (takeMVar var) where retryCount cnt0 action = loop cnt0 where loop 0 = action loop cnt = action `catch` BlockedIndefinitelyOnMVar -> loop (cnt - 1)
Swallow the bit of vomit rising in the back of your throat, try this out in the original program, and see if it solves the issue. You should see the following output:
Right 5
Left throw a normal exception
Left thread blocked indefinitely in an MVar operation
done!
Note that the “thread blocked” has actually been caught, and the
final done!
line is actually printed.
tryAny
was already implemented safely in terms of
the async package, which uses STM
in place of
MVar
s as we did here. Mitchell Rosen came up with a
really
convincing demonstration for why we shouldn’t use STM for this
purpose. To quote, with the program:
#!/usr/bin/env stack -- stack --resolver lts-6.3 runghc --package yaml {-# LANGUAGE OverloadedStrings #-} import Control.Concurrent.STM import Data.Yaml main = do mval <- atomically (return $! decode "") print (mval :: Maybe Value)
We get the rather surprising output:
Control.Concurrent.STM.atomically was nested
The reason is that decode
is calling a bunch of
IO
functions as FFI calls to the libyaml library, and
is wrapped in unsafePerformIO
to make the morally-pure
YAML-decoding action not live in IO
. Under the
surface, it uses tryAny
to avoid unexpected exceptions
from breaking out of the unsafePerformIO
dungeon it
lives in. That’s all well and good, but now if you use
decode
inside an atomically
block, you
have two layers of STM
occurring, which is strictly
forbidden.
Extra credit Figure out why I needed to use
$!
to make this fail.
This problem arose about two months ago, and is far more complicated than the examples I’ve discussed until now. In fact, I’ve so far been unsuccessful at reproducing this with a minimal test case, which is definitely annoying. Nonetheless, I can explain what basically happened:
TChan
TChan
race
function, which guarantees that once the filler exits, the worker
will be killed tooUnfortunately, this runs into the same problem I described
above, at least in some circumstances. The thread calling
race
is blocked on an MVar
under the
surface, which filler and worker both have access to. filler
completes, puts something in the MVar
, and then
disappears, losing its reference to the MVar
and the
TChan
. Meanwhile, worker is still blocked waiting on
the TChan
, to which only it has access now, resulting
in a deadlock condition.
In our application, we ended up with a situation where the
worker thread was deadlocked on the TChan
, and the
thread calling race
was blocked on the
MVar
. It’s a slightly more complicated case than this
though, since in the simple race
case, the thread
calling race
doesn’t actually get blocked on the
MVar
. Nonetheless, it’s another interesting case where
the deadlock prevention kills too many threads.
Fortunately for our application, there’s a very simple – and
more robust – solution, which I recommend to anyone doing this kind
of concurrent programming. The problem was that our worker thread
had no logic to terminate itself, and instead just waited to be
killed by race
. Instead, we switched over to a
closeable TChan
implementation, where the filler
thread could explicitly closed out the TChan
and then
the worker would kill itself instead of waiting around for an async
exception.
So to leave this blog post with some recommendations, and a small amount of extra foreshadowment (fourth time getting to use that word in one blog post!):
Subscribe to our blog via email
Email subscriptions come from our Atom feed and are handled by Blogtrottr. You will only receive notifications of blog posts, and can unsubscribe any time.