For a change of pace, I wanted to cover a simple topic:
asynchronous exceptions, and how they interact with Software
Transactional Memory and MVar
s, as well as the GHC
runtime system's deadlock detection. As fate would have it, the
topics in question occurred twice in the past month, once in a
bug report
against the yaml package, and once in a major refactoring of FP
Complete's distributed computation platform. I'll focus on the
first since it's an easier case to explain, but I'll try to explain
the second (and jucier) one too.
To get this started off nicely, I'll share an interesting piece
of code and its output. If you're playing along at home, try
guessing what the output of this program will be, and if your guess
is different from the actual output, try to predict why. Hopefully,
by the end of this post, it will be clear (although still perhaps
surprising).
#!/usr/bin/env stack
import Control.Concurrent
import Control.Exception
mayThrow1, mayThrow2, mayThrow3 :: IO Int
mayThrow1 = return 5
mayThrow2 = error "throw a normal exception"
mayThrow3 = do
var <- newEmptyMVar
takeMVar var
tryAny :: IO a -> IO (Either SomeException a)
tryAny action = do
var <- newEmptyMVar
uninterruptibleMask_ $ forkIOWithUnmask $ \restore -> do
res <- try $ restore action
putMVar var res
takeMVar var
main :: IO ()
main = do
tryAny mayThrow1 >>= print
tryAny mayThrow2 >>= print
tryAny mayThrow3 >>= print
putStrLn "done!"
And actual output:
Right 5
Left throw a normal exception
example1.hs: thread blocked indefinitely in an MVar operation
To clarify, that last line of output means that the program
itself exited because a BlockedIndefinitelyOnMVar
exception was thrown to the main thread and was not caught. Also,
this is possibly the first time I've written an interesting code
example for a blog post that uses only the base package :)
Synchronous vs
asynchronous exceptions
There are two* ways an exception can be thrown in Haskell:
-
Synchronous exceptions are generated by the current
thread. What's important about these is that we generally want to
be able to recover from them. For example, if you try to read from
a file, and the file doesn't exist, you may wish to use some
default value instead of having your program exit, or perhaps
prompt the user for a different file location.
-
Asynchronous exceptions are thrown by either a different
user thread, or by the runtime system itself. For example, in the
async
package, race
will kill the
longer-running thread with an asynchronous exception. Similarly,
the timeout
function will kill an action which has run
for too long. And epic foreshadowment the runtime system
will kill threads which appear to be deadlocked on
MVar
s or STM
actions.
In contrast to synchronous exceptions, we almost never want to
recover from asynchronous exceptions. In fact, this is a common
mistake in Haskell code, and from what I've seen has been the
largest source of confusion and concern amongst users when it comes
to Haskell's runtime exception system.
Alright, got that? Catch the synchronous ones and recover, never
recover from asynchronous exceptions**.
* You could argue that creating an exception in a pure value,
via functions like error
and undefined
,
is a third category. We're going to ignore that for now.
** You may be wondering "why make it possible to catch async
exceptions if we can't recover from them?" The answer is that we
still want to be able to perform cleanup actions, like closing file
handles. This is the topic of the aforementioned upcoming blog
post.
Catching all exceptions
Quite a few years ago, I wrote a
School of Haskell article on how to catch all exceptions. I'm
not going to rehash that article here, since (1) the content is
still current, and (2) I'll likely be sharing a new blog post in a
few days with a newer approach. But the gist of it is that if we
want to catch all synchronous exceptions in an action without
catching asynchronous ones, we run that action in a separate
thread, and then grab the result from that thread using a mutable
variable. The SoH article uses the async package, which hides away
some of the details. However, the core idea is easy enough to
express as I did above:
tryAny :: IO a -> IO (Either SomeException a)
tryAny action = do
var <- newEmptyMVar
uninterruptibleMask_ $ forkIOWithUnmask $ \restore -> do
res <- try $ restore action
putMVar var res
takeMVar var
Let's address some important points in the design of this
function, since it has to step quite carefully around asynchronous
exceptions:
- We use
forkIOWithUnmask
to ensure that the newly
created thread has asynchronous exceptions masked. This means that
we will receive no async exceptions in the new thread, until we
wrap an exception with the restore
function.
- We restore the user's
action
, meaning async
exceptions can be received there. But we wrap that restore
action
with a try
, catching all synchronous and
asynchronous exceptions.
- Since outside of the
try
we still have all async
exceptions masked, there's nothing preventing the
putMVar
call in the child thread from being called.
This means that, in all cases of the action
terminating, the putMVar
will ensure that the
var
is filled up.
- The
takeMVar
in the parent thread will block until
the child thread completes. And since we're guaranteed that the
var
will always be filled once the action
terminates, we seem (once again, epic foreshadowment) to be
guaranteed that the takeMVar
call will always wait for
action
to somehow terminate, and once it's terminated
will return either the exception is died with or the success
result.
Deadlock protection
Alright, I've foreshadowed enough already, I'm guessing
everyone's figured out where the problem is going to come. Sure
enough, if we look at the program I started this post with, we see
that tryAny mayThrow3 >>= print
does not catch
the exception generated by mayThrow3
, but instead
itself dies from GHC's deadlock protection
(BlockedIndefinitelyInMVar
). Simon Marlow explained
the reason for this to me twice, once in his (really
really good) book, and once when I raised issue #14 on
async because I hadn't been smart enough to connect the dots
from the book to my actual issue.
As long as I'm linking so much to Github, you may also like seeing
my pull request
to work around issue #14.
Don't worry, I'm not going to make you read all of that. The
summary is pretty simple: GHC's run time system has very
straightforward deadlock detection. When it finds a group of
threads that are all blocked on blocking mutable variables
(MVar
s or STM
variables), and finds that
no other threads have references to the variables it question, it
determines that all of the threads are deadlocked, and sends all of
them asynchronous exceptions (either
BlockedIndefinitelyOnMVar
or
BlockedIndefinitelyOnSTM
).
So let's analyze tryAny mayThrow3
in excruciating
detail. The series of events here is:
- Create a new empty
MVar
to allow the child thread
to pass back the result of mayThrow3
to the parent
thread.
- Fork a new child thread (yes, exceptions are masked, but that's
not terribly important for us).
- Back in the parent thread: block waiting for that empty
MVar
to get filled.
- In the child thread, set up
try
to catch all
exceptions, and then run the user's action (with async exceptions
unmasked).
- As it happens, I carefully crafted
mayThrow3
to
deadlock on an MVar
. So now the child thread can't
make progress, and is fully deadlocked.
OK, the stage is set: the main thread is blocked on an
MVar
for the result. As I proved above, we know for a
fact that this MVar
will ultimately be filled with the
result of the user action, once the user action completes or dies
with an exception. However, GHC doesn't know this, it just
sees the main thread blocked on an MVar
action, and
that only the main thread and child thread have references to that
variable.
In the child thread, we've created a new MVar
deadlock. This is a true and legitimate deadlock (if you don't
believe me, go back and look at the implementation of
mayThrow3
). So GHC decides that both the main thread
and the child thread are currently blocked on MVar
actions, and there is no other thread that can unblock either of
them. And GHC is correct. However, the next part is the unfortunate
part:
GHC kills both threads with an asynchronous
exceptions
Why is this unfortunate? Because we know that if the child
thread receives the async exception, it will exit, the result
MVar
will get filled, and the main thread can complete
successfully. That's exactly the behavior we want, but not the
behavior GHC is able to deliver to us.
The workaround
The workaround I'm about to demonstrate is not elegant, and it
leaves a bad taste in my mouth. If anyone has a better idea, let me
know. The idea is this: we know that there's no legitimate reason
for the takeMVar
inside tryAny
to be
killed with a BlockedIndefinitelyOnMVar
, even if GHC
doesn't know that. So we outsmart GHC by catching and ignoring that
specific exception. And just in case our logic is wrong and there's
a flawed implementation here, we only catch-and-ignore a fixed
number of times (arbitrary choice: 10) to avoid creating a real
deadlock.
I've added a
commit for enclosed-exceptions that implements this behavior,
but for the lazy, here's a simplified version of it:
tryAny :: IO a -> IO (Either SomeException a)
tryAny action = do
var <- newEmptyMVar
uninterruptibleMask_ $ forkIOWithUnmask $ \restore -> do
res <- try $ restore action
putMVar var res
retryCount 10 (takeMVar var)
where
retryCount cnt0 action =
loop cnt0
where
loop 0 = action
loop cnt = action `catch`
\BlockedIndefinitelyOnMVar -> loop (cnt 1)
Swallow the bit of vomit rising in the back of your throat, try
this out in the original program, and see if it solves the issue.
You should see the following output:
Right 5
Left throw a normal exception
Left thread blocked indefinitely in an MVar operation
done!
Note that the "thread blocked" has actually been caught, and the
final done!
line is actually printed.
Problem with STM-based
tryAny
tryAny
was already implemented safely in terms of
the async package, which uses STM
in place of
MVar
s as we did here. Mitchell Rosen came up with a
really
convincing demonstration for why we shouldn't use STM for this
purpose. To quote, with the program:
#!/usr/bin/env stack
import Control.Concurrent.STM
import Data.Yaml
main = do
mval <- atomically (return $! decode "")
print (mval :: Maybe Value)
We get the rather surprising output:
Control.Concurrent.STM.atomically was nested
The reason is that decode
is calling a bunch of
IO
functions as FFI calls to the libyaml library, and
is wrapped in unsafePerformIO
to make the morally-pure
YAML-decoding action not live in IO
. Under the
surface, it uses tryAny
to avoid unexpected exceptions
from breaking out of the unsafePerformIO
dungeon it
lives in. That's all well and good, but now if you use
decode
inside an atomically
block, you
have two layers of STM
occurring, which is strictly
forbidden.
Extra credit Figure out why I needed to use
$!
to make this fail.
Our distributed
computing problem
This problem arose about two months ago, and is far more
complicated than the examples I've discussed until now. In fact,
I've so far been unsuccessful at reproducing this with a minimal
test case, which is definitely annoying. Nonetheless, I can explain
what basically happened:
- Have a queue of work items to be performed, represented by a
TChan
- Have one thread (filler) grabbing work items from somewhere
(like a socket) and putting them on the
TChan
- Have another thread (worker) grabbing work items and performing
them
- The filler has logic to detect when no more items are incoming,
and exits
- Run these two threads together using the
race
function, which guarantees that once the filler exits, the worker
will be killed too
Unfortunately, this runs into the same problem I described
above, at least in some circumstances. The thread calling
race
is blocked on an MVar
under the
surface, which filler and worker both have access to. filler
completes, puts something in the MVar
, and then
disappears, losing its reference to the MVar
and the
TChan
. Meanwhile, worker is still blocked waiting on
the TChan
, to which only it has access now, resulting
in a deadlock condition.
In our application, we ended up with a situation where the
worker thread was deadlocked on the TChan
, and the
thread calling race
was blocked on the
MVar
. It's a slightly more complicated case than this
though, since in the simple race
case, the thread
calling race
doesn't actually get blocked on the
MVar
. Nonetheless, it's another interesting case where
the deadlock prevention kills too many threads.
Fortunately for our application, there's a very simple - and
more robust - solution, which I recommend to anyone doing this kind
of concurrent programming. The problem was that our worker thread
had no logic to terminate itself, and instead just waited to be
killed by race
. Instead, we switched over to a
closeable TChan
implementation, where the filler
thread could explicitly closed out the TChan
and then
the worker would kill itself instead of waiting around for an async
exception.
So to leave this blog post with some recommendations, and a
small amount of extra foreshadowment (fourth time getting to use
that word in one blog post!):
- Explicit exit conditions are better than implicit ones
- Avoid relying on async exceptions for control flow if you
can
- Relying on the deadlock detection to clean up your messes is a
bad idea
- And even if you don't want the deadlock detection to clean up
your messes, it may try to do so for you badly anyway
Subscribe to our blog via email
Email subscriptions come from our Atom feed and are handled by Blogtrottr. You will only receive notifications of blog posts, and can unsubscribe any time.
Do you like this blog post and need help with Next Generation Software Engineering, Platform Engineering or Blockchain & Smart Contracts? Contact us.