async exceptions, STM, and deadlocks

For a change of pace, I wanted to cover a simple topic: asynchronous exceptions, and how they interact with Software Transactional Memory and MVars, as well as the GHC runtime system's deadlock detection. As fate would have it, the topics in question occurred twice in the past month, once in a bug report against the yaml package, and once in a major refactoring of FP Complete's distributed computation platform. I'll focus on the first since it's an easier case to explain, but I'll try to explain the second (and jucier) one too.

To get this started off nicely, I'll share an interesting piece of code and its output. If you're playing along at home, try guessing what the output of this program will be, and if your guess is different from the actual output, try to predict why. Hopefully, by the end of this post, it will be clear (although still perhaps surprising).

#!/usr/bin/env stack
-- stack --resolver ghc-7.10.3 runghc
import Control.Concurrent
import Control.Exception

mayThrow1, mayThrow2, mayThrow3 :: IO Int

mayThrow1 = return 5 -- nope, don't throw

mayThrow2 = error "throw a normal exception"

mayThrow3 = do
    var <- newEmptyMVar
    takeMVar var -- BlockedIndefinitelyOnMVar, always!

-- implements the idea from:
-- https://www.schoolofhaskell.com/user/snoyberg/general-haskell/exceptions/catching-all-exceptions
--
-- Don't use the async package, which already works around the bug
-- I'm demonstrating!
tryAny :: IO a -> IO (Either SomeException a)
tryAny action = do
    var <- newEmptyMVar
    uninterruptibleMask_ $ forkIOWithUnmask $ \restore -> do
        res <- try $ restore action
        putMVar var res
    takeMVar var

main :: IO ()
main = do
    tryAny mayThrow1 >>= print
    tryAny mayThrow2 >>= print
    tryAny mayThrow3 >>= print

    putStrLn "done!"

And actual output:

Right 5
Left throw a normal exception
example1.hs: thread blocked indefinitely in an MVar operation

To clarify, that last line of output means that the program itself exited because a BlockedIndefinitelyOnMVar exception was thrown to the main thread and was not caught. Also, this is possibly the first time I've written an interesting code example for a blog post that uses only the base package :)

Synchronous vs asynchronous exceptions

There are two* ways an exception can be thrown in Haskell:

Synchronous exceptions are generated by the current thread. What's important about these is that we generally want to be able to recover from them. For example, if you try to read from a file, and the file doesn't exist, you may wish to use some default value instead of having your program exit, or perhaps prompt the user for a different file location.
Asynchronous exceptions are thrown by either a different user thread, or by the runtime system itself. For example, in the async package, race will kill the longer-running thread with an asynchronous exception. Similarly, the timeout function will kill an action which has run for too long. And epic foreshadowment the runtime system will kill threads which appear to be deadlocked on MVars or STM actions.

In contrast to synchronous exceptions, we almost never want to recover from asynchronous exceptions. In fact, this is a common mistake in Haskell code, and from what I've seen has been the largest source of confusion and concern amongst users when it comes to Haskell's runtime exception system.

Alright, got that? Catch the synchronous ones and recover, never recover from asynchronous exceptions**.

* You could argue that creating an exception in a pure value, via functions like error and undefined, is a third category. We're going to ignore that for now.

** You may be wondering "why make it possible to catch async exceptions if we can't recover from them?" The answer is that we still want to be able to perform cleanup actions, like closing file handles. This is the topic of the aforementioned upcoming blog post.

Catching all exceptions

Quite a few years ago, I wrote a School of Haskell article on how to catch all exceptions. I'm not going to rehash that article here, since (1) the content is still current, and (2) I'll likely be sharing a new blog post in a few days with a newer approach. But the gist of it is that if we want to catch all synchronous exceptions in an action without catching asynchronous ones, we run that action in a separate thread, and then grab the result from that thread using a mutable variable. The SoH article uses the async package, which hides away some of the details. However, the core idea is easy enough to express as I did above:

tryAny :: IO a -> IO (Either SomeException a)
tryAny action = do
    var <- newEmptyMVar
    uninterruptibleMask_ $ forkIOWithUnmask $ \restore -> do
        res <- try $ restore action
        putMVar var res
    takeMVar var

Let's address some important points in the design of this function, since it has to step quite carefully around asynchronous exceptions:

We use forkIOWithUnmask to ensure that the newly created thread has asynchronous exceptions masked. This means that we will receive no async exceptions in the new thread, until we wrap an exception with the restore function.
We restore the user's action, meaning async exceptions can be received there. But we wrap that restore action with a try, catching all synchronous and asynchronous exceptions.
Since outside of the try we still have all async exceptions masked, there's nothing preventing the putMVar call in the child thread from being called. This means that, in all cases of the action terminating, the putMVar will ensure that the var is filled up.
The takeMVar in the parent thread will block until the child thread completes. And since we're guaranteed that the var will always be filled once the action terminates, we seem (once again, epic foreshadowment) to be guaranteed that the takeMVar call will always wait for action to somehow terminate, and once it's terminated will return either the exception is died with or the success result.

Deadlock protection

Alright, I've foreshadowed enough already, I'm guessing everyone's figured out where the problem is going to come. Sure enough, if we look at the program I started this post with, we see that tryAny mayThrow3 >>= print does not catch the exception generated by mayThrow3, but instead itself dies from GHC's deadlock protection (BlockedIndefinitelyInMVar). Simon Marlow explained the reason for this to me twice, once in his (really really good) book, and once when I raised issue #14 on async because I hadn't been smart enough to connect the dots from the book to my actual issue. As long as I'm linking so much to Github, you may also like seeing my pull request to work around issue #14.

Don't worry, I'm not going to make you read all of that. The summary is pretty simple: GHC's run time system has very straightforward deadlock detection. When it finds a group of threads that are all blocked on blocking mutable variables (MVars or STM variables), and finds that no other threads have references to the variables it question, it determines that all of the threads are deadlocked, and sends all of them asynchronous exceptions (either BlockedIndefinitelyOnMVar or BlockedIndefinitelyOnSTM).

So let's analyze tryAny mayThrow3 in excruciating detail. The series of events here is:

Create a new empty MVar to allow the child thread to pass back the result of mayThrow3 to the parent thread.
Fork a new child thread (yes, exceptions are masked, but that's not terribly important for us).
Back in the parent thread: block waiting for that empty MVar to get filled.
In the child thread, set up try to catch all exceptions, and then run the user's action (with async exceptions unmasked).
As it happens, I carefully crafted mayThrow3 to deadlock on an MVar. So now the child thread can't make progress, and is fully deadlocked.

OK, the stage is set: the main thread is blocked on an MVar for the result. As I proved above, we know for a fact that this MVar will ultimately be filled with the result of the user action, once the user action completes or dies with an exception. However, GHC doesn't know this, it just sees the main thread blocked on an MVar action, and that only the main thread and child thread have references to that variable.

In the child thread, we've created a new MVar deadlock. This is a true and legitimate deadlock (if you don't believe me, go back and look at the implementation of mayThrow3). So GHC decides that both the main thread and the child thread are currently blocked on MVar actions, and there is no other thread that can unblock either of them. And GHC is correct. However, the next part is the unfortunate part:

GHC kills both threads with an asynchronous exceptions

Why is this unfortunate? Because we know that if the child thread receives the async exception, it will exit, the result MVar will get filled, and the main thread can complete successfully. That's exactly the behavior we want, but not the behavior GHC is able to deliver to us.

The workaround

The workaround I'm about to demonstrate is not elegant, and it leaves a bad taste in my mouth. If anyone has a better idea, let me know. The idea is this: we know that there's no legitimate reason for the takeMVar inside tryAny to be killed with a BlockedIndefinitelyOnMVar, even if GHC doesn't know that. So we outsmart GHC by catching and ignoring that specific exception. And just in case our logic is wrong and there's a flawed implementation here, we only catch-and-ignore a fixed number of times (arbitrary choice: 10) to avoid creating a real deadlock.

I've added a commit for enclosed-exceptions that implements this behavior, but for the lazy, here's a simplified version of it:

tryAny :: IO a -> IO (Either SomeException a)
tryAny action = do
    var <- newEmptyMVar
    uninterruptibleMask_ $ forkIOWithUnmask $ \restore -> do
        res <- try $ restore action
        putMVar var res
    retryCount 10 (takeMVar var)
  where
    retryCount cnt0 action =
        loop cnt0
      where
        loop 0 = action
        loop cnt = action `catch`
            \BlockedIndefinitelyOnMVar -> loop (cnt - 1)

Swallow the bit of vomit rising in the back of your throat, try this out in the original program, and see if it solves the issue. You should see the following output:

Right 5
Left throw a normal exception
Left thread blocked indefinitely in an MVar operation
done!

Note that the "thread blocked" has actually been caught, and the final done! line is actually printed.

Problem with STM-based tryAny

tryAny was already implemented safely in terms of the async package, which uses STM in place of MVars as we did here. Mitchell Rosen came up with a really convincing demonstration for why we shouldn't use STM for this purpose. To quote, with the program:

#!/usr/bin/env stack
-- stack --resolver lts-6.3 runghc --package yaml
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent.STM
import Data.Yaml

main = do
    mval <- atomically (return $! decode "")
    print (mval :: Maybe Value)

We get the rather surprising output:

Control.Concurrent.STM.atomically was nested

The reason is that decode is calling a bunch of IO functions as FFI calls to the libyaml library, and is wrapped in unsafePerformIO to make the morally-pure YAML-decoding action not live in IO. Under the surface, it uses tryAny to avoid unexpected exceptions from breaking out of the unsafePerformIO dungeon it lives in. That's all well and good, but now if you use decode inside an atomically block, you have two layers of STM occurring, which is strictly forbidden.

Extra credit Figure out why I needed to use $! to make this fail.

Our distributed computing problem

This problem arose about two months ago, and is far more complicated than the examples I've discussed until now. In fact, I've so far been unsuccessful at reproducing this with a minimal test case, which is definitely annoying. Nonetheless, I can explain what basically happened:

Have a queue of work items to be performed, represented by a TChan
Have one thread (filler) grabbing work items from somewhere (like a socket) and putting them on the TChan
Have another thread (worker) grabbing work items and performing them
The filler has logic to detect when no more items are incoming, and exits
Run these two threads together using the race function, which guarantees that once the filler exits, the worker will be killed too

Unfortunately, this runs into the same problem I described above, at least in some circumstances. The thread calling race is blocked on an MVar under the surface, which filler and worker both have access to. filler completes, puts something in the MVar, and then disappears, losing its reference to the MVar and the TChan. Meanwhile, worker is still blocked waiting on the TChan, to which only it has access now, resulting in a deadlock condition.

In our application, we ended up with a situation where the worker thread was deadlocked on the TChan, and the thread calling race was blocked on the MVar. It's a slightly more complicated case than this though, since in the simple race case, the thread calling race doesn't actually get blocked on the MVar. Nonetheless, it's another interesting case where the deadlock prevention kills too many threads.

Fortunately for our application, there's a very simple - and more robust - solution, which I recommend to anyone doing this kind of concurrent programming. The problem was that our worker thread had no logic to terminate itself, and instead just waited to be killed by race. Instead, we switched over to a closeable TChan implementation, where the filler thread could explicitly closed out the TChan and then the worker would kill itself instead of waiting around for an async exception.

So to leave this blog post with some recommendations, and a small amount of extra foreshadowment (fourth time getting to use that word in one blog post!):

Explicit exit conditions are better than implicit ones
Avoid relying on async exceptions for control flow if you can
Relying on the deadlock detection to clean up your messes is a bad idea
And even if you don't want the deadlock detection to clean up your messes, it may try to do so for you badly anyway

Subscribe to our blog via email
Email subscriptions come from our Atom feed and are handled by Blogtrottr. You will only receive notifications of blog posts, and can unsubscribe any time.

Do you like this blog post and need help with Next Generation Software Engineering, Platform Engineering or Blockchain & Smart Contracts? Contact us.