This blog post is a semi-complete retelling of my debugging
adventures on a particularly crazy bug. Consider it a combination
of fun story telling for the audience and catharsis for myself.
This affects anyone trying to use AppVeyor and Stack for projects
that use the network
package with GHC 8.4 and up. If
you want to cheat and get the solution, check out the pull request.
Otherwise, strap yourselves in, it's gonna be a bumpy ride.
Initial bug report
Neil Mitchell opened a bug report against Stack.
The bug report detailed the fact that running stack
--resolver nightly build network
would fail, but using
stack unpack network
and then building the locally
unpacked source files did work. When it failed, the error message
was:
config.status: creating network.buildinfo
sed: -e expression #1, char 257: invalid reference \1 on `s' command's RHS
Initially, I didn't believe this was too serious a bug: it
appeared that unpacking locally was a decent workaround. I didn't
understand that this bug would prevent using AppVeyor for any
testing of compatibility of Haskell source code on Windows. My
initial poking and prodding at the issue was therefore minimal.
Once I was shown that this was more serious than I thought, I spent
significant time in clusters over the past few weeks trying to
understand what was going on.
First guesses
A first guess at the problem here is that we had two different
versions of the sed
executable at play, and one was
misbehaving. This even lines up with the bug triggering on
AppVeyor. AppVeyor itself will install its own sed.exe
as part of its Git installation, and Stack will provide a
sed.exe
as part of the msys
environment.
For some reason, Stack was setting different environment variables
for different ways of building, causing the buggy behavior.
However, this hypothesis left some questions:
- Why would a change in GHC version cause Stack to pass in
different environment variables?
- How could two different sources for a
sed.exe
executable have such different behaviors?
Since I didn't like this hypothesis too much, I decided to try
to disprove it. I tried installing a modified sed.exe
that would pass its arguments over to the real
sed.exe
. I got that sort of working, but it
turned out that I'd messed up the actual passing in of the
arguments. I didn't end up figuring out what I did wrong here.
However, the AppVeyor log showed me exactly
what I needed to know:
Both the stack unpack
and stack build
network
versions of the build called my modified
sed.exe
. This disproved the initial
hypothesis well enough for me: the problem was not in
different executables being called. Something else had to be
afoot.
What's GHC got to do with this?
Let's take a step back and understand exactly what Stack is
doing when this bug gets triggered. With the stack build
network
case, it starts off by downloading the
network-2.7.0.0.tar.gz
tarball and unpacking it to a
temporary directory. In the stack unpack
network-2.7.0.0
case, the downloading and unpacking happen
during the unpack
call, and happen to the current
directory, not a temporary directory. Once we have the files on the
filesystem, the following steps occur:
- Stack compiles the
Setup.hs
file from network into
Setup.exe
. This module uses the Cabal library that
ships with GHC.
- Stack calls
Setup.exe configure
with the
PATH
environment variable set appropriately to find
build tools (like GHC, gcc, bash, sed, etc).
Setup.exe configure
runs the configure script that
ships with the network package, which ultimately has the
sed
failure.
Looking at these steps, there are only two places where GHC
could have any impact:
- The directories set for the
PATH
to find the GHC
executable
- Which Cabal library is used
The former doesn't seem likely to be the problem, since the
directory holding ghc.exe
seems unrelated to how
sed.exe
is being called. However, the latter—which
Cabal library is used—seems promising. My new hypothesis was: the
GHC version at play isn't actually causing the bug, it's the Cabal
library version.
I tested this hypothesis by using GHC 8.2.2, which initially
worked perfectly, and installing a newer version of the Cabal
library. I ended up doing this manually in a remote desktop session
to AppVeyor, so I have no scripts to show. But the result was
exactly what I was hoping for: once I installed Cabal 2.2 with GHC
8.2.2, it failed just like GHC 8.4.3 did. Hurrah! We narrowed down
the source of the problem.
Unpack location
This gets us a bit closer, but we still don't know why the two
ways of unpacking the network package make a difference. But let me
paraphrase myself from above:
With the stack build network
case, it unpacks to a
temporary directory. In the stack unpack
network-2.7.0.0
case, it unpacks to the current
directory.
Alright, that's somewhat promising. New hypothesis: for some
reason, unpacking to a temporary directory causes the breakage.
Easy enough to test. Let's manually stack unpack
into
a “temporary” directory. I looked at the logs from the stack
build network
case, and found that it was building from
C:\Users\appveyor\AppData\Local\Temp\1\stack2232\network-2.7.0.0
.
So I manually ran some commands along the lines of:
mkdir C:\Users\appveyor\AppData\Local\Temp\1\stack2232\
cd C:\Users\appveyor\AppData\Local\Temp\1\stack2232\
stack unpack network-2.7.0.0
cd network-2.7.0.0
echo "resolver: ghc-8.4.3" > stack.yaml
stack build
And, like magic, I reproduced the bug! This was a great
breakthrough, it was the first time I reproduced the bug
and had access to modify the source code at the same time
(without tweaking Stack's behavior). The question was: what was
wrong with that directory?
The path length limit
I've been burned many times in the past by Window's path length
limit, which is set somewhere around 256 characters. (Yeah, it's
sort of 260, and yeah, it's sort of been relaxed. But I'm
paranoid.) So my first guess was that this temp directory was too
long. So I decided to progressively move up the directory tree to
see where things stopped failing.
I tried building inside
C:\Users\appveyor\AppData\Local\Temp\1\
next, and the
build failed. I moved up another directory and built in
C:\Users\appveyor\AppData\Local\Temp\
. And suddenly
the build succeeded. I hadn't expected that: I'd only removed two
characters from the path.
Eureka!
I stared at the subdirectory I'd just removed, 1\
.
And I had a crazy thought. I ran something like:
mkdir \1
cd \1
stack unpack network-2.7.0.0
cd network-2.7.0.0
echo "resolver: ghc-8.4.3" > stack.yaml
stack build
And, even with this tiny path, the build failed! There seemed to
be something magical about the \1
subdirectory. I was
able to use this approach to reliably reproduce the bug on my local
Windows machine too, which was a much more pleasant continuation of
debugging than remotely connecting to AppVeyor (although the RDP
setup for AppVeyor is an amazing feature). Awesome!
Next I decided to fuzz this a bit more, and built in
c:\2
. This time, my error message changed from:
sed: -e expression #1, char 213: invalid reference \1 on `s' command's RHS
to
sed: -e expression #1, char 213: invalid reference \2 on `s' command's RHS
That's when it hit me: sed
was, for some reason,
suddenly interpreting components of the file path as part of its
substitution command. Weird!
Cabal's fault!
Now I knew the problem was related to paths, and that there was
a change in Cabal 2.2 versus Cabal 2.0 that triggered this bug.
Therefore, there must be some difference between these two
versions and how they're calling the configure
script.
After some searching, I found the current call to the configure
script:
configureFile <- makeAbsolute $
fromMaybe "." (takeDirectory <$> cabalFilePath lbi) </> "configure"
...
let args' = configureFile:args ++ ["CC=" ++ ccProgShort]
...
runProgramInvocation verbosity $
(programInvocation (sh {programOverrideEnv = overEnv}) args')
{ progInvokeCwd = Just (buildDir lbi) }
Which roughly translates to:
- Find the
configure
file in the same directory of
the cabal file
- Get an absolute path to that file
- Construct a set of command line arguments with the configure
script as the first argument
- Pass all of this to
sh
, setting the working
directory to the build directory
OK, all of that seems reasonable. But for some reason, the path
to the config file is being interpreted as sed commands. It turns
out that you can replace backslashes with forward slashes in
Windows file paths, so my next step was to try this little ugly
hack:
let args' = map replace configureFile:args ++ ["CC=" ++ ccProgShort]
replace '\\' = '/'
replace c = c
And sure enough, this worked! Definitely zeroing in. My next
question was: what did this look like in Cabal 2.0? A quick call to
git diff origin/2.0 -- Cabal/Distribution/Simple.hs
revealed that:
+ configureFile <- makeAbsolute $
+ fromMaybe "." (takeDirectory <$> cabalFilePath lbi) </> "configure"
...
- args' = args ++ ["CC=" ++ ccProgShort]
+ args' = configureFile:args ++ ["CC=" ++ ccProgShort]
...
(programInvocation (sh {programOverrideEnv = overEnv}) args')
+ { progInvokeCwd = Just (buildDir lbi) }
...
- args = "./configure" : configureArgs backwardsCompatHack flags
+ args = configureArgs backwardsCompatHack flags
It seems that Cabal 2.0 was assuming that the current working
directory contained the configure
script, and
therefore wasn't passing in the full path on the command line. And
therefore, the funny paths couldn't possible be interpreted as sed
commands. Can we get back that same behavior? Sure, by make the
code instead look something like:
let configureFile = "./configure"
...
let args' = configureFile:args ++ ["CC=" ++ ccProgShort]
...
runProgramInvocation verbosity $
(programInvocation (sh {programOverrideEnv = overEnv}) args')
{ progInvokeCwd = fmap takeDirectory (cabalFilePath lbi) }
I tried this out, and the network package built correctly. I've
since filed an issue with Cabal about
this.
Cabal's fault?
Alright… but is that really a bug in Cabal? It passed a valid
path to the configure
script, which proceeded to barf.
After some discussion on the aforementioned issue, it became clear
that, in fact, this is a bug in autotools itself. Quasi Computation
sent a bug report upstream.
However, even if the bug is upstream, this is an issue we'll
need to deal with, since:
- There's no timeline on an autotools fix
- It will take time for the autotools fix to deploy
- Even after it deploys, broken
configure
scripts
will continue to exist in the wild, such as all existing versions
of the network
package
We can change the Cabal library to not trigger this bug with
some flavor of the workaround I mentioned above. However, Stack
will continue to use the version of Cabal that ships with GHC by
default, which for GHC 8.4.1, 8.4.2, and 8.4.3 will trigger this
bug.
Stack's workaround
One workaround for Stack would be kind of crazy: add some
conditional logic to, on Windows, when using Cabal 2.2, check if
the generated temporary directories include a backslash followed by
a digit, and then figure out some other directory to use. I
considered that for a bit, and then realized we had a simpler
solution, so I sent a PR consisting of:
global:
STACK_ROOT: "c:\\sr"
+ # Override the temp directory to avoid sed escaping issues
+ # See https://github.com/haskell/cabal/issues/5386
+ TMP: "c:\\tmp"
+
All we've done is changed the temporary directory used when
building on AppVeyor. This works well enough, as to my knowledge
most Windows systems do not have such weird temporary directory
paths. If others are affected by this in the future, we can make
further tweaks to either Stack itself, the Stack installer, or
documentation.
Last unsolved question
I have one final question that I haven't answered yet. I was
able to reproduce the bug using Cabal-the-library and a manually
created triggering directory. But what about using
cabal-install
to install network
? Well,
that seems to work just fine. It
seems that cabal-install
unpacks into
C:\Users\appveyor\AppData\Local\Temp\cabal-tmp-2952
.
Notice how it does not have a \1
style path
in it.
I've looked through both the Stack and
cabal-install
codebases briefly, and I cannot figure
out why they'd have different behavior for creating temporary
directories. But at this point, I've had enough of my debugging
adventure. Perhaps some curious reader will shed some light on the
last mystery here.
Takeaway
This was a challenging, frustrating, and ultimately fun bug.
It's not often we get to find bugs in such ubiquitous tools as
autoconf
. It took a lot of effort to unwind the
various layers of tooling. Getting a reliable repro locally was
hard, and AppVeyor's remote desktop was invaluable there. And the
debugging technique of “form a hypothesis, try to invalidate it”
helped me ultimately zero in on the real problem.
Subscribe to our blog via email
Email subscriptions come from our Atom feed and are handled by Blogtrottr. You will only receive notifications of blog posts, and can unsubscribe any time.
Do you like this blog post and need help with Next Generation Software Engineering, Platform Engineering or Blockchain & Smart Contracts? Contact us.