This blog post describes a new feature
in stack. Until now, multiple projects using the same snapshot
could share the binary builds of packages. However, two separate
snapshots could not share the binary builds of their packages, even
if they were substantially identical. That's now changing.
tl;dr: stack can now install new snapshots much more
quickly, and with less disk space usage, than before.
This has been a known shortcoming since stack was first
released. It's not coincidental that this support is being added
not long after
a similar project was completed for Cabal. Ryan Trinkle, Vishal's
mentor on that project, described the work to me a few months back,
and I decided to wait and see the outcome of the project before
working on the feature in stack.
The improvements to Cabal here are superb, and I'm thrilled to
see them happening. However, after reviewing the changes and
discussing them with a few stack developers and users, I decided to
implement a different approach that doesn't take advantage of the
new Cabal changes. The reasons are:
- As Herbert very aptly pointed out on Reddit:

    Since Stack sandboxes everything maximum sharing between LTS
    versions can easily be implemented going back to GHC 7.0
    without this new multi-instance support.
    This multi-instance support is needed if you want to accomplish
    the same thing without isolated sandboxes in a single package
    db.
- There are some usability concerns around a single massive
  database with all packages in it. Specifically, there are potential
  problems around getting GHC to choose a coherent set of packages
  when using something like ghci or runghc.
  Hopefully some concept of views will be added (as Duncan described
  in the original proposal), but the implications still need to
  be worked out.
- stack users are impatient (and I mean that in the best way
  possible). Why wait for a feature when we could have it now? While
  the Cabal Google Summer of Code project is complete, the changes
  are not yet merged to master, much less released. stack would need
  to wait until those changes are readily available to end users
  before relying on them.
stack's implementation
I came up with some complicated approaches to the problem, but
ultimately a
comment from Aaron Wolf rang true:
check the version differences and just copy compiled binaries
from previous LTS for unchanged items
It turns out that this is really easy. The implementation ends
up having two components:
- Whenever a snapshot package is built, write a precompiled
cache file containing the filepaths of the library's .conf file
(from inside the package database) and all of the executables
installed.
- Before building a snapshot package, check for a precompiled
cache file. If the file exists, copy over the executables and
register the .conf file into the new snapshot's database (see the
sketch below).
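Here's a rough sketch of those two components in Haskell. The type
and helper names are hypothetical, and the real stack code does
considerably more bookkeeping, but the shape is this simple:

```haskell
import System.Directory (doesFileExist)

-- Hypothetical cache record: what a finished build left behind.
data PrecompiledCache = PrecompiledCache
    { pcLibraryConf :: Maybe FilePath -- ^ the library's .conf file, if any
    , pcExecutables :: [FilePath]     -- ^ all executables that were installed
    } deriving (Show, Read)

-- Component 1: after building a snapshot package, record what was
-- installed.
writePrecompiledCache :: FilePath -> PrecompiledCache -> IO ()
writePrecompiledCache cacheFile pc = writeFile cacheFile (show pc)

-- Component 2: before building, look for a matching cache file. On a
-- hit, the caller copies the executables over and registers the .conf
-- file into the new snapshot's package database instead of rebuilding.
readPrecompiledCache :: FilePath -> IO (Maybe PrecompiledCache)
readPrecompiledCache cacheFile = do
    exists <- doesFileExist cacheFile
    if exists
        then Just . read <$> readFile cacheFile
        else return Nothing
```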
That precompiled cache file's path looks something like
this:
/home/vagrant/.stack/precompiled/ghc-7.10.2/1.22.4.0/aeson-0.8.0.2/Vr6rCTNr+UeoWMN1qGJGhFfxIDSFqTgJixKuD6TtVEQ\=
This encodes the GHC version, Cabal version, and package name and
version. The final path component is a hash of all of the
configuration information, including flags, GHC options, and
dependencies. Putting that hash in the filepath ensures that when we
look up a precompiled package, we get something that matches what
we'd be building ourselves now.
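As a sketch of how such a path could be computed, here's one way to do
it, assuming a SHA256 digest rendered as base64 (using the cryptonite,
memory, and base64-bytestring packages). The helper names are mine,
not stack's, and the real implementation also escapes characters that
are unsafe in filenames, as the \= in the example above suggests:

```haskell
import           Crypto.Hash            (Digest, SHA256, hash)
import           Data.ByteArray         (convert)
import qualified Data.ByteString.Base64 as B64
import qualified Data.ByteString.Char8  as S8
import           Data.List              (sort)
import           System.FilePath        ((</>))

-- Hash everything that can affect the build output. Sorting makes the
-- result independent of the order in which the inputs were gathered.
configHash :: [String]  -- ^ cabal flags
           -> [String]  -- ^ GHC options
           -> [String]  -- ^ dependency package identifiers
           -> String
configHash flags ghcOpts deps =
    S8.unpack $ B64.encode $ convert digest
  where
    digest :: Digest SHA256
    digest = hash $ S8.pack $ unlines $ sort (flags ++ ghcOpts ++ deps)

-- Assemble the cache file's location from the pieces described above.
precompiledCachePath
    :: FilePath -- ^ stack root, e.g. /home/vagrant/.stack
    -> String   -- ^ GHC version, e.g. "7.10.2"
    -> String   -- ^ Cabal version, e.g. "1.22.4.0"
    -> String   -- ^ package name and version, e.g. "aeson-0.8.0.2"
    -> String   -- ^ configuration hash from configHash
    -> FilePath
precompiledCachePath root ghcVer cabalVer pkgIdent h =
    root </> "precompiled" </> ("ghc-" ++ ghcVer)
         </> cabalVer </> pkgIdent </> h
```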
We can get away with this approach in stack because of the
invariants of a snapshot, namely: each snapshot has precisely one
version of a package available, and therefore we have no need for
the new multi-instance installations GHC 7.10 supports. This also
means no concern around views: a snapshot database is, by its very
nature, a view.
Advantages
- Decreased compile times
- Decreased disk space usage
Downsides
- You can't reliably delete a single snapshot, as there can be
  files shared between different snapshots. Deleting a single
  snapshot was never an officially supported feature previously, but
  if you knew what you were doing, you could do it safely.
After discussing with others, this trade-off seems acceptable:
the overall decrease in disk space usage means that the desire to
delete a single snapshot will come up less often. When disk space
genuinely needs to be reclaimed, the recommended approach will be to
wipe all snapshots and start over, which (1) will be an infrequent
occurrence, and (2) thanks to the faster compile times, will be less
burdensome.