This post is aimed at Haskellers who are roughly aware of how build
infrastructure works for Haskell.
But the topic may interest a general audience outside of the Haskell
community, so this post will briefly describe each part of the
infrastructure from the bottom up: compiling modules, building and
configuring packages, to downloading and storing those packages
online.
This post is a semi-continuation of last week's post on
Casa.
GHC
GHC is the de facto standard Haskell compiler. It knows how to load
packages and compile files, and produce binary libraries and
executables. It has a small database of installed packages, with a
simple command-line interface for registering and querying them:
$ ghc-pkg register yourpackage
$ ghc-pkg list
Apart from that, it doesn't know anything else about how to build
packages or where to get them.
Cabal
Cabal is the library which builds Haskell packages from a .cabal file
package description, which consists of a name, version, package
dependencies and build flags. To build a Haskell package, you create a
file (typically Setup.hs), with contents roughly like:
import Distribution.Simple -- from the Cabal library
main = defaultMain
This (referred to as a "Simple" build) creates a program that you can
run to configure, build and install your package.
$ ghc Setup.hs
$ ./Setup configure # Checks dependencies via ghc-pkg
$ ./Setup build # Compiles the modules with GHC
$ ./Setup install # Runs the register step via ghc-pkg
This file tends to be included in the source repository of your
package. And modern package build tools tend to create this file
automatically if it doesn't already exist. The reason the build system
works like this is so that you can have custom build setups: you can
make pre/post build hooks and things like that.
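As a rough illustration of such a hook, here is a minimal sketch of a Setup.hs that starts from the "Simple" behaviour and overrides only the post-build step. It assumes the package declares `build-type: Custom` in its .cabal file; the message printed is purely illustrative.

```haskell
-- Setup.hs: a sketch of a custom build with one post-build hook.
-- Assumes `build-type: Custom` in the package's .cabal file.
import Distribution.Simple

-- Start from the default "Simple" hooks and override a single field.
hooks :: UserHooks
hooks = simpleUserHooks
  { postBuild = \_args _flags _pkgDescr _localBuildInfo ->
      putStrLn "post-build hook: modules compiled" -- illustrative action
  }

main :: IO ()
main = defaultMainWithHooks hooks
```

The resulting program is driven the same way as before: ./Setup configure, then ./Setup build, and so on.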
But the Cabal library doesn't download packages or manage projects
consisting of multiple packages, etc.
Hackage
Hackage is an online archive of
versioned package tarballs. Anyone can upload packages to this
archive, where the package must have a version associated with it, so
that you can later download a specific instance of the package that
you want, e.g. text-1.2.4.0. Each package is restricted to a set of
maintainers (such as the author) who are able to upload to it.
The Hackage admins and authors are able to revise the .cabal package
description without publishing a new version, and regularly do. These
new revisions supersede previous revisions of the cabal files, while
the original revisions still remain available if specifically
requested (if supported by tooling being used).
cabal-install
There is a program called cabal-install which is able to download
packages from Hackage automatically and does some constraint solving
to produce a build plan. A build plan is the tool's choice of which
versions of package dependencies to use to build your package.
It might look like:
- base-4.12.0.0
- bytestring-0.10.10.0
- your-package-0.0
Version bounds (<2.1 and >1.3) are used by cabal-install as
heuristics to do the solving. It isn't actually known whether any of
these packages build together, or that the build plan will
succeed. It's a best guess.
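To make the role of version bounds concrete, here is a toy sketch of picking a version for a single dependency. It is not cabal-install's solver, which has to satisfy the bounds of many packages at once; the `Bounds` type and the function names are hypothetical and only model a ">lower and <upper" constraint.

```haskell
import Data.Version (Version, makeVersion)

-- Hypothetical, simplified bound: strictly above `lower`, strictly below `upper`.
data Bounds = Bounds { lower :: Version, upper :: Version }

-- Does a candidate version fall inside the bounds?
satisfies :: Bounds -> Version -> Bool
satisfies (Bounds lo hi) v = v > lo && v < hi

-- Pick the newest available version inside the bounds, if there is one.
pickNewest :: Bounds -> [Version] -> Maybe Version
pickNewest bounds available =
  case filter (satisfies bounds) available of
    []         -> Nothing
    candidates -> Just (maximum candidates)
```

Note that nothing here checks whether the chosen version actually compiles against the others; the bounds are only a claim made by the package author.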
Finally, once it has a build plan, it uses both GHC and the Cabal
library to build Haskell packages, by creating the aforementioned
Setup.hs automatically if it doesn't already exist, and running the
./Setup configure, build, etc. steps.
Stackage
As mentioned, the build plans produced by cabal-install are a best
guess based on constraint solving of version bounds. There is a matrix
of possible build plans, and the particular one you get may be
entirely novel, that no one has ever tried before. Some call this
"version hell".
To rectify this situation, Stackage is a "stable Hackage" service, which
publishes subsets of Hackage that are known to build and pass tests together,
called snapshots. There are nightly snapshots published, and long-term
snapshots called lts-1.0, lts-2.2, etc. which tend to steadily roll
along with the GHC release cycle. These LTS releases are intended to
be what people put in source control for their projects.
The Stackage initiative has been running since it was announced
in 2012.
stack
The stack program was created to specifically make reproducible
build plans based on Stackage. Authors include a stack.yaml file in their
project root, which looks like this:
snapshot: lts-1.2
packages: [mypackage1, mypackage2]
This tells stack that:
- We want to use the lts-1.2 snapshot, therefore any package
dependencies that we need for this project will come from there.
- Within this directory, there are two package directories that
we want to build.
The snapshot also indicates which version of GHC is used to build that
snapshot; so stack also automatically downloads, installs and
manages the GHC version for the user. GHC releases tend to come out
every 6 months to one year, depending on scheduling, so it's common to
have several GHC versions installed on your machine at once. This is
handled transparently out of the box with stack.
Additionally, we can add extra dependencies for when we have patched
versions of upstream libraries, which happens a lot in the fast-moving
world of Haskell:
snapshot: lts-1.2
packages: [mypackage1, mypackage2]
extra-deps: ["bifunctors-5.5.4"]
The build plan for Stack is easy: the snapshot is already a build
plan. We just need to add our source packages and extra dependencies
on top of the pristine build plan.
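The layering described above can be sketched as a small function over association lists. This is a simplified model rather than Stack's implementation: the `Plan` type and the function names are hypothetical, and it assumes that an upper layer simply replaces a same-named package in the layer beneath it.

```haskell
type PackageName = String
type PackageVersion = String

-- A plan maps each package name to the single version chosen for it.
type Plan = [(PackageName, PackageVersion)]

-- Put `upper` on top of `below`: entries in `upper` shadow
-- same-named entries in `below`.
layer :: Plan -> Plan -> Plan
layer upper below =
  upper ++ [ entry | entry@(name, _) <- below, name `notElem` map fst upper ]

-- Snapshot at the bottom, extra-deps over it, local packages on top.
buildPlan :: Plan -> Plan -> Plan -> Plan
buildPlan snapshot extraDeps locals = locals `layer` (extraDeps `layer` snapshot)
```

No solving happens at any point: every name resolves to exactly one version by lookup.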
Finally, once it has a build plan, it uses both GHC and the Cabal
library to build Haskell packages, by creating the aforementioned
Setup.hs automatically if it doesn't already exist, and running the
./Setup configure, build, etc. steps.
Pantry
Since new revisions of cabal files can be made available at any time,
a package identifier like bifunctors-5.5.4 is not reproducible. Its
meaning can change over time as new revisions become available. In
order to get reproducible build plans, we have to track "revisions"
such as bifunctors-5.5.4@rev:1.
Stack has a library called Pantry to store all of this package
metadata in an SQLite database on the developer's machine. It does
so in a content-addressable way (CAS),
so that every variation on version and revision of a package has a
unique SHA256 cryptographic hash summarising both the .cabal package
description, and the complete contents of the package.
This lets Stackage be precise. Stackage snapshots used to look
like this:
packages:
- hackage: List-0.5.2
- hackage: ListLike-4.2.1
...
Now they look like this:
packages:
- hackage: ALUT-2.4.0.3@sha256:ab8c2af4c13bc04c7f0f71433ca396664a4c01873f68180983718c8286d8ee05,4118
  pantry-tree:
    size: 1562
    sha256: c9968ebed74fd3956ec7fb67d68e23266b52f55b2d53745defeae20fbcba5579
- hackage: ANum-0.2.0.2@sha256:c28c0a9779ba6e7c68b5bf9e395ea886563889bfa2c38583c69dd10aa283822e,1075
  pantry-tree:
    size: 355
    sha256: ba7baa3fadf0a733517fd49c73116af23ccb2e243e08b3e09848dcc40de6bc90
So we're able to CAS-identify the .cabal file by a hash and length:
ALUT-2.4.0.3@sha256:ab8c2af4c13bc04c7f0f71433ca396664a4c01873f68180983718c8286d8ee05,4118
And we're able to CAS-identify the contents of the package:
pantry-tree:
  size: 355
  sha256: ba7baa3fadf0a733517fd49c73116af23ccb2e243e08b3e09848dcc40de6bc90
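To show how the `hackage:` identifier breaks down, here is a small parsing sketch. It is not Pantry's parser; the `CabalFileRef` type and its field names are hypothetical, and the only format assumed is the `name-version@sha256:<hex digest>,<length>` shape visible in the snapshot entries above.

```haskell
import Data.Char (isHexDigit)
import Data.List (stripPrefix)
import Text.Read (readMaybe)

-- Hypothetical record for the three parts of the identifier.
data CabalFileRef = CabalFileRef
  { packageId   :: String -- e.g. "ALUT-2.4.0.3"
  , cabalSha256 :: String -- hex digest of the .cabal file
  , cabalSize   :: Int    -- length of the .cabal file
  } deriving (Eq, Show)

-- Split "name-version@sha256:<hex>,<length>" into its parts.
-- Returns Nothing for anything that does not have that shape.
parseCabalFileRef :: String -> Maybe CabalFileRef
parseCabalFileRef s = do
  let (pkg, rest) = break (== '@') s
  hashAndSize <- stripPrefix "@sha256:" rest
  let (hash, sizePart) = break (== ',') hashAndSize
  sizeText <- stripPrefix "," sizePart
  size <- readMaybe sizeText
  if null pkg || null hash || not (all isHexDigit hash)
    then Nothing
    else Just (CabalFileRef pkg hash size)
```

An old-style entry such as List-0.5.2 has no hash part and is rejected by this parser, which reflects the difference between the two snapshot formats.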
Additionally, each and every file within the package is
CAS-stored. The "pantry-tree" refers to a list of CAS hash-length keys
(which is also serialised to a binary blob and stored in the same CAS
store as the files inside the tarball themselves). With every file
stored, we remove a lot of duplication that we had storing a whole
tarball for every single variation of a package.
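The de-duplication effect can be sketched with a tiny in-memory store. This is an illustration, not Pantry's schema: every type and function name is hypothetical, and `toyHash` is a non-cryptographic FNV-1a-style stand-in for SHA256, used only so that the sketch depends on nothing beyond `base`.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)

-- Stand-in for SHA256 (NOT cryptographic): an FNV-1a-style 64-bit hash.
toyHash :: String -> Word64
toyHash = foldl' step 14695981039346656037
  where step h c = (h `xor` fromIntegral (ord c)) * 1099511628211

-- A blob is keyed by its hash and its length.
type BlobKey = (Word64, Int)
type Store = [(BlobKey, String)]

keyOf :: String -> BlobKey
keyOf content = (toyHash content, length content)

-- Insert a blob; content that is already present is not stored again.
insertBlob :: String -> Store -> (BlobKey, Store)
insertBlob content store
  | any ((== key) . fst) store = (key, store)
  | otherwise = (key, (key, content) : store)
  where key = keyOf content

-- A tree lists the (file path, blob key) entries of one package.
type Tree = [(FilePath, BlobKey)]

-- Store every file of a package and return its tree.
insertPackage :: [(FilePath, String)] -> Store -> (Tree, Store)
insertPackage files store0 = foldr add ([], store0) files
  where
    add (path, content) (tree, store) =
      let (key, store') = insertBlob content store
      in ((path, key) : tree, store')
```

Inserting two versions of a package that differ in one file adds only the changed file's blob the second time; the unchanged files are shared by both trees.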
Parenthetically, the 01-index.tar that Hackage serves up with all
the latest .cabal files and revisions has to be downloaded every
time. As this file is quite large, this is slow and wasteful.
Another side point: Hackage Security is not needed or consulted for
this. CAS already allows us to know in advance whether what we are
receiving is correct or not, as stated elsewhere.
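The check implied here can be sketched as follows. The names are hypothetical, and the digest function is passed in as a parameter (SHA256 in the scheme described above) so that the sketch stays within `base` and makes no claim about which hashing library is used.

```haskell
-- Hypothetical key type: a digest plus a length, as in the snapshot entries.
data BlobKey = BlobKey
  { keyDigest :: String
  , keyLength :: Int
  } deriving (Eq, Show)

-- Accept received content only if it matches the key we asked for.
-- `digest` is the store's hash function, supplied by the caller.
verifyBlob :: (String -> String) -> BlobKey -> String -> Either String String
verifyBlob digest key content
  | length content /= keyLength key = Left "length mismatch"
  | digest content /= keyDigest key = Left "digest mismatch"
  | otherwise                       = Right content
```

Because the key is known before the download starts, a client does not need to trust the server: anything that fails the check is discarded.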
When switching to a newer snapshot, lots of packages will be updated,
but within each package, only a few files will have changed. Therefore
we only need to download those few files that are different. However,
to achieve that, we need an online service capable of serving up those
blobs by their SHA256...
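Working out what to fetch is then a set difference over blob keys. The sketch below is only a model of that idea, with hypothetical names and keys represented as (digest, length) pairs.

```haskell
import Data.List (nub)

-- Hypothetical key representation: (digest, length).
type BlobKey = (String, Int)

-- Keys the new snapshot needs that the local store does not have yet.
-- Each missing key is listed once, even if several packages share it.
missingBlobs :: [BlobKey] -> [BlobKey] -> [BlobKey]
missingBlobs local needed = nub [ key | key <- needed, key `notElem` local ]
```

Only the keys returned by `missingBlobs` have to be requested from the server; everything else is already on disk.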
Enter Casa
As announced in our casa post, Casa stands for
"content-addressable storage archive", and also means "home" in
Romance languages. It is an online service we're announcing to
store packages in a content-addressable way.
Now, the same process which produces Stackage snapshots can also:
- Download all package versions and revisions from Hackage, and store
them in a Pantry database.
- Download all Stackage snapshots, and store them in the same Pantry
database.
- Push all the unique CAS blobs stored in the Pantry database to
Casa, completing the circle.
Stack can now download all its assets needed to build a package from
Casa:
- Stackage snapshots.
- Cabal files.
- Individual package files.
Furthermore, the snapshot format of Stackage supports specifying
locations other than Hackage, such as a git repository at a given
commit, or a URL with a tarball. These would also be automatically
pushed to Casa, and Stack would download them from Casa automatically
like any other package. Parenthetically, Stackage does not currently
include packages from outside of Hackage, but Stack's custom
snapshots--which use the same format--do support that.
Internal Company Casas
Companies often run their own Hackage on their own network (or
IP-limited public server) and upload their custom packages to it, to
be used by everyone in the company.
With the advent of Stack, this became less needed because it's trivial
to fork any package on GitHub and then link to the Git repo in a
stack.yaml. Plus, it's more reproducible, because you refer to a hash
rather than a mutable version. Combined with the additional
Pantry-based SHA256+length described above, you don't have to trust
GitHub to serve the right content, either.
The Casa repository includes both the server and a (Haskell) client
library with which you can push arbitrary files to the Casa service.
Additionally, to populate your Casa server with everything from a
given snapshot, or all of Hackage, you can use casa-curator from the
curator repo, which is what we use ourselves.
If you're a company interested in running your own Casa server, please
contact us. Or, if you'd like to discuss the possibility of caching
packages in binary form and therefore skipping the build step
altogether, please contact us. Also contact us if you would like to
discuss storing GHC binary releases in Casa and having Stack pull from
it, to allow for a completely Casa-enabled toolchain.
Summary
Here's what we've brought to Haskell build infrastructure:
- Reliable, reproducible references to packages and their files.
- De-duplication of package files; fewer things to download, on your
dev machine or on CI.
- A server that is easy to use and rely on.
- A trivial way to run an archive of your own.
When you upgrade to Stack master or the next release of Stack, you
will automatically be using the Casa server.
We believe this CAS architecture has use in other language ecosystems,
not just Haskell. See the Casa post for more details.