The split-image approach to building minimal runtime Docker images

The most common pattern for using Docker to build and deploy software in an image uses a single Dockerfile to build the software and produce the image that gets deployed. The basic pattern goes:

FROM base-image
RUN install-some-extra-build-tools
COPY . /build-directory
RUN /build-directory/build-my-software
CMD /run/my/software

This works, but you end up with a great deal of unnecessary cruft in the image that gets deployed. Does your software need its own source code and the tools used to build itself in order to run? Unless you're using an interpreted language like Ruby or Python, it probably doesn't, so why does it have to be in the deployed image? Disadvantages of this approach include:

Compilers are often huge (likely to be much bigger than your own software), which means the majority of the deployed image's contents are not used. That's a lot of pointless overhead.
Since your source code and vendor libraries are in the image, a security hole in your software could leak proprietary information.
Every extra component introduces a potential attack vector.

Instead, we should separate concerns: use one Dockerfile to create a build environment and use that to build our software, and another to create the deployed runtime image from the artifacts generated by the build. What follows is a simple example that uses a one-line "Hello, world" program written in Haskell (our preferred language, of course, but also illustrative since the compiler is not generally considered small). The full example is available on GitHub.

The conventional approach

What do we need to build this program? Let's just do the obvious: use the official haskell image. We'll start with the "conventional" approach, and work toward something better. Here's the Dockerfile:

FROM haskell:7.10.2
# [insert additional build and runtime requirements here]
RUN mkdir /artifacts
COPY src /src/
RUN ghc -o /artifacts/hello /src/Main.hs
CMD /artifacts/hello

We build the image using docker build -t haskell-hello ., and run it:

$ docker run --rm haskell-hello
Hello, world

Great, all done and ready to deploy! So how big is the image?

$ docker inspect -f '{{.Size}}' haskell-hello
715052740

It's ~700 MB, just to run a "Hello, world" program! There must be a better way.

The split-image approach

What do we need in the image to actually run this tiny program? Not very much at all: just a minimal Linux system with the libgmp shared library (which all programs compiled with GHC need unless special options are used). Conveniently, there is the ~4 MB haskell-scratch image for that (see our Haskell Web Server in a 5MB Docker Image blog post, but note that it is too minimal for most real-world Haskell programs; we suggest using something like ubuntu-with-libgmp instead). Here's the Dockerfile (in the run/ subdirectory) for the runtime image that we'll deploy:

FROM fpco/haskell-scratch:integer-gmp
# [insert additional runtime requirements here]
COPY artifacts /artifacts/
CMD ["/artifacts/hello"]

Where do the contents of the artifacts directory come from? That's the job of the build image's Dockerfile (in the build/ subdirectory):

FROM haskell:7.10.2
# [insert additional build requirements here]
VOLUME /artifacts
VOLUME /src
CMD ghc -o /artifacts/hello /src/Main.hs

This uses the same official Haskell image and compilation command as our original Dockerfile, but replaces the COPY and RUN instructions with VOLUME declarations and a CMD instruction. That means the source code is compiled when you docker run the image, not when you docker build it. That, in turn, lets us mount host directories into the container (which cannot be done during docker build) and expose the host's run/artifacts directory to the build, so that the compiler puts the artifacts where the runtime image's Dockerfile looks for them. To put it all together, run these commands:

$ docker build -t build_haskell-hello build/
$ docker run --rm \
    --volume="$PWD/build/src:/src" \
    --volume="$PWD/run/artifacts:/artifacts" \
    build_haskell-hello
$ docker build -t haskell-hello run/
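
The three steps above could be wrapped in a small script so they always run in the same order. This is only a sketch: the run and split_build helpers and the DRY_RUN variable are hypothetical names introduced here, and the mkdir step is an added assumption (it makes sure the host directory exists before it is mounted). The docker commands themselves are the ones shown above.

```shell
# Sketch of a wrapper for the split-image build.
# run(), split_build() and DRY_RUN are hypothetical names, not part of Docker.
set -eu

# Print each command, then execute it unless DRY_RUN=1.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

split_build() {
    # Assumption: create the host artifacts directory before mounting it.
    run mkdir -p "$PWD/run/artifacts"
    # 1. Build the image that contains the build environment.
    run docker build -t build_haskell-hello build/
    # 2. Compile, with source and artifacts mounted from the host.
    run docker run --rm \
        --volume="$PWD/build/src:/src" \
        --volume="$PWD/run/artifacts:/artifacts" \
        build_haskell-hello
    # 3. Build the runtime image from the generated artifacts.
    run docker build -t haskell-hello run/
}
```

Calling split_build from the project root runs all three steps and stops at the first failure. Calling it as DRY_RUN=1 split_build only prints the commands, which is handy for checking the mount paths.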

Notice that we also mounted the source code from the host. While we could have continued COPYing the source code into the image, mounting it has some advantages: you don't accumulate a large build image full of intermediate files every time you change the code (images you would then have to remember to clean up), and you can do incremental builds since intermediate files are preserved.

That was more complicated, but did it make a difference? Well...

$ docker inspect -f '{{.Size}}' haskell-hello
5466526

Down to ~5.5 MB, over two orders of magnitude better. I'd say that was worth it!
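
To make the "two orders of magnitude" claim concrete, the two byte counts reported above can be compared with plain shell arithmetic. The variable names are illustrative only, and shell arithmetic truncates to whole numbers.

```shell
# Image sizes in bytes, as reported by docker inspect above.
conventional=715052740
split=5466526

# Integer ratio of the two sizes (shell arithmetic truncates).
ratio=$(( conventional / split ))

echo "conventional: $(( conventional / 1000000 )) MB"
echo "split:        $(( split / 1000000 )) MB"
echo "ratio:        ${ratio}x"
```

The ratio comes out at 130, so the runtime image is a little over a hundred times smaller than the conventional one.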

Of course, everyone's projects are different, and require different trade-offs, so while the above is illustrative of the approach, you will tweak it as you see fit. You may prefer to COPY the source code into the image to minimize risk of leakage between iterations (at the expense of time and disk space). You may want to clear the artifacts directory between builds for the same reason. For more complex projects, there will be OS requirements shared between the build and runtime images, so it often makes sense to derive both from a common parent.
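
As one possible way to clear the artifacts directory between builds, a small helper can remove and recreate it before the build container runs. This is a sketch under assumptions: clean_artifacts is a hypothetical name, and the directory to clear is whatever host path you mount at /artifacts (run/artifacts in the example above).

```shell
# Hypothetical helper: empty an artifacts directory before a build.
clean_artifacts() {
    dir="${1:-}"
    # Refuse an empty argument rather than guessing a directory.
    if [ -z "$dir" ]; then
        echo "usage: clean_artifacts DIR" >&2
        return 1
    fi
    # Remove everything from the previous build, then recreate the
    # directory so it can be mounted into the build container.
    rm -rf -- "$dir"
    mkdir -p -- "$dir"
}
```

Running clean_artifacts run/artifacts before the docker run step means that nothing from an earlier build can end up in the runtime image by accident.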

Stack support

At FP Complete, we use and recommend this approach for deploying production software with Docker, but without easy-to-use tool support it is a bit cumbersome. Unsurprisingly, Stack has excellent support for this approach. The stack image container command will create a runtime image from artifacts generated during the build (optionally using Docker for the build as well). See Yesod hosting with Docker and Kubernetes for an example, and the Docker section of the user's guide for more details.
