An overview of what containerization is, the reasons to consider
running a legacy application in Docker containers, the process to
get it there, the issues you may run into, and next steps once you
are deploying with containers. You'll reduce the stress of
deployments, and take your first steps on the path toward no
downtime and horizontal scaling.
Note: This post focuses on simplifying deployment of the
application. It does not cover topics that may require
re-architecting parts of the application, such as high-availability
and horizontal scaling.
Concepts
What is a "Legacy" App?
There's no one set of attributes that typifies all legacy apps,
but common attributes include:
- Using the local filesystem for persistent storage, with data
files intermingled with application files.
- Running many services on one server, such as a MySQL database,
Redis server, Nginx web server, a Ruby on Rails application, and a
bunch of cron jobs.
- Installation and upgrades use a hodgepodge of scripts and
manual processes (often poorly documented).
- Configuration is stored in files, often in multiple places and
intermingled with application files.
- Inter-process communication uses the local filesystem (e.g.
dropping files in one place for another process to pick up) rather
than TCP/IP.
- Designed assuming a single instance of the application would run
on a single server.
Disadvantages of the legacy approach
- Automating deployments is difficult
- If you need multiple customized instances of the application,
it's hard to "share" a single server between multiple
instances.
- If the server goes down, it can take a while to replace due to
manual processes.
- Deploying new versions is a fraught manual or semi-manual
process which is hard to roll back.
- It's possible for test and production environments to drift
apart, which leads to problems in production that were not detected
during testing.
- You cannot easily scale horizontally by adding more instances
of the application.
What is "Containerization"?
"Containerizing" an application is the process of making it able
to run and deploy under Docker containers and similar technologies
that encapsulate an application with its operating system
environment (a full system image). Since containers provide the
application with an environment very similar to having full control
of a system, this is a way to begin modernizing the deployment of
the application while making minimal or no changes to the
application itself. This provides a basis for incrementally making
the application's architecture more "cloud-friendly."
Benefits of Containerization
- Deployment becomes much easier: replacing the whole container
image with a new one.
- It's relatively easy to automate deployments, even having them
driven completely from a CI (continuous integration) system.
- Rolling back a bad deployment is just a matter of switching
back to the previous image.
- It's very easy to automate application updates since there are
no "intermediate state" steps that can fail (either the whole
deployment succeeds, or it all fails).
- The same container image can be tested in a separate test
environment, and then deployed to the production environment. You
can be sure that what you tested is exactly the same as what is
running in production.
- Recovering a failed system is much easier, since a new
container with exactly the same application can be automatically
spun up on new hardware and attached to the same data stores.
- Developers can also run containers locally to test their work
in progress in a realistic environment.
- Hardware can be used more efficiently, by running multiple
containerized applications on a single host that ordinarily could
not easily share a single system.
- Containerizing is a good first step toward supporting
no-downtime upgrades, canary deployments, high availability, and
horizontal scaling.
Alternatives to containerization
- Configuration management tools like Puppet and Chef help with
some of the "legacy" issues, such as keeping environments
consistent, but they do not support "atomic" deployment or
rollback of the entire environment and application at once. A
deployment can still go wrong partway through, with no easy way to
roll everything back.
- Virtual machine images are another way to achieve many of the
same goals, and there are cases where it makes more sense to
perform the "atomic" deployment operations using entire VMs rather
than containers running on a host. The main disadvantage is that
hardware utilization may be less efficient, since VMs need
dedicated resources (CPU, RAM, disk), whereas containers can share
a single host's resources between them.
How to containerize
Preparation
Identify filesystem locations where persistent data is written
Since deploying a new version of the application is performed by
replacing the Docker image, any persistent data must be stored
outside of the container. If you're lucky, the application
already writes all its data to a specific path, but many legacy
applications spread their data all over the filesystem and
intermingle it with the application itself. Either way, Docker's
volume mounts let us expose the host's filesystem at specific
locations in the container's filesystem so that data survives from
one container to the next; we must therefore identify the
locations that hold persistent data.
You may at this stage consider modifying the application to
support writing all data within a single tree in the filesystem, as
that will simplify deployment of the containerized version.
However, this is not necessary if modifying the application is
impractical.
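One way to map out where data actually lives is to filter on file
modification times. A minimal sketch, where the demo directory and
one-day threshold are stand-ins for a real install path:

```shell
#!/usr/bin/env bash
set -e
# Demo: use find's modification-time filter to spot recently written
# files, which are candidates for persistent storage. /tmp/legacy-app
# is a stand-in for the real install path.
app_root="${APP_ROOT:-/tmp/legacy-app}"
mkdir -p "$app_root/data"
touch "$app_root/data/records.db"
# Files changed within the last day:
find "$app_root" -type f -mtime -1 | sort
```

On a real system you would point this at the application's install
root and look for writes that land outside any known data directory.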
Identify configuration files and values that will vary by environment
Since a single image should be usable in multiple environments
(e.g. test and production) to ensure consistency, any configuration
values that will vary by environment must be identified so that the
container can be configured at startup time. These could take the
form of environment variables, or of values within one or more
configuration files.
You may at this stage want to consider modifying the application
to support reading all configuration from environment variables, as
that will simplify containerizing it. However, this is not
necessary if modifying the application is impractical.
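For example, a small startup shim can read settings from environment
variables with development defaults; the variable names here are
hypothetical:

```shell
#!/usr/bin/env bash
set -eu
# Hypothetical settings read from the environment, with defaults
# suitable for local development; production overrides them via
# `docker run -e`.
: "${DB_HOST:=localhost}"
: "${DB_PORT:=5432}"
: "${APP_ENV:=development}"
echo "env=${APP_ENV} db=${DB_HOST}:${DB_PORT}"
```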
Identify services that can be easily externalized
The application may use some services running on the local
machine that are easy to externalize due to being highly
independent and supporting communication by TCP/IP. For example, if
you run a database such as MySQL or PostgreSQL or a cache such as
Redis on the local system, that should be easy to run externally.
You may need to adjust configuration to support specifying a
hostname and port rather than assuming the service can be reached
on localhost.
Creating the image
Create a Dockerfile that installs the application
If you already have the installation process automated via
scripts or using a configuration management tool such as Chef or
Puppet, this should be relatively easy. Start with an image of your
preferred operating system, install any prerequisites, and then run
the scripts.
If the current setup process is more manual, this will involve
some new scripting. But since the exact state of the image is
known, it's easier to script the process than it would be when you
have to deal with the potentially inconsistent state of a raw
system.
If you identified externalizable services earlier, you should
modify the scripts to not install them.
A simple example Dockerfile:
# Start with an official Ubuntu 16.04 Docker image
FROM ubuntu:16.04
# Install prerequisite Ubuntu packages
RUN apt-get update \
&& apt-get install -y <REQUIRED UBUNTU PACKAGES> \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Copy the application into the image
COPY . /app
# Run the app setup script
RUN /app/setup.sh
# Switch to the application directory
WORKDIR /app
# Specify the application startup script
CMD /app/start.sh
Startup script for configuration
If the application takes all its configuration as environment
variables already, then you don't need to do anything. However, if
you have environment-dependent configuration values in
configuration files, you will need to create an application startup
script that reads these values from environment variables and then
updates the configuration files.
A simple example startup script:
#!/usr/bin/env bash
set -e
# Append to the config file using $MYAPPCONFIG environment variable.
cat >>/app/config.txt <<END
my_app_config = "${MYAPPCONFIG}"
END
# Run the application using $MYAPPARG environment variable for an argument.
/app/bin/my-app --my-arg="${MYAPPARG}"
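An alternative to appending is to render the whole configuration
file from a template at startup, which keeps the result predictable
no matter how many times the script runs. A sketch, where the
template path and placeholder syntax are assumptions:

```shell
#!/usr/bin/env bash
set -e
# Render config.txt from a template by substituting a placeholder
# with the MYAPPCONFIG environment variable. Paths are stand-ins.
cfg_dir="${CFG_DIR:-/tmp/myapp}"
mkdir -p "$cfg_dir"
printf 'my_app_config = "@MYAPPCONFIG@"\n' > "$cfg_dir/config.txt.tmpl"
sed "s|@MYAPPCONFIG@|${MYAPPCONFIG:-default}|g" \
  "$cfg_dir/config.txt.tmpl" > "$cfg_dir/config.txt"
cat "$cfg_dir/config.txt"
```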
Push the image
After building the image (using docker build), it must be pushed
to a Docker registry so that it can be pulled on the machine where
it will be deployed (if you are deploying to the same machine the
image was built on, this is not necessary).
You can use Docker Hub for
images (a paid account lets you create private image repositories),
or most cloud providers also provide their own container registries
(e.g. Amazon ECR).
Give the image a tag (e.g. docker tag myimage
mycompany/myimage:mytag) and then push it (e.g. docker push
mycompany/myimage:mytag). Each image for a version of the
application should have a unique tag, so that you always know
which version you're using and so that images for older versions
are available to roll back to.
How to deploy
Deploying containers is a big topic, and this section focuses only
on directly running containers using docker commands. Tools like
docker-compose (for simple cases where all containers run on a
single server) and Kubernetes (for container orchestration across
a cluster) should be considered in real-world usage.
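For the single-server case, a docker-compose file can capture the
whole arrangement declaratively. A sketch using the same
placeholder image names, paths, and variables as the commands in
this section:

```yaml
version: "2"
services:
  db:
    image: postgres
    volumes:
      - /usr/local/var/docker/volumes/postgresql/data:/var/lib/postgresql/data
  myapp:
    image: myappimage:mytag
    ports:
      - "8080:80"
    environment:
      MYAPPCONFIG: myvalue
      MYAPPARG: myarg
    volumes:
      - /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata
    depends_on:
      - db
```

With compose, services on the same network reach each other by
service name (here, the app can connect to the host "db"), so an
explicit --link is not needed.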
Externalized services
Services you identified for externalization earlier can be run
in separate Docker containers that will be linked to the main
application. Alternatively, it is often easiest to outsource to
managed services. For example, if you are using AWS, using RDS for
a database or ElastiCache for a cache significantly simplifies your
life since they take care of maintenance, high availability, and
backups for you.
An example of running a Postgres database container:
docker run \
-d \
--name db \
-v /usr/local/var/docker/volumes/postgresql/data:/var/lib/postgresql/data \
postgres
The application
To run the application in a Docker container, use a command line
such as this:
docker run \
-d \
-p 8080:80 \
--name myapp \
-v /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata \
-e MYAPPCONFIG=myvalue \
-e MYAPPARG=myarg \
--link db:db \
myappimage:mytag
The -p argument exposes the container's port 80 on the host's port
8080, the -v argument sets up the volume mount for persistent data
(in hostpath:containerpath format), the -e arguments set
configuration environment variables (both -v and -e may be
repeated for additional volumes and variables), and the --link
argument links the database container so the application can
communicate with it. The container will be started with the
startup script you specified in the Dockerfile's CMD.
Upgrades
To upgrade to a new version of the application, stop the old
container (e.g. docker rm -f myapp) and start a new one with the
new image tag (this will require a brief downtime). Rolling back
is similar, except that you use the old image tag.
Additional considerations
"init" process (PID 1)
Legacy applications often run multiple processes, and it's not
uncommon for orphan processes to accumulate if there is no "init"
(PID 1) daemon to clean them up. Docker does not, by default,
provide such a daemon, so it's recommended to add one as the
ENTRYPOINT in your Dockerfile.
dumb-init is one example of a lightweight init daemon.
phusion/baseimage is a fully-featured base image that includes an
init daemon in addition to other services.
See our blog post dedicated to this topic: Docker
demons: PID-1, orphans, zombies, and signals.
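Wiring an init daemon in is a small Dockerfile change. A sketch
using dumb-init (the package availability and binary path are
assumptions; check how your distribution ships it, or download it
from Yelp's GitHub releases):

```dockerfile
FROM ubuntu:16.04
# Install dumb-init to act as PID 1
RUN apt-get update \
    && apt-get install -y dumb-init \
    && rm -rf /var/lib/apt/lists/*
# dumb-init forwards signals to the app and reaps orphaned processes
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
CMD ["/app/start.sh"]
```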
Daemons and cron jobs
The usual way to use Docker containers is to have a single
process per container. Ideally, any cron jobs and daemons can be
externalized into separate containers, but this is not always
possible in legacy applications without re-architecting them. There
is no intrinsic reason why containers cannot run many processes,
but it does require some extra setup since standard base images do
not include process managers and schedulers. Minimal process
supervisors, such as runit,
are more appropriate to use in containers than full-fledged systems
like systemd. phusion/baseimage
is a fully-featured base image that includes runit and cron, in
addition to other services.
Volume-mount permissions
It's common (though not necessarily recommended) to run all
processes in containers as the root user. Legacy applications
often have more complex user requirements, and may need to run as
a different user (or multiple processes as multiple users). This
can present a challenge when using volume mounts, because Docker
makes the mount points owned by root by default, which means
non-root processes will not be able to write to them. There are
two ways to deal with this.
The first approach is to create the directories on the host first,
owned by the correct UID/GID, before starting the container. Note
that since users in the container and on the host don't
necessarily match up, you have to be careful to use the same
numeric UID/GID as the container's user, and not merely the same
username.
The other approach is for the container itself to adjust the
ownership of the mount points during its startup. This has to
happen while running as root, before switching to a non-root user
to start the application.
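A sketch of the second approach, where the user name, data path,
and start script are placeholders:

```shell
#!/usr/bin/env bash
set -e
# Entrypoint sketch: while still root, hand the volume mount point
# to the application's user, then drop privileges. "nobody",
# /tmp/myappdata, and /app/start.sh are placeholders.
APP_USER="${APP_USER:-nobody}"
DATA_DIR="${DATA_DIR:-/tmp/myappdata}"
mkdir -p "$DATA_DIR"
if [ "$(id -u)" -eq 0 ]; then
  chown -R "$APP_USER" "$DATA_DIR"
fi
echo "data dir ready: $DATA_DIR"
# In the real image, this would now exec the app as the
# unprivileged user:
#   exec su -s /bin/sh -c '/app/start.sh' "$APP_USER"
```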
Database migrations
Database schema migrations always present a challenge for
deployments, because the database schema can be very tightly
coupled with the application. That makes controlling the timing of
the migration important, and it makes rolling back to an older
version of the application more difficult, since database
migrations can't always be rolled back easily.
A way to mitigate this is to take a staged approach to migrations:
when you need to make an incompatible schema change, split that
change over two application deployments. For example, if you want
to move a piece of data from one location to another, these would
be the phases:
- Write the data to both the old and new locations, and read it
from the new location. This means that if you roll the application
back to the previous version, any new data is still where it
expects to find it.
- Stop writing it to the old location.
Note that if you want to have deployments with no downtime, that
means running multiple versions of the application at the same
time, which makes this even more of a challenge.
Backing up data
Backing up a containerized application is usually easier than
backing up a non-containerized deployment. Data files can be
backed up from the host, and you don't risk intermingling data
files with application files because they are strictly separated. If
you've moved databases to managed services such as RDS, those can
take care of backups for you (at least if your needs are relatively
simple).
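For file data on a host volume, even a simple archive-based backup
works, since the data directory is cleanly separated. A sketch,
where the paths are stand-ins for real volume locations:

```shell
#!/usr/bin/env bash
set -e
# Archive the host-side volume directory. /tmp/myappdata stands in
# for the real volume path
# (e.g. /usr/local/var/docker/volumes/myappdata).
src="${SRC:-/tmp/myappdata}"
mkdir -p "$src"
stamp="$(date +%Y%m%d)"
tar -czf "/tmp/myappdata-${stamp}.tar.gz" \
  -C "$(dirname "$src")" "$(basename "$src")"
echo "wrote /tmp/myappdata-${stamp}.tar.gz"
```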
Migrating existing data
To transition the production application to the new
containerized version, you will need to migrate the old
deployment's data. How to do this will vary, but usually the
simplest approach is to stop the old deployment, back up all the
data, and restore it to the new deployment. This should be
practiced in advance, and will necessitate some downtime.
Conclusion
While it requires some up-front work, containerizing a legacy
application will help you get control of, automate, and minimize
the stress of deploying it. It sets you on a path toward
modernizing your application and supporting no-downtime
deployments, high availability, and horizontal scaling.
FP Complete has undertaken this process many times in addition
to building containerized applications from the ground up. If you'd
like to get on the path to modern and stress-free deployment of
your applications, you can learn more about our Devops and Consulting
services, or contact us straight away!