In this post, I describe my personal journey as a developer skeptical
of the seemingly ever-growing, ever more complex array of "ops"
tools, and how I moved towards adopting some of these practices, ideas
and tools. I write about how this journey has helped me write better
software and understand discussions with the ops team at work.
On being skeptical
I would characterise my attitude towards adopting technology in two
stages:
- Firstly, I am conservative and dismissive, in that I will usually
disregard any popular new technology as a bandwagon or trend. I'm a
slow adopter.
- Secondly, when I actually encounter a situation where I've suffered,
I'll then circle back to that technology and give it a try, and if I
can really find the nugget of technical truth in there, then I'll
adopt it.
Here are some things that I disregarded for a year or more before
trying: Emacs, Haskell, Git, Docker, Kubernetes, Kafka. The whole
NoSQL trend came, wreaked havoc, and went while I had my back turned,
but I am considering using Redis for a cache at the moment.
The humble app
If you’re a developer like me, you’re probably used to writing your
software, spending most of your time developing, and then finally
deploying it by creating a machine (dedicated or virtual), uploading a
binary of your software (or the source code if it’s interpreted), and
then running it under a copy-pasted systemd config or simply inside
GNU screen. It's a secret shame that I've done this, but it's the
reality.
You might use nginx to reverse-proxy to the service. Maybe you set up
a PostgreSQL or MySQL database on that machine. And then you walk away
and test out the system, and later you realise you need some slight
changes to the system configuration. So you SSH into the system and
make the small tweaks necessary, such as port settings, encoding
settings, or an additional package you forgot to add. Sound familiar?
But on the whole, your work here is done, and for most services this is
pretty much fine. Plenty of the services you have seen in the past 30
years have been running like this.
Disk failures are not that common
Rhetoric about processes going down due to hardware failure is
probably overblown. Hard drives don’t crash very often. They don’t
really wear out as quickly as they used to, and you can be running a
system for years before anything even remotely concerning happens.
Auto-deployment is better than manual
When you start to iterate a little bit quicker, you get bored of
manually building and copying and restarting the binary on the
system. This is especially noticeable if you forget the steps later
on.
If you’re a little bit more advanced, you might have some special
scripts or post-merge git hooks, so that when you push to your repo
the change is applied to the same machine: your CI machine holds some
associated credential (an SSH key or API key, say) that lets it upload
the binary and run a copy-and-restart command. Alternatively, you
might implement a polling system on the actual production machine
which checks whether any updates have occurred in Git and, if so,
pulls down a new binary. This is how we were doing things in e.g. 2013.
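
For reference, here is a minimal sketch of that kind of push-to-deploy,
expressed as a modern GitHub Actions-style workflow. The host, paths and
build script are hypothetical, and it assumes an SSH key for the deploy
user has already been provisioned on the CI runner.

```yaml
# Hypothetical push-to-deploy workflow; host, paths and build script are assumptions.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the binary
        run: ./build.sh                     # assumed to produce ./bin/my-app
      - name: Copy and restart on the server
        run: |
          # assumes the runner already has SSH access to deploy@prod.example.com
          scp bin/my-app deploy@prod.example.com:/opt/my-app/my-app.new
          ssh deploy@prod.example.com \
            'mv /opt/my-app/my-app.new /opt/my-app/my-app && sudo systemctl restart my-app'
```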
Backups become worth it
Eventually, if you're lucky, your service starts to become slightly
more important; maybe it’s used in a business, and people are actually
using it and storing valuable things in the database. You start to
think that backups are a good idea and worth the investment.
You probably also have a script to back up the database, or replicate
it on a separate machine, for redundancy.
Deployment staging
Eventually, you might have a staged deployment strategy: a developer
testing machine, a QA machine, a staging machine, and finally a
production machine. All of these are
configured in pretty much the same way, but they are deployed at
different times and probably the system administrator is the only one
with access to deploy to production.
It’s clear by this point that I’m describing a continuum from "hobby
project" to "enterprise serious business synergy solutions".
Packaging with Docker is good
Docker effectively collapses all of the system dependencies your
binary needs to run into one contained package. This is good, because
dependency management is hell. It's also highly wasteful, because its
level of granularity is very coarse. But this is a trade-off we accept
for the benefits.
Custodians of multiple processes are useful
Docker doesn’t have much to say about starting and restarting
services. I’ve explored using CoreOS with the hosting provider
DigitalOcean, simply running a fresh virtual machine with the given
Docker image.
However, you quickly run into the problem of starting up and tearing
down:
- When you start the service, you need certain liveness and health
checks, so that if the new service fails to start you keep the
existing instances running rather than tearing them down.
- If the process fails at any time while running, then you should also
restart it. I thought about this point a lot, and came to the
conclusion that it’s better to have your process restarted than to
assume that the reason it failed was so dangerous that the process
shouldn’t start again. More likely there was an exception or a memory
issue in some pathological case, which you can investigate in your
logging system; it doesn’t mean that your users should suffer
downtime.
- The natural progression of this functionality is to support
different rollout strategies. Do you want to switch everything to the
new system in one go, or do you want it deployed piece by piece?
It’s hard to fully appreciate the added value of ops systems like
Kubernetes, Istio/Linkerd, Argo CD, Prometheus, Terraform, etc. until
you decide to design a complete architecture yourself, from scratch,
the way you want it to work in the long term.
Kubernetes provides exactly that
What system happens to accept Docker images, provide custodianship and
rollout strategies, and make redeploys trivial? Kubernetes.
It provides the classical monitoring and custodian responsibilities
that plenty of other systems have provided in the past. However,
unlike simply running a process, testing that it’s fine, and then
turning off another process, Kubernetes buys into Docker all the
way. Processes are isolated from each other, in both the network and
the file system. Therefore, you can very reliably start and stop
services on the same machine. Nothing about a process's machine state
is persistent, so you are forced to design your programs such that
state is either explicitly ephemeral or stored elsewhere.
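
To make that concrete, here is a minimal sketch of a Kubernetes
Deployment showing the custodianship described above: liveness and
readiness probes, automatic restarts, and a rolling update
strategy. The app name, image, port and health endpoint are
assumptions.

```yaml
# Hypothetical Deployment; the app name, image, port and /healthz path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate          # replace pods piece by piece
    rollingUpdate:
      maxUnavailable: 1          # keep existing instances serving during a rollout
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:         # restart the container if this check fails
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:        # only route traffic once this check succeeds
            httpGet:
              path: /healthz
              port: 8080
```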
In the past it might have been a little scarier to have your database
running in such a system: what if it automatically wipes out the
database process? With today’s cloud-based deployments, it's more
common to use a managed database such as those provided by Amazon,
DigitalOcean, Google or Azure. The whole problem of updating and
backing up your database can pretty much be put to one
side. Therefore, you are free to mess with the configuration or
topology of your cluster as much as you like without affecting your
database.
Declarative is good, vendor lock-in is bad
A very appealing feature of a deployment system like Kubernetes is
that everything is automatic and declarative. You stick all of your
configuration in simple YAML files (which is also a curse because YAML
has its own warts and it's not common to find formal schemas for it).
This is also known as "infrastructure as code".
Ideally, you should have as much as possible about your infrastructure
in code checked in to a repo so that you can reproduce it and track
it.
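
For example, the Service that exposes the Deployment sketched earlier
is just another small YAML file kept in the same repo; the names and
ports here are again assumptions.

```yaml
# Hypothetical Service manifest, tracked in the same repo as the Deployment.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app            # matches the Deployment's pod labels
  ports:
    - port: 80             # port exposed inside the cluster
      targetPort: 8080     # the container's port
```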
There is also a much more straightforward path for migrating from one
service provider to another. Kubernetes is supported
on all the major service providers (Google, Amazon, Azure), therefore
you are less vulnerable to vendor lock-in. They also all provide
managed databases that are standard (PostgreSQL, for example) with
their normal wire protocols. If you were using the vendor-specific
APIs to achieve some of this, you'd be stuck on one vendor. I, for
example, am not sure whether to go with Amazon or Azure on a big
personal project right now. If I use Kubernetes, I am mitigating risk.
With something like Terraform you can go one step further, writing
code that can create your cluster completely from scratch. This
further mitigates vendor lock-in.
More advanced rollout
Your load balancer and your DNS can also be in code. Typically, nginx
is a load balancer that does the job. However, for more advanced
deployments such as A/B or blue/green deployments, you may need
something more advanced like Istio or Linkerd.
Do I really want to deploy a new feature to all of my users? Maybe;
that might be easier. Do I want to deploy a different way of marketing
my product on the website to all users at once? If I do that, I don’t
really know how effective it is. So I could instead do a deployment in
which half of my users see one page and the other half see
another. These kinds of deployments are straightforwardly achieved
with Istio/Linkerd-type service meshes, without having to change any
code in your app.
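
As a sketch of what that looks like with Istio, a 50/50 split is just
a weighted route in a VirtualService. The host and subset names are
assumptions, and the v1/v2 subsets are presumed to be defined in a
DestinationRule that isn't shown.

```yaml
# Hypothetical Istio VirtualService splitting traffic 50/50 between two versions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app                 # the in-cluster Service name
  http:
    - route:
        - destination:
            host: my-app
            subset: v1       # defined in a DestinationRule (not shown)
          weight: 50
        - destination:
            host: my-app
            subset: v2
          weight: 50
```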
Relationship between code and deployed state
Let's think further than this.
You've set up your cluster with your provider, or Terraform. You've
set up your Kubernetes deployments and services. You've set up your CI
to build your project, produce a Docker image, and upload the images
to your registry. So far so good.
Suddenly, you’re wondering: how do I actually deploy this? How do I
call Kubernetes, with the correct credentials, to apply this new
Docker image to the appropriate deployment?
Actually, this is still an ongoing area of innovation. An obvious way
to do it: give your CI system credentials that allow it to run
kubectl, then set the deployment's image to the new image name, which
triggers a rollout. If the deployment fails, you can look at that
result in your CI dashboard.
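
In CI terms, that boils down to a step like the following sketch; the
deployment, container and registry names are assumptions, and it
presumes kubectl is installed on the runner and already configured
with cluster credentials.

```yaml
# Hypothetical CI step; assumes kubectl is installed and already authenticated.
- name: Deploy new image
  run: |
    kubectl set image deployment/my-app \
      my-app=registry.example.com/my-app:"${GITHUB_SHA}"
    kubectl rollout status deployment/my-app --timeout=120s
```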
However, the question comes up: what is currently actually deployed
to production? Do we really have infrastructure as code here?
It’s not as if I edited a file and that update got reflected; there’s
no file anywhere in Git that records what the current image
is. Head scratcher.
Ideally, you would have a repository somewhere which states exactly
which image should be deployed right now. And if you change it in a
commit, and then later revert that commit, you should expect that
production is also reverted to reflect the code, right?
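
In other words, the image tag itself should live in a tracked
manifest, something like this fragment (the name and tag are
hypothetical):

```yaml
# Fragment of a Git-tracked Deployment manifest; the tag pins exactly what
# should be running in production, bumped by a commit and undone by a revert.
containers:
  - name: my-app
    image: registry.example.com/my-app:v1.2.3
```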
ArgoCD
One system which attempts to address this is Argo CD. It implements
what it calls "GitOps": all of the system's state is reflected in a
Git repo somewhere. With Argo CD, after your GitHub/GitLab/Jenkins/Travis
CI system has pushed your Docker image to the Docker registry, it makes
a gRPC call to Argo, which becomes aware of the new image. As an
admin, you can now trivially look in the UI and click "Refresh" to
redeploy the new version.
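
The piece of configuration tying this together is an Argo CD
Application, itself just more YAML pointing at a Git repo of
manifests. A minimal sketch, with hypothetical repo, path and
namespace:

```yaml
# Hypothetical Argo CD Application; repo URL, path and namespace are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config.git
    targetRevision: HEAD
    path: k8s                      # directory of manifests to apply
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:                     # keep the cluster in sync with the repo
      prune: true
      selfHeal: true
```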
Infra-as-code
The common running theme in all of this is
infrastructure-as-code. It’s immutability. It’s declarative. It’s
reducing the number of steps that a human has to do or care
about. It’s about being able to rewind. It’s about redundancy. And
it’s about scaling easily.
When you really try to architect your own system, and your business
will lose money in the case of ops mistakes, then all of these
advantages of infrastructure as code start looking really attractive.
Before you really sit down and think about this stuff, however, it
is pretty hard to empathise or sympathise with the kinds of concerns
that people using these systems have.
There are some downsides to these tools, as with any:
- Docker is quite wasteful of time and space
- Kubernetes is undoubtedly complex, and leans heavily on YAML
- All abstractions are leaky, so tools like these all leak
Where the dev meets the ops
Now that I’ve started looking into these things and appreciating their
use, I interact a lot more with the ops side of our DevOps team at
work. I can be far more helpful in giving them the information they
need, and in writing apps which anticipate the kind of deployment that
is going to happen. The most difficult challenge is typically metrics
and logging; I’m talking about run-of-the-mill apps here, not
high-performance apps.
One way to bridge the gap between your ops team and your dev team,
therefore, might be an exercise meeting in which a dev person
literally sits down and designs an app architecture and infrastructure
from the ground up, using the existing tools they are aware of, and
then your ops team points out the advantages and disadvantages of the
proposed solution. Certainly, I think I would have benefited from such
mentorship, even for an hour or two.
It may be that your dev team and your ops team are completely separate
and everybody’s happy. The devs write code, they push it, and then it
magically works in production and nobody has any issues. That’s
completely fine. If anything it would show that you have a very good
process. In fact, that’s pretty much how I’ve worked for the past
eight years at this company.
However, you could derive some benefit from an exercise like this if
your teams are having difficulty communicating.
Finally, the tools in the ops world aren't perfect, and they're made
by us devs. If you have a hunch that you can do better than these
tools, you should learn more about them, and you might be right.
What we do
FP Complete are using a great number of these tools, and we're writing
our own, too. If you'd like to know more, email us at
[email protected].