In this post, I describe my personal journey as a developer skeptical
of the seemingly ever-growing, ever more complex array of "ops"
tools, and how I moved towards adopting some of these practices, ideas
and tools. I write about how this journey has helped me write better
software and understand discussions with the ops team at work.
On being skeptical
I would characterise my attitude towards adopting technology in two
stages:
- Firstly, I am conservative and dismissive, in that I will usually
disregard any popular new technology as a bandwagon or trend. I'm a
slow adopter.
- Secondly, when I actually encounter a situation where I've suffered,
I'll then circle back to that technology and give it a try, and if I
can really find the nugget of technical truth in there, then I'll
adopt it.
Here are some things that I disregarded for a year or more before
trying: Emacs, Haskell, Git, Docker, Kubernetes, Kafka. The whole
NoSQL trend came, wreaked havoc, and went while I had my back turned,
but I am considering using Redis for a cache at the moment.
The humble app
If you’re a developer like me, you’re probably used to writing your
software, spending most of your time developing, and then finally
deploying it by creating a machine (dedicated or virtual), uploading a
binary of your software (or the source code if it’s interpreted), and
then running it under a copy-pasted systemd config or simply inside
GNU screen. It's a secret shame that I've done this, but it's the
reality.
You might use nginx to reverse-proxy to the service. Maybe you set up
a PostgreSQL or MySQL database on that machine. And then you walk away
and test out the system, and later you realise you need some slight
changes to the system configuration. So you SSH into the system and
make the small tweaks necessary, such as port settings, encoding
settings, or an additional package you forgot to add. Sound familiar?
But on the whole, your work here is done, and for most services this is
pretty much fine. Plenty of the services you have seen in the past 30
years have been running like this.
Disk failures are not that common
Rhetoric about processes going down due to hardware failure is
probably overblown. Hard drives don’t crash very often. They don’t
really wear out as quickly as they used to, and you can be running a
system for years before anything even remotely concerning happens.
Auto-deployment is better than manual
When you start to iterate a little bit quicker, you get bored of
manually building and copying and restarting the binary on the
system. This is especially noticeable if you forget the steps later
on.
If you’re a little bit more advanced, you might have some special
scripts or post-merge git hooks, so that when you push to your repo
the change is applied to the same machine: your CI machine holds some
associated credential (an SSH key or API key, say) that lets it upload
the binary and run a copy-and-restart command. Alternatively, you
might implement a polling system on the actual production machine
which checks whether any updates have occurred in Git and, if so,
pulls down a new binary. This is how we were doing things in e.g. 2013.
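
For reference, here is a minimal sketch of that kind of push-to-deploy,
expressed as a modern GitHub Actions-style workflow. The host, paths and
build script are hypothetical, and it assumes an SSH key for the deploy
user has already been provisioned on the CI runner.

```yaml
# Hypothetical push-to-deploy workflow; host, paths and build script are assumptions.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the binary
        run: ./build.sh                     # assumed to produce ./bin/my-app
      - name: Copy and restart on the server
        run: |
          # assumes the runner already has SSH access to deploy@prod.example.com
          scp bin/my-app deploy@prod.example.com:/opt/my-app/my-app.new
          ssh deploy@prod.example.com \
            'mv /opt/my-app/my-app.new /opt/my-app/my-app && sudo systemctl restart my-app'
```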
Backups become worth it
Eventually, if you're lucky, your service starts to become slightly
more important; maybe it’s used in a business, and people are actually
using it and storing valuable things in the database. You start to
think that backups are a good idea and worth the investment.
You probably also have a script to back up the database, or replicate
it on a separate machine, for redundancy.
Deployment staging
Eventually, you might have a staged deployment strategy: a developer
testing machine, a QA machine, a staging machine, and finally a
production machine. All of these are
configured in pretty much the same way, but they are deployed at
different times and probably the system administrator is the only one
with access to deploy to production.
It’s clear by this point that I’m describing a continuum from "hobby
project" to "enterprise serious business synergy solutions".
Packaging with Docker is good
Docker effectively collapses all of the system dependencies your
binary needs to run into one contained package. This is good, because
dependency management is hell. It's also highly wasteful, because its
level of granularity is very coarse. But this is a trade-off we accept
for the benefits.
Custodians of multiple processes are useful
Docker doesn’t have much to say about starting and restarting
services. I’ve explored using CoreOS with the hosting provider
DigitalOcean, simply running a fresh virtual machine with the given
Docker image.
However, you quickly run into the problem of starting up and tearing
down:
- When you start the service, you need certain liveness and health
checks, so that if the new service fails to start you keep the
existing instances running rather than tearing them down.
- If the process fails at any time while running, then you should also
restart it. I thought about this point a lot, and came to the
conclusion that it’s better to have your process restarted than to
assume that the reason it failed was so dangerous that the process
shouldn’t start again. More likely there was an exception or a memory
issue in some pathological case, which you can investigate in your
logging system; it doesn’t mean that your users should suffer
downtime.
- The natural progression of this functionality is to support
different rollout strategies. Do you want to switch everything to the
new system in one go, or do you want it deployed piece by piece?
It’s hard to fully appreciate the added value of ops systems like
Kubernetes, Istio/Linkerd, Argo CD, Prometheus, Terraform, etc. until
you decide to design a complete architecture yourself, from scratch,
the way you want it to work in the long term.
Kubernetes provides exactly that
What system happens to accept Docker images, provide custodianship and
rollout strategies, and make redeploys trivial? Kubernetes.
It provides the classical monitoring and custodian responsibilities
that plenty of other systems have provided in the past. However,
unlike simply running a process, testing that it’s fine, and then
turning off another process, Kubernetes buys into Docker all the
way. Processes are isolated from each other, in both the network and
the file system. Therefore, you can very reliably start and stop
services on the same machine. Nothing about a process's machine state
is persistent, so you are forced to design your programs such that
state is either explicitly ephemeral or stored elsewhere.
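
To make that concrete, here is a minimal sketch of a Kubernetes
Deployment showing the custodianship described above: liveness and
readiness probes, automatic restarts, and a rolling update
strategy. The app name, image, port and health endpoint are
assumptions.

```yaml
# Hypothetical Deployment; the app name, image, port and /healthz path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    type: RollingUpdate          # replace pods piece by piece
    rollingUpdate:
      maxUnavailable: 1          # keep existing instances serving during a rollout
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0
          ports:
            - containerPort: 8080
          livenessProbe:         # restart the container if this check fails
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:        # only route traffic once this check succeeds
            httpGet:
              path: /healthz
              port: 8080
```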
In the past it might have been a little scarier to have your database
running in such a system: what if it automatically wipes out the
database process? With today’s cloud-based deployments, it's more
common to use a managed database such as those provided by Amazon,
DigitalOcean, Google or Azure. The whole problem of updating and
backing up your database can pretty much be put to one
side. Therefore, you are free to mess with the configuration or
topology of your cluster as much as you like without affecting your
database.
Declarative is good, vendor lock-in is bad
A very appealing feature of a deployment system like Kubernetes is
that everything is automatic and declarative. You stick all of your
configuration in simple YAML files (which is also a curse because YAML
has its own warts and it's not common to find formal schemas for it).
This is also known as "infrastructure as code".
Ideally, you should have as much as possible about your infrastructure
in code checked in to a repo so that you can reproduce it and track
it.
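
For example, the Service that exposes the Deployment sketched earlier
is just another small YAML file kept in the same repo; the names and
ports here are again assumptions.

```yaml
# Hypothetical Service manifest, tracked in the same repo as the Deployment.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app            # matches the Deployment's pod labels
  ports:
    - port: 80             # port exposed inside the cluster
      targetPort: 8080     # the container's port
```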
There is also a much more straightforward path for migrating from one
service provider to another. Kubernetes is supported
on all the major service providers (Google, Amazon, Azure), therefore
you are less vulnerable to vendor lock-in. They also all provide
managed databases that are standard (PostgreSQL, for example) with
their normal wire protocols. If you were using the vendor-specific
APIs to achieve some of this, you'd be stuck on one vendor. I, for
example, am not sure whether to go with Amazon or Azure on a big
personal project right now. If I use Kubernetes, I am mitigating risk.
With something like Terraform you can go one step further, writing
code that can create your cluster completely from scratch. This
further mitigates vendor lock-in.
More advanced rollout
Your load balancer and your DNS can also be in code. Typically, nginx
is a load balancer that does the job. However, for more advanced
deployments such as A/B or blue/green deployments, you may need
something more advanced like Istio or Linkerd.
Do I really want to deploy a new feature to all of my users? Maybe;
that might be easier. Do I want to deploy a different way of marketing
my product on the website to all users at once? If I do that, I don’t
really know how effective it is. So I could instead do a deployment in
which half of my users see one page and the other half see
another. These kinds of deployments are straightforwardly achieved
with Istio/Linkerd-type service meshes, without having to change any
code in your app.
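
As a sketch of what that looks like with Istio, a 50/50 split is just
a weighted route in a VirtualService. The host and subset names are
assumptions, and the v1/v2 subsets are presumed to be defined in a
DestinationRule that isn't shown.

```yaml
# Hypothetical Istio VirtualService splitting traffic 50/50 between two versions.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app                 # the in-cluster Service name
  http:
    - route:
        - destination:
            host: my-app
            subset: v1       # defined in a DestinationRule (not shown)
          weight: 50
        - destination:
            host: my-app
            subset: v2
          weight: 50
```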
Relationship between code and deployed state
Let's think further than this.
You've set up your cluster with your provider, or Terraform. You've
set up your Kubernetes deployments and services. You've set up your CI
to build your project, produce a Docker image, and upload the images
to your registry. So far so good.
Suddenly, you’re wondering: how do I actually deploy this? How do I
call Kubernetes, with the correct credentials, to apply this new
Docker image to the appropriate deployment?
Actually, this is still an ongoing area of innovation. An obvious way
to do it: give your CI system credentials that allow it to run
kubectl, then set the deployment's image to the new image name, which
triggers a rollout. If the deployment fails, you can look at that
result in your CI dashboard.
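
In CI terms, that boils down to a step like the following sketch; the
deployment, container and registry names are assumptions, and it
presumes kubectl is installed on the runner and already configured
with cluster credentials.

```yaml
# Hypothetical CI step; assumes kubectl is installed and already authenticated.
- name: Deploy new image
  run: |
    kubectl set image deployment/my-app \
      my-app=registry.example.com/my-app:"${GITHUB_SHA}"
    kubectl rollout status deployment/my-app --timeout=120s
```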
However, the question comes up: what is currently actually deployed
to production? Do we really have infrastructure as code here?
It’s not as if I edited a file and that update got reflected; there’s
no file anywhere in Git that records what the current image
is. Head scratcher.
Ideally, you would have a repository somewhere which states exactly
which image should be deployed right now. And if you change it in a
commit, and then later revert that commit, you should expect that
production is also reverted to reflect the code, right?
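
In other words, the image tag itself should live in a tracked
manifest, something like this fragment (the name and tag are
hypothetical):

```yaml
# Fragment of a Git-tracked Deployment manifest; the tag pins exactly what
# should be running in production, bumped by a commit and undone by a revert.
containers:
  - name: my-app
    image: registry.example.com/my-app:v1.2.3
```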
ArgoCD
One system which attempts to address this is Argo CD. It implements
what it calls "GitOps": all of the system's state is reflected in a
Git repo somewhere. With Argo CD, after your GitHub/GitLab/Jenkins/Travis
CI system has pushed your Docker image to the Docker registry, it makes
a gRPC call to Argo, which becomes aware of the new image. As an
admin, you can now trivially look in the UI and click "Refresh" to
redeploy the new version.
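
The piece of configuration tying this together is an Argo CD
Application, itself just more YAML pointing at a Git repo of
manifests. A minimal sketch, with hypothetical repo, path and
namespace:

```yaml
# Hypothetical Argo CD Application; repo URL, path and namespace are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config.git
    targetRevision: HEAD
    path: k8s                      # directory of manifests to apply
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:                     # keep the cluster in sync with the repo
      prune: true
      selfHeal: true
```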
Infra-as-code
The common running theme in all of this is
infrastructure-as-code. It’s immutability. It’s declarative. It’s
reducing the number of steps that a human has to do or care
about. It’s about being able to rewind. It’s about redundancy. And
it’s about scaling easily.
When you really try to architect your own system, and your business
will lose money in the case of ops mistakes, then all of these
advantages of infrastructure as code start looking really attractive.
Before you really sit down and think about this stuff, however, it
is pretty hard to empathise or sympathise with the kinds of concerns
that people using these systems have.
There are some downsides to these tools, as with any:
- Docker is quite wasteful of time and space
- Kubernetes is undoubtedly complex, and leans heavily on YAML
- All abstractions are leaky, so tools like these all leak
Where the dev meets the ops
Now that I’ve started looking into these things and appreciating their
use, I interact a lot more with the ops side of our DevOps team at
work. I can be far more helpful in giving them the information they
need, and in writing apps which anticipate the kind of deployment that
is going to happen. The most difficult challenge is typically metrics
and logging; I’m talking about run-of-the-mill apps here, not
high-performance apps.
One way to bridge the gap between your ops team and your dev team,
therefore, might be an exercise meeting in which a dev person
literally sits down and designs an app architecture and infrastructure
from the ground up, using the existing tools they are aware of, and
then your ops team points out the advantages and disadvantages of the
proposed solution. Certainly, I think I would have benefited from such
mentorship, even for an hour or two.
It may be that your dev team and your ops team are completely separate
and everybody’s happy. The devs write code, they push it, and then it
magically works in production and nobody has any issues. That’s
completely fine. If anything it would show that you have a very good
process. In fact, that’s pretty much how I’ve worked for the past
eight years at this company.
However, you could derive some benefit from an exercise like this if
your teams are having difficulty communicating.
Finally, the tools in the ops world aren't perfect, and they're made
by us devs. If you have a hunch that you can do better than these
tools, you should learn more about them, and you might be right.
What we do
FP Complete are using a great number of these tools, and we're writing
our own, too. If you'd like to know more, email us at
[email protected].