Even among skilled enterprise IT departments, it is too rare
that software is thoroughly tested before deployment. Failed
deployments mean costly downtime, service failures, upset users,
and even security breaches. How can we verify that a solution is
actually ready to deploy, free of serious defects?
You probably need more kinds of tests
We’ve all seen “ready to deploy” applications that do not work
as expected once deployed. Often it’s because the production system
is not in fact identical to the staging system, so the testing
wasn’t valid. This can be prevented with another devops best
practice, automated deployments -- which we talked about in a recent post and will return to again.
But often the problem is that the software, even on a properly
configured test system and staging system, was never fully
tested. Before an app is approved for deployment, your QA
system (mostly automated) should complete:
- Success tests
- Failure tests
- Corner-case tests
- Randomized tests (also called mutation tests, fuzz tests)
- Load and performance tests
- Usability tests
- Security tests
(If you were not using automated, reproducible deployments, you
would also have to do explicit pre-tests of the deployment script
itself -- but you are doing fully automated deployments,
right? If you aren’t, consider moving to the sorts of tools we use
in FP Deploy -- like Docker and Kubernetes, Puppet and
Ansible.)
In the rest of this article we’ll see how each kind of testing
adds something different and important.
Success testing: does it do what it should?
Most operations teams won’t accept a deployment from the
engineering group (dev or test or QA) unless the system has at
least passed success testing. When presented with correct
inputs, does the system generate correct outputs and not crash?
It’s the most basic testing, yet it’s often left incomplete. To
avoid serious omissions, work from an explicit checklist of the
behaviors your specification promises, and verify each one -- for
example, with tests like the sketch below.
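For illustration, here is a minimal success-test sketch in Haskell using hspec; `applyDiscount` is a hypothetical stand-in for whatever behavior your own specification actually promises.

```haskell
-- A minimal success-test sketch using hspec. `applyDiscount` is a
-- hypothetical stand-in for real application code.
import Test.Hspec

-- Hypothetical: apply a percentage discount to a price in cents.
applyDiscount :: Int -> Int -> Int
applyDiscount percent price = price - (price * percent) `div` 100

main :: IO ()
main = hspec $ describe "applyDiscount" $
  it "produces the documented output for a typical correct input" $
    applyDiscount 10 2000 `shouldBe` 1800
```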
Believe it or not, that was the easy part. For enterprise
production quality, you still want to test your system six more
ways. Jumping right into number two...
Failure testing: when you break the law, do you go to jail?
Your testing is all under an automated set of continuous
integration (CI) scripts, right? (If not, time to look into that.)
But do you have tests that force all of the specified error
conditions to occur? Do you pass in every identified kind of
prohibited/invalid input? Do you also create all the realistic
external error conditions, like a network link failure, a timeout,
a full disk, low memory (with a tool like Chaos Monkey)?
A test suite that doesn’t check whether specified error conditions
actually generate the right errors is not complete. We recommend
that the QA team present a report showing a list of all existing
tests, what conditions they purport to test, and when they were
last run and passed.
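To make that concrete, here is a small hspec sketch of failure tests; `parseQuantity` and `OrderError` are hypothetical stand-ins for your real code and its specified error conditions. The point is that each specified error is deliberately forced to occur, and the test checks for the right error value, not just "something failed."

```haskell
-- A failure-test sketch using hspec. `parseQuantity` and `OrderError`
-- are hypothetical stand-ins for the code under test and its
-- specified error conditions.
import Test.Hspec

data OrderError = NegativeQuantity | NotANumber
  deriving (Show, Eq)

-- Hypothetical parser for an order-quantity field.
parseQuantity :: String -> Either OrderError Int
parseQuantity s = case reads s of
  [(n, "")] | n >= 0    -> Right n
            | otherwise -> Left NegativeQuantity
  _                     -> Left NotANumber

main :: IO ()
main = hspec $ describe "parseQuantity" $ do
  it "rejects a negative quantity with the specified error" $
    parseQuantity "-3" `shouldBe` Left NegativeQuantity
  it "rejects non-numeric input with the specified error" $
    parseQuantity "twenty" `shouldBe` Left NotANumber
```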
Corner-case testing: try something crazy
Maybe you’ve heard the joke: a QA engineer walks into a bar, and
orders a beer, and 2 beers, and 20 beers, and 0 beers, and a
million beers, and -1 beers. And a duck.
Corner-case testing means success testing using unrealistic
but legal inputs. Often, developers write code that works
correctly in typical cases, but fails in the extremes.
Before deploying, consider: are any of your users going to try
anything crazy? What would be the oddest things still permitted,
and what happens if you try? Who tested that and verified that the
output was correct? Correct code works on all permitted
inputs, not just average ones. This is a fast way to find
bugs before deployment -- push the system right to the edge of what
it should be able to do.
Corner cases vary by application, but here are some typical
examples to spark your thinking. Where strings are permitted, what
happens if they are in a very differently structured language, like
Chinese or Arabic? What happens if they are extremely long? Where
numbers are permitted, what happens if they are very large, very
small, zero, negative? And why is the permitted range as big as it
is; should it be reduced? Is it legal to request output of a
billion records, and what happens if I do? Where options are
permitted, what if someone chooses all of them, or a bizarre
mixture? Can I order a pizza with 30 toppings? Can I prescribe 50
medicines, at 50 bottles each, for a sample patient? What happens
if nested or structured inputs are extremely complex? Can I send an
email with 100 embeddings, each of which is an email with 100
embeddings?
If an application hasn’t been tested with ridiculous-but-legal
inputs, no one really knows if it’s going to hold up in
production.
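As a small illustration, here is a corner-case sketch in hspec; `truncateLabel` is a hypothetical helper, and every input is unusual but perfectly legal.

```haskell
-- A corner-case sketch using hspec. `truncateLabel` is a hypothetical
-- helper; the inputs are deliberately extreme but legal.
import Test.Hspec

-- Hypothetical: shorten a label to at most 80 characters for display.
truncateLabel :: String -> String
truncateLabel = take 80

main :: IO ()
main = hspec $ describe "truncateLabel" $ do
  it "handles the empty string" $
    truncateLabel "" `shouldBe` ""
  it "handles an absurdly long input" $
    length (truncateLabel (replicate 1000000 'x')) `shouldBe` 80
  it "does not mangle non-Latin scripts" $
    truncateLabel "注文" `shouldBe` "注文"
```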
Randomized testing: never saw that before!
No human team can test every possible combination of cases and
actions. And that may be okay, because many projects find more bugs
per unit of effort through randomized generation of test
cases than any other way.
This means writing scripts that start with well-understood
inputs, and then letting them make random, arbitrary changes to
these inputs and run again, then change and run again, many
thousands of times. Even if it’s not realistic to test the outputs
for correctness (because the script may be unable to tell what its
crazy inputs were supposed to do), the outputs can be tested for
structural validity -- and the system can be watched for not
crashing, and not generating any admin alerts or unhandled errors
or side effects.
It’s downright surprising how fast you can find bugs in a
typical unsafe language (like Python or C or Java) through simple
mutation testing. Extremely safe languages like Haskell tend to
find these bugs at compile time, but it may still be worth trying
some randomized testing. Remember, machine time is cheap; holes in
deployed code are very expensive.
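In Haskell, a closely related approach is property-based testing with QuickCheck: rather than mutating captured inputs, it generates thousands of random inputs from scratch and checks a structural invariant on each one. Here is a minimal sketch, assuming a hypothetical `normalize` function under test.

```haskell
-- A randomized-testing sketch using QuickCheck. `normalize` is a
-- hypothetical function under test; the property checks a structural
-- invariant (idempotence) rather than exact expected outputs.
import Test.QuickCheck

-- Hypothetical: collapse runs of whitespace in user input.
normalize :: String -> String
normalize = unwords . words

-- Whatever random string we generate, normalizing twice must equal
-- normalizing once, and the function must never crash.
prop_idempotent :: String -> Bool
prop_idempotent s = normalize (normalize s) == normalize s

main :: IO ()
main = quickCheckWith stdArgs { maxSuccess = 10000 } prop_idempotent
```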
Load and performance testing: can it take the heat?
Companies with good devops say heavy load is one of the top
remaining sources of failure. The app works for a while, but fails
when peak user workload hits. Be on the lookout for conditions that
could overload your servers, and make sure someone is forcing them
to happen on the staging system -- before they happen in
production.
Consider whether your test and staging systems are similar
enough to your production system. If your production system accepts
5000 requests per second on 10 big machines, and your test system
accepts 5 per second on one tiny VM, how will you know about
database capacity issues or network problems?
A good practice is to throw enormous, concurrent, simulated load
at your test system that (1) exceeds any observed real-world load
and (2) includes a wide mix of realistic inputs, perhaps a stream
of historic real captured inputs as well as random ones. This
reduces the chance that you threw a softball at the system when
real users are going to throw a hardball.
Performance testing can include sending faster and faster inputs
until some hardware resource becomes saturated. (You may enjoy
watching system monitor screens as this is happening!) Find the
bottleneck -- what resource can you expect to fail first in
production? How will you prevent it? Do your deployment scripts
specify an abundance of this resource? Have you implemented cloud
auto-scaling so that new parallel servers are fired up when the
typically scarce resource (CPU, RAM, network link, …) gets too
busy?
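In practice you would usually reach for a dedicated load tool (wrk, JMeter, Gatling), but even a small driver can generate concurrent load against staging. Here is a rough Haskell sketch, assuming the http-conduit and async packages; the URL and the worker and request counts are placeholders, not a realistic traffic model.

```haskell
-- A rough load-driver sketch, assuming the http-conduit and async
-- packages. The URL and the worker/request counts are placeholders,
-- not a realistic traffic model.
import Control.Concurrent.Async (replicateConcurrently_)
import Control.Monad (replicateM_)
import Network.HTTP.Simple (getResponseStatusCode, httpLBS, parseRequest)

main :: IO ()
main = do
  req <- parseRequest "http://staging.example.com/api/orders"
  -- 100 concurrent workers, 1000 requests each: push well past any
  -- observed production peak.
  replicateConcurrently_ 100 $
    replicateM_ 1000 $ do
      resp <- httpLBS req
      if getResponseStatusCode resp >= 500
        then putStrLn "server error under load"
        else pure ()
```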
Usability testing: it doesn't work if people can't use it
Most people consider this to be outside the realm of devops. Who
cares if users find your system confusing and hard to use? Well,
lots of people, but why should a devops person care?
What will happen to your production environment if a new feature
is deployed and suddenly user confusion goes through the roof? Will
they think there’s a bug? Will support calls double in an hour, and
stay doubled? Will you be forced to do a rollback?
User interface design probably isn’t your job. But if you are
deploying user-facing software that has not been usability tested,
you’re going to hear about it. Encourage your colleagues to do real
testing of their UI on realistic, uninitiated users (not just team
members who know what the feature is supposed to do), or at least
skeptical test staff who know how to try naive things on purpose,
before declaring a new feature ready to deploy.
Security testing: before it's too late
One of the worst things you can do is to cause a major security
breach, leading to a loss of trust and exposure of users’ private
data.
Testing a major, public-facing, multi-server system (a
distributed app) for security is a big topic, and I’d be doing a
disservice by trying to summarize it in just a couple of
paragraphs. We’ll return in future posts to both the verification
and testing side, and the design and implementation side, of
security. A best practice is to push quality requirements upstream,
letting developers know that security is their concern too, and
ensuring that integration-test systems use a secure automated
deployment similar or identical to the production system. Don’t let
developers say “I assume you’ll secure this later.”
Meanwhile, as a devops best practice, your deployment and
operations team should have at least one identified security
expert, a person whose job includes knowing about all the latest
security test tools and ensuring that they are being used where
appropriate. Are you checking for XSS attacks and SQL injection
attacks? DDoS attacks? Port-scan attacks? Misconfigured default
accounts? It’s easy to neglect security until it’s too late, so
make someone responsible.
Security holes can appear in application code, or in the
platform itself (operating system, library packages, and
middleware). At a minimum, security testing should include running
standard off-the-shelf automated scanning software that looks for
known ways to intrude, most often taking advantage of poor default
configurations, or of platform components that have not been
upgraded to the latest patch level. Run an automated security scan
before moving a substantially changed system from staging into
production. Automated testing is almost free, in stark contrast to
the costly manual clean-up after a breach.
Conclusions
Wow, that’s a lot of testing! More than a lot of companies
actually do. Yet it’s quite hard to look back through that list and
find something that’s okay to omit. Clearly the devops pipeline
doesn’t begin at the moment of deployment, but sooner, in the
development team itself. That means developers and testers taking
responsibility for delivering a quality product that’s actually
ready to deploy.
As usual, systems thinking wins. A good operations team doesn’t
say “give us what you’ve got, and we’ll somehow get it online.” A
good operations team says “we offer deployment and operation
services, and here’s how you can give us software that will deploy
successfully into our production environment.”
We hope you’ll keep reading our blog and making use of what we
have learned. Thanks for spending time with FP Complete.