Improving Software Failure: Measure, Change, Learn

How do you treat software development failure? Do you take time to measure and learn from software failure? Or do you try to fix it quickly only after your customers complain about it?

Failure can be an opportunity to learn and get better. So how can you measure and learn from software failure, and turn failure into at least a partially positive experience?

Failure happens all the time, but if you're not measuring it, how do you know what you’re missing? Check out the video below to hear how Don Brown, Sleuth’s CTO and co-founder, learns from software failure.

If you really want to tackle software failure on your team, you need to work together. The developers, the designers, the project managers, the management, the ops — they all need to be able to work together to take something that is very expensive and not only mitigate it, but actually turn it into a positive thing.

Why software failure is a good thing

Failure happens regardless of whether you want it to or not, so you might as well turn it into a good thing that you can learn from and make it less impactful as time goes on. A job well done is a customer being satisfied, and that requires a team effort. So how do we do that? How do we take that bad thing and turn it into a good thing?

Here are three steps that can help you learn from software failure.

Step 1: Measure software failure

You may be great at making changes, but if you're not measuring their effect, you don't really know what you did. Anyone who's ever done any kind of performance optimization knows that first you need to measure how bad the problem is. Then, you need to figure out what you're going to do to make it better. You make a change, measure that change, and keep measuring to make sure you're improving. It’s a continuous process, not something you do once and you're done.

If you jump into a problem and make changes, you might be making the wrong changes. In fact, you might even be making it worse and you'll have no idea that you're making it worse. So start with measuring if you want to learn from software failure.

You also have to train your team on software failure measurements. Just because you have data doesn’t mean you know how to interpret or use it. You also can’t rely on one single measurement of failure.

For example, if you’re measuring uptime, you know when your homepage responds, but what about the other pages? How quickly are they responding? Are major features of your application working the way people expect? Are you experiencing data corruption? There are all these different possibilities of software failure.

You also have to consider that there are different audiences for failure, and you should be measuring different things for those different audiences. Two examples:

DORA metrics

DORA metrics give you a general measure of how you're doing with software development failure. They provide a baseline: you measure your failure, see how you compare to other teams in the industry, and report that up to management. DORA metrics are essentially a measurement of engineering efficiency; they measure the output speed and quality of your engineering team.

There are four DORA metrics, and two of them require measuring software failure:

Deployment Frequency: how often you deploy
Change Lead Time: when you deploy, how long it takes to go from first commit to deploying
Time to restore service, also known as Mean Time to Recovery: when you fail in production, how long it takes to restore health
Change Failure Rate: when you deploy something to production, the percentage of those deployments that fail
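As a rough illustration of the two failure-related metrics, here is a minimal Python sketch. The Deployment record and its field names are hypothetical, and the sample data is made up; in practice these values would come from whatever deployment and incident tooling you already use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    deployed_at: datetime
    failed_at: Optional[datetime] = None    # when the failure began, if any
    restored_at: Optional[datetime] = None  # when service was healthy again


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of deployments that led to a failure in production."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.failed_at is not None)
    return 100.0 * failed / len(deployments)


def mean_time_to_recovery(deployments: list[Deployment]) -> Optional[timedelta]:
    """Average time from failure start to restored service.

    Returns None when no failed deployment has been restored yet.
    """
    durations = [
        d.restored_at - d.failed_at
        for d in deployments
        if d.failed_at is not None and d.restored_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Made-up sample: four deployments, two of which failed.
deployments = [
    Deployment(datetime(2024, 1, 1, 9, 0)),
    Deployment(datetime(2024, 1, 2, 9, 0),
               failed_at=datetime(2024, 1, 2, 9, 10),
               restored_at=datetime(2024, 1, 2, 9, 40)),
    Deployment(datetime(2024, 1, 3, 9, 0)),
    Deployment(datetime(2024, 1, 4, 9, 0),
               failed_at=datetime(2024, 1, 4, 9, 5),
               restored_at=datetime(2024, 1, 4, 10, 35)),
]

cfr = change_failure_rate(deployments)
mttr = mean_time_to_recovery(deployments)
print(f"Change Failure Rate: {cfr:.0f}%")
print(f"Mean Time to Recovery: {mttr}")
```

The hard part is not the arithmetic but deciding what counts as a failed deployment, which is why the levels of measurement discussed later matter.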

Error budgets: a tactical tool

Let's also talk about error budgets. They're an advanced strategy, so they're not for every team, but they're something to be aware of.

Step one with error budgets is a service level agreement (SLA). If you're a team that develops software that's used by customers, then you'll probably have some SLAs built into your contracts that outline uptime, response times, and so on. To meet your SLA, you need to go to step two.

Step two is service level objectives (SLOs), which are objectives for your team (not something you share with customers). They’re usually stricter than your SLA. For example, if your SLA allows one day of downtime a month, your SLO might allow only one day per six months. Your SLO should be a realistic number, but one that aims for a tighter target than the SLA.

How do you know if you’re meeting your objective? That’s where service level indicators (SLIs) come in. These are the metrics that inform the objective. So, if the objective is a certain amount of uptime, the indicators might be your homepage uptime, your dashboard uptime, your admin interface uptime, etc. These are individual measures that combine to tell you whether you're meeting your objective.
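To make the "budget" part concrete: an error budget is commonly computed as the amount of unreliability your SLO tolerates over a window. Here is a minimal sketch of that arithmetic. The 99.9% target, the 30-day window, the SLI names, and the downtime figures are all illustrative assumptions, and summing downtime across SLIs assumes the outages did not overlap.

```python
from datetime import timedelta

# Illustrative numbers only; they are not taken from any real contract.
WINDOW = timedelta(days=30)
SLO_TARGET = 0.999  # internal objective: healthy for 99.9% of the window

# Hypothetical SLIs: downtime observed per surface during the window.
observed_downtime = {
    "homepage": timedelta(minutes=12),
    "dashboard": timedelta(minutes=9),
    "admin": timedelta(minutes=4),
}


def error_budget(window: timedelta, slo_target: float) -> timedelta:
    """Downtime the SLO tolerates over the window."""
    return window * (1 - slo_target)


def budget_remaining(window: timedelta, slo_target: float,
                     downtime_by_sli: dict[str, timedelta]) -> timedelta:
    """Budget left after subtracting observed downtime.

    Assumes outages on different surfaces did not overlap, so their
    durations can simply be added together.
    """
    spent = sum(downtime_by_sli.values(), timedelta())
    return error_budget(window, slo_target) - spent


budget = error_budget(WINDOW, SLO_TARGET)
remaining = budget_remaining(WINDOW, SLO_TARGET, observed_downtime)
print(f"Budget: {budget}, remaining: {remaining}")
if remaining < timedelta(0):
    print("Error budget exhausted: time to favor stability work.")
```

With these numbers the budget is roughly 43 minutes for the month, of which 25 have been spent. How you combine SLIs into one objective is a team decision; simple addition is just the easiest starting point.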

What should you measure to learn from software failure?

You can’t take action on everything, so let’s look at different types of things to measure and how useful they can be.

At the top level, look at incident management systems, like FireHydrant or just Jira issues. It’s where you record when something is broken. These are generally user-defined failures.
If you want to get more detailed, then you start to look at alerts from a tool like PagerDuty, which will tell you when a threshold was reached. These are automated.
Next, you have more tactical metrics. These are things like your error rate in Sentry, which is a way to measure how many errors you're having in a particular time, either from your back end or your front end.
And finally, there are the raw metrics themselves, from something like Datadog or Prometheus, where you're looking at CPU, memory usage, and network latency.
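To show how an error rate turns into a threshold alert, here is a small, tool-agnostic sketch. ErrorRateMonitor is a hypothetical helper written for this example; it is not the API of Sentry, PagerDuty, or any other product, and the 60-second window and 5% threshold are arbitrary.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateMonitor:
    """Tracks request outcomes in a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float, threshold: float) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold  # e.g. 0.05 means alert above 5% errors
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: float, is_error: bool) -> None:
        """Store one request outcome and drop outcomes older than the window."""
        self._events.append((timestamp, is_error))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        """Fraction of requests in the current window that were errors."""
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)

    def threshold_reached(self) -> bool:
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window_seconds=60, threshold=0.05)
# 100 requests over 50 seconds; every 10th one fails.
for i in range(100):
    monitor.record(timestamp=i * 0.5, is_error=(i % 10 == 0))

print(f"error rate: {monitor.error_rate():.0%}")
print(f"threshold reached: {monitor.threshold_reached()}")
```

An alert like this is automated and specific, which is what makes it more actionable than a bare incident ticket.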

So when you measure software failure, you should have some way to start at the top, when an incident is happening. Then, you should go farther and farther down to round out that understanding of software failure.

When you apply these measurements of failure to a metric, like your Change Failure Rate, each of these software failure levels will be less predictable than the one before it. If you can tie a failure back to incidents, that will be easy to understand. But at the other end of the spectrum, you might have a metric where your CPU jumps around a lot, and that’s less predictable.

On the other hand, predictability has an inverse relationship with actionability: the less predictable levels tend to be the more actionable ones. If a developer’s change fails, and all you tell them is that there’s an incident, they're not really going to know what to do. They have to do a bunch of digging. But if you tell them they hit a threshold, that developer knows exactly what to look into and what to start working on, especially if it was related to a change that they just made.

So, if you're thinking about failure, think about it from those two dimensions. One is how predictable it is, how easily you can turn it into metrics so that you can track it. And the second is how actionable it is, and that's where we'll get into our next step: how to learn from software failure and get better over time.

Step 2: How do you learn from software failure?

How do you learn from software development failure and take action to improve? There are different lessons for each stage of handling software failure.

Bring developers together. If you're dealing with incidents and developers aren't connected to them, bring them together on call. If your developers are only responsible for merging their code and walking away, and they don't understand the impact of their changes, then they're missing out on lessons to learn from software failure.

You can do a weekly review of your service level indicators. Choose three to five things to focus on, and talk about how they went that week. If you see jumps in metrics, you can discuss what caused them and how to improve the numbers. Do it in a blameless environment.

Have a postmortem. You had a bad thing happen. You worked hard to get it better. This is a great opportunity to learn from software failure, with three criteria:

It has to be blameless. No one is at fault. Pointing fingers doesn’t prevent the incident from happening again. Work together to find a way to prevent it from happening again.
Make sure it's structured. Use a template that forces you to think not just about what needs fixing, but also about the context of that fix.
Have a timeline. Record everything in UTC and go down to the millisecond of what happened and when, so that you can figure out when things were detected, when things were improved, and how things improved.
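A postmortem timeline can be as simple as a list of named events with precise timestamps. The sketch below assumes every timestamp is recorded in UTC with millisecond precision so that events from different systems line up; the event names and times are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incident timeline. Every timestamp carries an explicit
# +00:00 (UTC) offset and millisecond precision.
timeline = [
    ("deployed", "2024-03-05T14:02:11.250+00:00"),
    ("failure_started", "2024-03-05T14:03:40.000+00:00"),
    ("detected", "2024-03-05T14:09:40.500+00:00"),
    ("mitigated", "2024-03-05T14:31:05.125+00:00"),
    ("resolved", "2024-03-05T15:03:40.000+00:00"),
]


def parse_timeline(entries: list[tuple[str, str]]) -> dict[str, datetime]:
    """Parse ISO 8601 strings, rejecting non-UTC or out-of-order entries."""
    parsed: list[tuple[str, datetime]] = []
    for name, stamp in entries:
        moment = datetime.fromisoformat(stamp)
        # utcoffset() is None for naive timestamps, so those fail too.
        if moment.utcoffset() != timedelta(0):
            raise ValueError(f"{name}: timestamp must be in UTC")
        if parsed and moment < parsed[-1][1]:
            raise ValueError(f"{name}: timeline is out of order")
        parsed.append((name, moment))
    return dict(parsed)


events = parse_timeline(timeline)
time_to_detect = events["detected"] - events["failure_started"]
time_to_restore = events["resolved"] - events["failure_started"]

print(f"Time to detect:  {time_to_detect}")
print(f"Time to restore: {time_to_restore}")
```

Separating time to detect from time to restore tells you whether the next improvement belongs in your alerting or in your recovery process.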

This is where you look at what happened and ask why it happened until you get down to the root issue. You can come up with a whole list of things you need to do after an incident, but it’s best to have one priority action: one thing that you are going to do right away.

Step 3: Make changes to improve software failures

There is value in measuring, even if you don't make improvements, but you won’t be getting all you can out of the whole experience. So, you measured your software failure. You learned from it, and now you’re going to make changes. This is where you apply all of your learnings.

Performance and stability are features. They aren’t something that you do off on the side. They're something you need to make time for. And this is a cultural thing. You need a culture where stability and performance are core parts of the work, right along with feature delivery.

Here are three techniques to learn from software development failure and actually make improvements.

Give developers some slack. They’re on the front lines and know best what to do, so give them time to identify their task to improve software development efficiency and prevent failure.
Prioritize incidents. If you're going to make improvements to prevent failure, then you need to have a culture where people are empowered to make changes themselves.
Automate, automate, automate. The fewer mistakes you have the opportunity to make, the better. The more you automate your workflow and use tooling to help, the better off you’ll be.

So you measured, you learned, you changed. Then, you go back to measuring, and the process keeps going, because this is not something you solve once. It's a continual process that you need to keep going to get better and better. It might hurt, and if it does, do it more.

If learning from a software incident was painful, if doing the postmortem was painful and awkward, do it more and more because your goal is to get through that cycle faster and faster. And that's where the other two DORA metrics come into play — Deployment Frequency and Change Lead Time. It's all about the speed. The faster you can get through that loop, the more you'll learn.
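For completeness, here is a sketch of the two speed metrics. The (first commit, deployed) pairs are made up, and using the mean for lead time is just one option; many teams prefer a median or a percentile.

```python
from datetime import datetime, timedelta

# Hypothetical (first_commit_at, deployed_at) pairs from one 7-day period.
changes = [
    (datetime(2024, 4, 1, 10, 0), datetime(2024, 4, 1, 16, 0)),  # 6 hours
    (datetime(2024, 4, 2, 9, 0), datetime(2024, 4, 3, 9, 0)),    # 24 hours
    (datetime(2024, 4, 3, 13, 0), datetime(2024, 4, 3, 15, 0)),  # 2 hours
    (datetime(2024, 4, 4, 8, 0), datetime(2024, 4, 5, 12, 0)),   # 28 hours
]
PERIOD_DAYS = 7


def deployment_frequency(changes: list[tuple[datetime, datetime]],
                         period_days: int) -> float:
    """Average number of deployments per day over the period."""
    return len(changes) / period_days


def mean_change_lead_time(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time from first commit to deployment."""
    lead_times = [deployed - first_commit for first_commit, deployed in changes]
    return sum(lead_times, timedelta()) / len(lead_times)


frequency = deployment_frequency(changes, PERIOD_DAYS)
lead_time = mean_change_lead_time(changes)
print(f"Deployment Frequency: {frequency:.2f} per day")
print(f"Change Lead Time: {lead_time}")
```

Tracking these alongside the failure metrics shows whether you are getting through the measure, learn, change loop faster over time.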

Remember that software development failure will happen. You can't pretend it won’t. Instead of fearing failure, embrace it and learn from it.