In this 8-minute explainer video, Don Brown, Sleuth CTO and Co-founder and host of Sleuth TV on YouTube, explains what DORA / Accelerate metrics are and shares his recommendations on how to improve on each of them. Watch or read the transcript below.
Let's talk DORA metrics. Developers, developers, developers. It's all about developers. Let me start there. The DORA metrics come from an organization called DevOps Research and Assessment. This is a research team, now part of Google, that surveyed thousands of development teams across multiple industries to try to understand what makes a high performing team different from a low performing team. What they ended up settling on are these four metrics. What are the metrics?
Well, the first metric is deployment frequency. This measures how often you get a change into production. The second metric is change lead time. This measures how long a change takes, starting from when the developer works on it all the way until it gets into production - so it's how quickly can you deliver a new feature or fix? The third metric is the change failure rate. This looks at the ratio between how many of your deployments are unsuccessful and how many times you've deployed in total. Finally, there's mean time to recovery. Mean time to recovery is the average time it takes your team to recover from an unhealthy situation. If your production is unhealthy because of a bad change or whatever, how long on average does it take your team to recover from that?
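To make the four definitions concrete, here is a minimal sketch in Python that computes them from a list of deployment records. The `Deployment` record, its field names, and the `dora_metrics` function are hypothetical and made up for this illustration. It also assumes recovery time is measured from the moment of the failed deployment, which is a simplification; your team may start the clock when the incident is detected instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record; field names are assumptions for this sketch.
    work_started: datetime                 # developer began the change
    deployed: datetime                     # change reached production
    failed: bool = False                   # did this deploy count as a failure?
    recovered: Optional[datetime] = None   # production healthy again


def dora_metrics(deployments, period_days):
    """Return the four metrics for deployments made within period_days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    total = len(deployments)
    lead_times = [d.deployed - d.work_started for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: recovery is timed from the failed deployment itself.
    recoveries = [d.recovered - d.deployed for d in failures if d.recovered]
    return {
        "deployment_frequency_per_day": total / period_days,
        "mean_change_lead_time": sum(lead_times, timedelta()) / total,
        "change_failure_rate": len(failures) / total,
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Made-up sample data: four deployments over two days, one of which failed.
day = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(day, day + timedelta(hours=1)),
    Deployment(day, day + timedelta(hours=2)),
    Deployment(day, day + timedelta(hours=3), failed=True,
               recovered=day + timedelta(hours=3, minutes=30)),
    Deployment(day, day + timedelta(hours=2)),
]
metrics = dora_metrics(sample, period_days=2)
```

With this sample data, the result is two deployments per day, a mean lead time of two hours, a change failure rate of 0.25, and a mean time to recovery of 30 minutes.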
But technically, it's not about the metrics. It's about the developers. It's about the developers wanting to improve their team's efficiency and using metrics to know whether they're successful in their improvement efforts. Remember, it's about developers, not the metrics.
Frequency is very important. In fact, it's usually the first place teams start. What you're doing is you're measuring how many times you change production. The goal of delivering code quickly to production is to ship as many times as possible. In order to make that work, you need to change the batch size to be as small as possible. In other words, ship as few changes to production at a time as you can.
A common misconception about deploy frequency is that by shipping code to production more often, you're actually creating more risk. The thinking goes that if a certain percentage of my changes to production fail or cause an incident, then if I do it more, I'm going to have more incidents. But counterintuitively, it works the exact opposite way: the more you're changing production with smaller changes, the better understood each of those changes is. When those changes are understood and they're small in scope, the risk of those going bad is less. Think about it this way. If I ship a bunch of changes at once and something goes wrong, which one of those changes caused it? We have no idea. There are so many. Maybe it has 10, 20, 100 developers involved. But if I'm shipping code one change at a time and one of those things fails, we know exactly what caused it, the developer is around, and they can fix it. By changing your batch size to be as small as possible and shipping as often as possible, you're actually reducing your overall risk.
Technically, what you want to do here is ship each pull request or individual change to production one at a time. That works great for smaller teams, but it doesn't always work for a bigger team. For example, if you're a big team on, say, a monolith, what you want to do is use a technique called a release train, where you ship to production at fixed intervals throughout the day. Again, your goal is to minimize the batch size as much as possible to reduce your overall risk and increase your deployment frequency. Again, ship more, smaller.
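As a rough illustration of the release train idea, the sketch below groups merged changes onto the next fixed departure time. The `release_trains` function, the PR names, and the two-hour interval are all hypothetical choices for this example, not a prescribed schedule.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def release_trains(merge_times, day_start, interval):
    """Group merged changes by the next fixed departure after their merge."""
    trains = defaultdict(list)
    for change, merged_at in merge_times.items():
        # Whole intervals already elapsed; the change boards the next one.
        # A change merged exactly at a departure time waits for the following train.
        slot = (merged_at - day_start) // interval + 1
        trains[day_start + slot * interval].append(change)
    return dict(sorted(trains.items()))


# Made-up data: trains leave every two hours starting from 09:00.
day_start = datetime(2024, 1, 1, 9, 0)
merges = {
    "pr-1": datetime(2024, 1, 1, 9, 30),
    "pr-2": datetime(2024, 1, 1, 10, 45),
    "pr-3": datetime(2024, 1, 1, 12, 10),
}
trains = release_trains(merges, day_start, timedelta(hours=2))
batch_sizes = {departure: len(changes) for departure, changes in trains.items()}
```

Shortening the interval is the lever here: the more often a train leaves, the fewer changes ride on each one, which is the small batch size the talk is after.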
The key to change lead time is to understand what composes change lead time. Change lead time is measured from the moment the developer starts working on a change to the moment that it's shipped to production. But you can actually break that time down into buckets. For example, the time a developer's working on the change, that's one bucket. Or the time that your deployment process takes to push a change all the way out to production is another bucket. By looking at things in buckets, you can see what takes the most time and work on optimizing that. Change lead time is a really important metric for your company, because what it's doing is measuring how quickly your team is able to respond to changing conditions, events, or needs. For example, let's say your customer hits a bug, how quickly can your team create a fix and roll that fix all the way out to production? Or if you need a new feature or a small improvement, how quickly can you deliver that as well? A company that's able to deliver changes quicker tends to be more successful than a company that takes two to three months to get any kind of change out to production.
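The bucket idea can be sketched in a few lines of Python. The stage names (`work_started`, `code_complete`, `testing_done`, `deployed`) and the bucket names are assumptions for this example; use whatever milestones your own pipeline actually records.

```python
from datetime import datetime, timedelta

# Hypothetical milestones for one change, in the order they happen.
STAGES = ["work_started", "code_complete", "testing_done", "deployed"]
# Each bucket is the time between two consecutive milestones.
BUCKETS = ["coding", "testing", "deploy"]


def lead_time_buckets(timestamps):
    """Split one change's lead time into a duration per bucket."""
    return {
        bucket: timestamps[end] - timestamps[start]
        for bucket, start, end in zip(BUCKETS, STAGES, STAGES[1:])
    }


# Made-up timestamps for a single change.
change = {
    "work_started": datetime(2024, 1, 1, 9, 0),
    "code_complete": datetime(2024, 1, 1, 15, 0),
    "testing_done": datetime(2024, 1, 3, 15, 0),
    "deployed": datetime(2024, 1, 3, 16, 0),
}
buckets = lead_time_buckets(change)
slowest = max(buckets, key=buckets.get)         # the bucket to optimize first
total = sum(buckets.values(), timedelta())      # the full change lead time
```

In this made-up change, coding takes 6 hours and deploying takes 1 hour, but the change sits in the testing bucket for 48 hours, so that is where the optimization effort should go.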
Technically, I've found the biggest bucket is testing. Teams will often have testing as a separate step in the release process, which means that you add days or even weeks to your change lead time. Instead of having it as a separate action, integrate your testing into your development process. Have your testers teach your developers how to write automated tests from the beginning so that you don't need a separate step. To improve your change lead time, eliminate bottlenecks.
Change failure rate is simply the ratio of the number of failed deployments to the total number of deployments. But the key here is to define what is a failure. This will be unique to you, your team, and your service. In fact, it will probably change over time as your team improves. The common mistake is to simply look at the total number of failures instead of the change failure rate. The problem with this is it will encourage the wrong type of behaviors. Our goal here is to ship changes as quickly as possible, and if you're simply looking at the total number of failures, your natural response is to try to reduce the number of deployments so that you might have fewer incidents. The problem with this, as we mentioned earlier, is that each deployment then carries so many changes that the impact of a failure, when it does happen, is going to be high, which is going to result in a worse customer experience. What you want is for a failure, when it happens, to be so small and so well understood that it's not a big deal.
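Here is a small sketch of why the rate and the raw count can point in opposite directions. It takes change failure rate as failed deployments divided by total deployments. The two teams and their numbers are invented for illustration, and `change_failure_rate` is a hypothetical helper.

```python
def change_failure_rate(failed_deploys, total_deploys):
    """Failed deployments divided by total deployments."""
    if total_deploys <= 0:
        raise ValueError("total_deploys must be positive")
    if not 0 <= failed_deploys <= total_deploys:
        raise ValueError("failed_deploys must be between 0 and total_deploys")
    return failed_deploys / total_deploys


# Made-up numbers: (failed deployments, total deployments) per team.
teams = {
    "ships_often": (5, 100),
    "ships_rarely": (2, 4),
}

# Ranking by raw failure count rewards the team that barely deploys...
fewest_failures = min(teams, key=lambda name: teams[name][0])
# ...while ranking by rate rewards the team whose deploys rarely go wrong.
lowest_rate = min(teams, key=lambda name: change_failure_rate(*teams[name]))
```

Judged by count alone, `ships_rarely` looks better with 2 failures against 5. Judged by rate, `ships_often` is at 0.05 against 0.5, which is the comparison that doesn't push teams toward deploying less.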
Technically, the key here is to get the developer involved in production, ideally doing the deployment. What you want is that when there is a failure, the developer is involved in production so that they understand the impact of their change and their failure and can learn from it. That creates a critical feedback loop, so the developer ensures that this type of incident never happens again.
MTTR or mean time to recovery is just one step in the incident response process. First, you need to detect there's even a problem. Once you've detected it, how quickly can you ship a change out? This is what MTTR focuses on. If you just focus on improving MTTR and none of the other ones, you'll often create these dirty, quick, ugly hacks to try to get the system up and going again. But often, those hacks will actually end up making the incident even worse. This is why it's critical that your team has a culture of shipping lots of changes quickly so that when an incident happens, shipping a fix quickly is natural. It's what they do anyways. That way, the incident won't get any worse.
Technically, try Feature Flags. Feature Flags are toggles that allow you to turn a change on or off in production with a click of a button, so that if you have an incident with the change, you can click a button, turn it off, and reduce your MTTR down to seconds.
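A minimal sketch of the idea, assuming a simple in-memory flag store. The `FeatureFlags` class, the `new_checkout` flag name, and the two checkout functions are all hypothetical; a real setup would typically keep flags in a service or config store that can be changed without a deploy.

```python
class FeatureFlags:
    """Tiny in-memory flag store, standing in for a real flag service."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = bool(enabled)

    def is_enabled(self, name, default=False):
        # Unknown flags fall back to the default, so new code stays off.
        return self._flags.get(name, default)


flags = FeatureFlags()


def old_checkout(total):
    return f"old checkout: {total:.2f}"


def new_checkout(total):
    return f"new checkout: {total:.2f}"


def checkout(total):
    # The change ships to production behind the flag; the old path stays available.
    if flags.is_enabled("new_checkout"):
        return new_checkout(total)
    return old_checkout(total)


flags.set("new_checkout", True)      # turn the change on
with_flag_on = checkout(10)
flags.set("new_checkout", False)     # incident: turn it off, no redeploy needed
with_flag_off = checkout(10)
```

The point of the design is that recovery is a data change rather than a code change: flipping the flag routes traffic back to the old path right away, while the proper fix can ship through the normal process.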
Look, the bottom line is don't focus on metrics. It's not about the metrics, it's about your team and its goals. Metrics are how your team knows how well they're progressing towards those goals, so don't focus on the metric, focus on your team and its goals. The key here is to remember it's really all about your developers. Empower your developers. Give them the tools they need to succeed because your developers are going to be the ones who can make the best changes to help your team reach its goals.