What are DORA metrics?
DORA metrics come from DevOps Research and Assessment (DORA), a research group that is now part of Google. The team surveyed thousands of development teams across multiple industries to understand what sets a high-performing team apart from a low-performing one. They settled on these four metrics:
The first metric is Deployment Frequency. This measures how often you ship changes to production. The second metric is Change Lead Time. This measures how long a change takes, from the moment a developer starts working on it until it reaches production. It's how long it takes you to deliver a new feature or fix. The third metric is Change Failure Rate. This is the ratio of unsuccessful deployments to total deployments. Finally, there's Mean Time to Recovery (MTTR). MTTR is the average time it takes your team to recover from an unhealthy situation, such as one caused by a bad change.
But ultimately, it's not about the metrics. It's about developers who want to improve their team's efficiency and who use metrics to know whether their improvement efforts are working. Remember: it's about the developers, not the metrics.
DORA metrics explained
DORA metrics are also known as Accelerate metrics, thanks to the popular book "Accelerate: The Science of Lean Software and DevOps" by Nicole Forsgren (founder of DORA), Jez Humble, and Gene Kim.
If you would rather watch a video than read, check out this 8-minute explainer video by Don Brown, Sleuth CTO and co-founder and host of Sleuth TV on YouTube. He explains what DORA metrics are and shares his recommendations on how to improve each of them.
Deployment Frequency
Deployment frequency is usually the first place teams start with DORA metrics. It measures how many times you change production. To deliver code to production quickly, the goal is to ship as often as possible. To make that work, you need to make the batch size as small as possible. In other words, ship as few changes to production at a time as you can.
A common misconception about deployment frequency is that shipping code to production more often creates more risk. The thinking goes that if a certain percentage of my changes to production fail or cause an incident, then making more changes means more incidents. Counterintuitively, it works the opposite way: the more often you change production with smaller changes, the better understood each of those changes is. When changes are well understood and small in scope, the risk of them going bad is lower. Think about it this way. If I ship a bunch of changes at once and something goes wrong, which one of those changes caused it? We have no idea. There are so many, and maybe 10, 20, or 100 developers were involved. But if I'm shipping code one change at a time and one of those changes fails, we know exactly what caused it, the developer is still around, and they can fix it. By making your batch size as small as possible and shipping as often as possible, you're actually reducing your overall risk.
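As a rough illustration, deployment frequency can be computed by counting production deployments per day. The sketch below assumes you already have a list of deployment timestamps. The helper names `deployments_per_day` and `average_daily_frequency` and the sample data are hypothetical, not part of any DORA tooling.

```python
from collections import Counter
from datetime import date, datetime


def deployments_per_day(deploy_times: list[datetime]) -> dict[date, int]:
    """Count production deployments for each calendar day."""
    return dict(Counter(t.date() for t in deploy_times))


def average_daily_frequency(deploy_times: list[datetime]) -> float:
    """Average deployments per day over the span from first to last deploy."""
    if not deploy_times:
        return 0.0
    days = (max(deploy_times).date() - min(deploy_times).date()).days + 1
    return len(deploy_times) / days


# Hypothetical sample data: three deploys on day one, one deploy on day two.
sample_deploys = [
    datetime(2024, 1, 8, 9, 30),
    datetime(2024, 1, 8, 13, 0),
    datetime(2024, 1, 8, 16, 45),
    datetime(2024, 1, 9, 11, 15),
]

print(deployments_per_day(sample_deploys))
print(average_daily_frequency(sample_deploys))  # 4 deploys over 2 days -> 2.0
```

Counting every calendar day in the span, including days with no deploys, is a deliberate choice here: quiet days should pull the average down.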
How to improve Deployment Frequency
In practice, you want to ship each pull request or individual change to production on its own. That works great for smaller teams, but it doesn't always work for a bigger team. For example, if you're a big team working on, say, a monolith, you can use a technique called a release train, where you ship to production at fixed intervals throughout the day. Again, your goal is to minimize the batch size as much as possible to reduce your overall risk and increase your deployment frequency. In short: ship smaller changes, more often.
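To make the release train idea concrete, here is a minimal sketch that assigns merged changes to fixed-interval departures. Everything in it is an assumption for illustration: the `assign_to_trains` helper, the two-hour interval, and the change IDs are hypothetical, and a real team would drive this from its CI/CD system.

```python
from datetime import datetime, timedelta


def assign_to_trains(merge_times, first_departure, interval):
    """Map each train's departure time to the changes that ride on it.

    A change boards the first train that departs at or after its merge time.
    """
    trains: dict[datetime, list[str]] = {}
    for change_id, merged_at in merge_times.items():
        if merged_at <= first_departure:
            departure = first_departure
        else:
            elapsed = merged_at - first_departure
            # Ceiling division: round up to the next scheduled departure.
            intervals_needed = -(-elapsed // interval)
            departure = first_departure + intervals_needed * interval
        trains.setdefault(departure, []).append(change_id)
    return trains


# Hypothetical merges on one day, with trains leaving every two hours from 10:00.
merges = {
    "pr-1": datetime(2024, 1, 8, 9, 15),
    "pr-2": datetime(2024, 1, 8, 10, 0),
    "pr-3": datetime(2024, 1, 8, 10, 30),
    "pr-4": datetime(2024, 1, 8, 11, 59),
    "pr-5": datetime(2024, 1, 8, 13, 0),
}
schedule = assign_to_trains(merges, datetime(2024, 1, 8, 10, 0), timedelta(hours=2))

for departure, changes in sorted(schedule.items()):
    print(departure.strftime("%H:%M"), changes)
```

Shortening the interval shrinks the number of changes on each train, which is the batch-size reduction the section is after.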
Change Lead Time
The key to Change Lead Time is understanding what it is made of. Change Lead Time, as defined in DORA metrics, is measured from the moment the developer starts working on a change to the moment it is shipped to production. But you can break that time down into buckets. For example, the time a developer spends working on the change is one bucket. The time your deployment process takes to push a change all the way out to production is another. By looking at things in buckets, you can see what takes the most time and work on optimizing that. Change Lead Time is a really important metric for your company, because it measures how quickly your team is able to respond to changing conditions, events, or needs. For example, if your customer hits a bug, how quickly can your team create a fix and roll that fix all the way out to production? Or if you need a new feature or a small improvement, how quickly can you deliver that? Companies that can deliver changes quickly tend to be more successful than companies that take two to three months to get any kind of change out to production.
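The bucket breakdown described above can be sketched in a few lines. The stage names and timestamps below are hypothetical. Your own pipeline will likely have different stages, and the `lead_time_buckets` helper is only an illustration.

```python
from datetime import datetime

# Hypothetical timeline for a single change; the stage names are assumptions.
change_timeline = {
    "first_commit": datetime(2024, 1, 8, 9, 0),
    "pull_request_opened": datetime(2024, 1, 8, 15, 0),
    "pull_request_merged": datetime(2024, 1, 9, 11, 0),
    "deployed_to_production": datetime(2024, 1, 9, 12, 0),
}


def lead_time_buckets(timeline):
    """Split total lead time into hours spent between consecutive stages."""
    stages = sorted(timeline.items(), key=lambda item: item[1])
    buckets = {}
    for (start_name, start), (end_name, end) in zip(stages, stages[1:]):
        hours = (end - start).total_seconds() / 3600
        buckets[f"{start_name} -> {end_name}"] = hours
    return buckets


buckets = lead_time_buckets(change_timeline)
total_hours = sum(buckets.values())
slowest_bucket = max(buckets, key=buckets.get)

for name, hours in buckets.items():
    print(f"{name}: {hours:.1f}h")
print(f"total lead time: {total_hours:.1f}h, slowest bucket: {slowest_bucket}")
```

In this made-up example the change waits 20 hours between opening and merging the pull request, so review, not coding or deployment, would be the first bucket to optimize.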
How to improve Change Lead Time
In practice, I've found that the biggest bucket in Change Lead Time is testing. Teams often have testing as a separate step in the release process, which adds days or even weeks to change lead time. Instead of treating it as a separate action, integrate testing into your development process. Have your testers teach your developers how to write automated tests from the beginning so that you don't need a separate step. To improve your Change Lead Time, eliminate bottlenecks.
Change Failure Rate
Change Failure Rate is simply the ratio of the number of failures to the number of deployments. The key here is to define what counts as a failure. That definition will be unique to you, your team, and your service. In fact, it will probably change over time as your team improves. A common mistake is to look at the total number of failures instead of the change failure rate. The problem is that this encourages the wrong type of behavior. Our goal is to ship changes quickly, and if you're only looking at the total number of failures, your natural response is to reduce the number of deployments so that you have fewer incidents. The problem with this, as we mentioned earlier, is that each deployment then contains so many changes that the impact of a failure, when it does happen, is going to be high, which results in a worse customer experience. What you want is for a failure, when it happens, to be so small and so well understood that it's not a big deal.
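A small worked example shows why the rate matters more than the raw failure count. The two teams and their numbers below are made up for illustration, and `change_failure_rate` is a hypothetical helper.

```python
def change_failure_rate(deployments: int, failed_deployments: int) -> float:
    """Return failed deployments as a fraction of all deployments."""
    if failed_deployments > deployments:
        raise ValueError("failed deployments cannot exceed total deployments")
    if deployments == 0:
        return 0.0
    return failed_deployments / deployments


# Hypothetical numbers for one month.
# Team A ships small changes often; Team B ships large batches rarely.
team_a_rate = change_failure_rate(deployments=200, failed_deployments=10)
team_b_rate = change_failure_rate(deployments=20, failed_deployments=4)

print(f"Team A: 10 failures, rate {team_a_rate:.0%}")
print(f"Team B: 4 failures, rate {team_b_rate:.0%}")
```

Judged by total failures alone, Team B looks better (4 versus 10). Judged by change failure rate, Team A is clearly ahead, since a far smaller share of its deployments go wrong.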
How to improve Change Failure Rate
In practice, the key here is to get the developer involved in production, ideally by doing the deployment. When there is a failure, you want the developer to be involved in production so that they understand the impact of their change and its failure and can learn from it. This creates a critical feedback loop that helps the developer make sure this type of incident doesn't happen again.
Mean Time to Recovery (MTTR)
The last of the four DORA metrics is MTTR, or Mean Time to Recovery. Recovery is just one part of the incident response process. First, you need to detect that there's a problem at all. Once you've detected it, how quickly can you ship a fix? That is what MTTR focuses on.
The time to detection is a metric in itself, typically known as MTTD or Mean Time to Discovery. If you can detect a problem immediately, you can take MTTD down to practically zero, and since MTTD is part of the calculation for MTTR, improving MTTD helps you improve MTTR.
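Here is a minimal sketch of how MTTD feeds into MTTR. It assumes the recovery clock starts when the incident begins, so detection time is included in recovery time, as the paragraph above describes. The `Incident` record, the `mean_delta` helper, and the sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when production became unhealthy
    detected: datetime  # when the team noticed
    resolved: datetime  # when production was healthy again


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical incidents.
incidents = [
    Incident(datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 10), datetime(2024, 1, 8, 10, 40)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 15, 20)),
]

mttd = mean_delta([i.detected - i.started for i in incidents])
# Assumption: recovery is measured from the start of the incident, not from detection.
mttr = mean_delta([i.resolved - i.started for i in incidents])

print(f"MTTD: {mttd}, MTTR: {mttr}")
```

With these numbers, a third of the recovery time is spent before anyone knows there is a problem, so faster detection alone would lower MTTR.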
If you focus only on improving MTTR and none of the other metrics, you'll often end up with quick, dirty hacks to get the system up and running again. But often, those hacks end up making the incident even worse. This is why it's critical that your team has a culture of shipping lots of changes quickly, so that when an incident happens, shipping a fix quickly is natural. It's what they do anyway. That way, the incident won't get any worse.
How to improve MTTR
In practice, try feature flags. Feature flags are toggles that let you turn a change on or off in production with the click of a button. If a change causes an incident, you can turn it off and bring your MTTR down to seconds.
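The idea can be sketched with a tiny in-memory flag store. This is an illustration only: the `FeatureFlags` class, the `new_discount_engine` flag, and the `checkout_total` function are hypothetical, and production systems typically keep flags in a dedicated service or database so they can be flipped without a deploy.

```python
class FeatureFlags:
    """A tiny in-memory flag store, for illustration only."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so unreleased code paths stay dark.
        return self._flags.get(name, False)


flags = FeatureFlags()


def checkout_total(prices: list[float]) -> float:
    """Total an order, using the new discount logic only when the flag is on."""
    if flags.is_enabled("new_discount_engine"):
        return round(sum(prices) * 0.9, 2)  # new behavior behind the flag
    return round(sum(prices), 2)            # existing, known-good behavior


flags.set("new_discount_engine", True)
print(checkout_total([10.0, 20.0]))  # new code path is live

# Incident response: flip the flag off instead of shipping a rollback.
flags.set("new_discount_engine", False)
print(checkout_total([10.0, 20.0]))  # back to the old behavior
```

Because the old code path stays in place behind the flag, recovery is a configuration change rather than a new deployment.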
The bottom line
Look, the bottom line is: don't focus on metrics. It's not about the metrics, it's about your team and its goals. Metrics are how your team knows how well it's progressing toward those goals. The key is to remember that it's really all about your developers. Empower them. Give them the tools they need to succeed, because your developers are the ones best placed to make the changes that help your team reach its goals.
Of course, choosing the right metrics matters, and we advocate using DORA metrics, or Accelerate metrics, because DORA's research links them to software delivery performance. Sleuth is a tool that helps your team track and improve on DORA metrics. If you're curious about how Sleuth compares with other metrics trackers on the market, check out this detailed comparison guide.