What is Mean Time to Recovery (MTTR)?

July 15, 2021

MTTR, or mean time to recovery, is just one step in the incident response process. First, you need to detect there's even a problem. Once you've detected it, how quickly can you ship a change out? This is what MTTR focuses on.

If you just focus on improving MTTR and none of the other ones, you'll often create these dirty, quick, ugly hacks to try to get the system up and going again. But often, those hacks will actually end up making the incident even worse. This is why it's critical go that your team has a culture of shipping lots of changes quickly, so that when an incident happens, shipping it fixed quickly is natural. It's what they do anyways. And that way, the incident won't get any worse.

Technically, try feature flags. Feature flags are toggles that allow you to turn a change on or off in production with a click of a button so that if you have an incident with the chain, you can click a button, turn it off, and reduce your MTTR down to seconds.