

Dylan Etkin

June 28th, 2023

Notifications: don’t let silent disasters crush your dev team


Smart notifications and nudges are table-stakes tools for developers looking to streamline their work and stay focused on building improvements. These automatic alerts are key to a more efficient workflow, freeing us from the burden of repetitive, overwhelming, and time-critical tasks, also known as toil.

This third article in our engineering automation series explains how to use smart notifications and nudges to optimize your engineering team, protect developers' flow, and keep unseen disasters from rearing their ugly heads.

(ICYMI: check out how guardrails prevent teams from crashing.)

Automated notifications provide timely, relevant visibility

At a foundational level, notifications refer to messages sent to developers or other team members about events, issues or updates related to their code, the applications they’re developing, or the systems they’re maintaining.

Teams create automated notifications to bring quick visibility and attention to critical information, helping team members get in front of issues, improve something, or take actions they wouldn't have known to take without that information. Something happens, and that something is important enough to need the attention of one person or a group of people.

Think of text messaging. If somebody sends you a text, you receive a notification on your device that alerts you that someone asked for your attention and directs you to their message. If you didn’t receive a notification or if the notification came in on a device that you have no access to, then how will you know you have a text? You don't know what you don't know.

Teams can also set up automated smart nudges, which go beyond generalized notifications by providing context-sensitive prompts, ensuring that information is shared with the right people at the right time.

Think of a smart nudge as a timely text notification about a flight delay, allowing you to adjust your plans before heading to the airport. This targeted, relevant alert not only saves time but also enables you to take action to change the outcome.

Avoid notification fatigue for the most impact

It's essential to carefully determine what to share and how to distribute the notification. Promoting visibility fosters a culture of openness and collaboration, but it's vital to strike a balance to avoid notification fatigue. You don’t want to create a "broken windows" effect, where too much information becomes meaningless. If every window is broken, then it's more difficult to inspire care about the neighborhood. If a team member becomes overwhelmed, they may unsubscribe from a channel or decide it’s irrelevant to their work.
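One common way to keep volume in check is to suppress repeats of the same alert within a cooldown window, so a persistent problem produces one message instead of a flood. The Python sketch below illustrates that idea under stated assumptions; it is not how any particular tool does it, and `CooldownNotifier`, the alert key format, and the one-hour window are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CooldownNotifier:
    """Suppresses repeats of the same alert key within `cooldown_s` seconds."""

    cooldown_s: float
    # Hypothetical in-memory store: alert key -> time the last copy was sent.
    _last_sent: dict[str, float] = field(default_factory=dict)

    def should_send(self, key: str, now: float) -> bool:
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # an identical alert went out recently; stay quiet
        self._last_sent[key] = now
        return True


notifier = CooldownNotifier(cooldown_s=3600)
print(notifier.should_send("build-failed:main", now=0))     # True
print(notifier.should_send("build-failed:main", now=600))   # False: within the hour
print(notifier.should_send("build-failed:main", now=4000))  # True: cooldown elapsed
```

Keying on the kind of event (rather than on each raw message) is the design choice that matters here: it decides what counts as "the same" alert.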

Smart nudges raise visibility for the right people at the right time, targeting each notification where it will have the most impact and encouraging timely action. For instance, if a team has to achieve a goal within a certain timeframe, smart nudges can remind the relevant individuals to take action at the appropriate time.

Alerts from observability platforms are also critical. Consider the saying "if a tree falls in the forest and nobody's around to hear it, did it make a sound?" In computing, the answer is you bet it does! Unseen issues will eventually cause problems.

It is essential to gain visibility into potential issues early to allow for proactive remediation, rather than reactively fixing problems in a crisis in the middle of the night. Work smarter, not harder!

Notifications prevent silent disasters from erupting

In software development, we often learn the hard way about the importance of timely notifications and smart nudges. One particular instance comes to mind where we at Sleuth faced a massive infrastructure issue that exploded without prior warning, resulting in an all-hands-on-deck emergency situation at 3 a.m. The problem had been slowly and silently escalating until it became a major crisis.

Our initial approach to impact tracking was fairly naive. We collected data every two minutes and simply wrote it to the database. We had no alarms set up for a critical metric known as disk queue depth, which indicates how many input/output (I/O) requests are waiting to be served by our database's storage. As more and more data entered our platform, our naive approach slowly but surely chewed up more and more of our database's available IOPS (I/O operations per second).
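An alarm on a metric like disk queue depth doesn't need to be elaborate. Here is a minimal sketch, assuming samples arrive at the same two-minute interval described above. It fires once when several consecutive samples exceed a threshold, which ignores brief spikes but catches a sustained climb. `SustainedThresholdAlarm` and the threshold values are hypothetical. Many monitoring services offer a similar "consecutive periods above threshold" setting, so in practice you would likely configure this rather than write it.

```python
from collections import deque


class SustainedThresholdAlarm:
    """Fires once when `periods` consecutive samples exceed `threshold`."""

    def __init__(self, threshold: float, periods: int) -> None:
        self.threshold = threshold
        self.recent = deque(maxlen=periods)  # breach flags for the latest samples
        self.firing = False

    def observe(self, value: float) -> bool:
        """Record one sample; return True only on the transition into alarm."""
        self.recent.append(value > self.threshold)
        breached = len(self.recent) == self.recent.maxlen and all(self.recent)
        just_fired = breached and not self.firing
        self.firing = breached  # re-arms once the metric drops back down
        return just_fired


# Hypothetical values: one sample every two minutes, alarm after six minutes above 10.
alarm = SustainedThresholdAlarm(threshold=10.0, periods=3)
samples = [2, 4, 12, 15, 18, 20, 6]
print([i for i, s in enumerate(samples) if alarm.observe(s)])  # [4]
```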

After a slow, unmonitored burn, suddenly things went downhill quickly, necessitating an immediate fix that wasn't readily available. We needed about three weeks to alter how we stored data at scale. In the meantime, we constructed emergency measures, intermittently turning off certain features to allow the system to recover.

With a more proactive notification system in place, we could have identified this problem earlier, dealt with it during normal working hours, and saved ourselves from a "hair on fire" situation. Although we tried our best to shield our customers from the impacts of these issues, it wasn't an ideal way to operate. This was just one of many examples that show how essential smart notifications and visibility are to our work.

How devs use notifications to improve

Notifications can be instrumental in empowering developers to take full ownership of their work. Tools such as Sleuth offer deployment notifications that provide visibility and the right information for developers to successfully move their work from completion to deployment to customer acceptance.

One specific use case for notifications is deployment locking. These notifications give developers an understanding of the system's state, informing them when they can or cannot deploy.
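A deployment lock can be sketched as a small piece of shared state that announces every change to it. This is a minimal Python illustration, assuming a single process and a `notify` callback that stands in for a chat or email integration. `DeployLock` and its methods are hypothetical, not Sleuth's API.

```python
from typing import Callable, Optional


class DeployLock:
    """Hypothetical deploy lock that announces every state change."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self.notify = notify
        self.holder: Optional[str] = None
        self.reason = ""

    def lock(self, holder: str, reason: str) -> bool:
        if self.holder is not None:
            return False  # already locked; the caller has to wait
        self.holder, self.reason = holder, reason
        self.notify(f"Deploys LOCKED by {holder}: {reason}")
        return True

    def unlock(self, holder: str) -> bool:
        if self.holder != holder:
            return False  # only the current holder can release the lock
        self.holder, self.reason = None, ""
        self.notify(f"Deploys UNLOCKED by {holder}")
        return True

    def can_deploy(self) -> bool:
        return self.holder is None


messages = []
lock = DeployLock(notify=messages.append)
lock.lock("alice", "incident in progress")
print(lock.can_deploy())  # False
lock.unlock("alice")
print(lock.can_deploy())  # True
```

Because both transitions go through `notify`, developers learn when they can deploy again without polling.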

Another example is goal-oriented notifications that help reduce review lag time. By sending notifications to relevant team members about tasks they need to pick up, these alerts can streamline their work process and boost productivity.
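A review-lag nudge can be as simple as finding pull requests that have waited past a target and grouping them by reviewer, so each person gets one targeted message. The sketch assumes the PR data has already been fetched from your code host. `PullRequest`, `review_nudges`, and the 24-hour target are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PullRequest:
    title: str
    reviewer: str
    review_requested_at: datetime


def review_nudges(prs, now, max_wait=timedelta(hours=24)):
    """Group PRs that have waited longer than `max_wait` by reviewer."""
    nudges: dict[str, list[str]] = {}
    for pr in prs:
        if now - pr.review_requested_at > max_wait:
            nudges.setdefault(pr.reviewer, []).append(pr.title)
    return nudges


now = datetime(2023, 6, 28, 12, 0, tzinfo=timezone.utc)
prs = [  # hypothetical data
    PullRequest("Add retry to webhook sender", "dana", now - timedelta(hours=30)),
    PullRequest("Bump lint config", "dana", now - timedelta(hours=2)),
    PullRequest("Fix flaky deploy test", "li", now - timedelta(days=2)),
]
for reviewer, titles in review_nudges(prs, now).items():
    print(f"@{reviewer}, these reviews have waited over 24h: {', '.join(titles)}")
```

Grouping by reviewer means one message per person per run, which helps keep the nudge from becoming noise.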

Perhaps one of the most beneficial examples of how developers can leverage notifications to improve is through failure notifications. Linking system failures or issues directly to a developer's work can provide immediate insights when failure conditions are met after their work has been deployed. Traditionally, there's a disconnect here as developers often assume their work is fine as long as the system doesn't immediately fail. However, many changes can bring a system close to its breaking point without causing immediate catastrophic failure.

Notifying the developer who implemented that change at the moment the system's strain becomes apparent can be a game-changer. At that moment, the developer still has context and can take immediate corrective action, steering the system away from the breaking point. This level of proactive management can significantly improve how developers work, making the overall development process more efficient and rounding out the definition of done to include landing work safely.
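The attribution step is the hard part of failure notifications. The sketch below makes a deliberately simple assumption: when a health metric crosses its threshold, the most recent deploy before the alert is the suspect, and its authors are notified. Real systems often need a richer attribution model. `Deploy`, `failure_notifications`, and the example metric and values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Deploy:
    deployed_at: datetime
    authors: list[str]
    changes: list[str]


def failure_notifications(deploys, alert_at, metric, value, threshold):
    """Return (recipient, message) pairs, or [] if the metric is healthy."""
    if value <= threshold:
        return []
    # Simplifying assumption: blame the latest deploy that landed before the alert.
    earlier = [d for d in deploys if d.deployed_at <= alert_at]
    if not earlier:
        return []
    suspect = max(earlier, key=lambda d: d.deployed_at)
    message = (f"{metric} is at {value} (threshold {threshold}) since your "
               f"deploy of {', '.join(suspect.changes)}")
    return [(author, message) for author in suspect.authors]


deploys = [  # hypothetical deploy history
    Deploy(datetime(2023, 6, 27, 9, 0), ["sam"], ["cache-tuning"]),
    Deploy(datetime(2023, 6, 27, 15, 0), ["ana", "raj"], ["batch-writes"]),
]
alert_time = datetime(2023, 6, 27, 16, 30)
for who, msg in failure_notifications(deploys, alert_time, "p99 latency (ms)", 950, 800):
    print(who, "->", msg)
```

Sending the alert to the authors, rather than only to a shared on-call channel, is what puts it in front of the people who still have the context.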

Implement notifications based on team needs

The effort it takes to implement notifications can range from the straightforward task of clicking a button in a system to the complex challenge of creating a bespoke, highly specific solution. The complexity of the process varies with the level of detail and customization required.

Notification systems can also be custom built to send messages based on specific context. I've seen one custom system use encoded rules to notify individuals when a change enters a specific environment and then start a wait period (a 'soak'). After the soak, the system pings the individuals involved in the changes, asking whether they approve advancing to the next stage. Once a majority approves, it starts the next process while notifying all the involved parties of the progress.
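The soak-and-approve flow maps naturally onto a small state machine. This is a minimal sketch of that flow, assuming timestamps in seconds, a periodic `tick` call, and a `notify(recipient, message)` callback. `PromotionGate` and its methods are hypothetical and are not the system described.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PromotionGate:
    """Hypothetical soak-then-approve gate in front of the next stage."""

    participants: set[str]
    soak_seconds: float
    notify: Callable[[str, str], None]  # (recipient, message)
    entered_at: Optional[float] = None
    approvals: set[str] = field(default_factory=set)
    asked: bool = False
    promoted: bool = False

    def _broadcast(self, message: str) -> None:
        for person in sorted(self.participants):
            self.notify(person, message)

    def enter(self, now: float) -> None:
        """The change reached the environment: start the soak and tell everyone."""
        self.entered_at = now
        self._broadcast("Your change entered the environment; soak period started.")

    def tick(self, now: float) -> None:
        """Called periodically; asks for approval once the soak has elapsed."""
        if self.entered_at is None or self.asked:
            return
        if now - self.entered_at >= self.soak_seconds:
            self.asked = True
            self._broadcast("Soak complete. Approve advancing to the next stage?")

    def approve(self, person: str) -> None:
        if not self.asked or self.promoted or person not in self.participants:
            return  # ignore early, post-promotion, or outside approvals
        self.approvals.add(person)
        if 2 * len(self.approvals) > len(self.participants):  # strict majority
            self.promoted = True
            self._broadcast(
                f"{len(self.approvals)}/{len(self.participants)} approved; "
                "starting the next stage."
            )


log = []
gate = PromotionGate({"ana", "raj", "sam"}, soak_seconds=3600,
                     notify=lambda who, msg: log.append((who, msg)))
gate.enter(now=0)
gate.tick(now=1800)   # still soaking: nobody is asked yet
gate.tick(now=3600)   # soak elapsed: each participant is asked to approve
gate.approve("ana")
gate.approve("raj")   # 2 of 3 is a majority
print(gate.promoted)  # True
```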

This intricate, context-specific notification system nudges at the appropriate time, drawing attention when necessary. Without it, the onus would fall on each individual to manually check the status, leading to potential confusion and inefficiency.

Implementing a robust notification system, though beneficial, requires a significant investment of time and people. Effective notifications therefore need strategic planning and deliberate resource allocation.

Getting smart notifications in place is well worth the time and effort, with significant benefits to your engineering process and team. Your developers will be empowered to improve because they have the right context at the right time. They’ll be happier because they’re working more efficiently. And your team can stress less by avoiding middle-of-the-night catastrophes. It’s an automation that helps everyone win.

Stay tuned for our next article in our Automations series, where we'll dive into automated actions.
