Smarter Incident Workflow with New Alert & Notification Policies

Aug 22, 2016 by Kadir Türker Gülsoy

For us incident responders and managers, incident management is a complicated beast: it takes active effort to build an effective workflow for identifying, analyzing, and resolving incidents. Failing to notice a problem is intolerable for us. On the other hand, we don't want so many alerts and notifications that they cause alert fatigue, which may lead to longer response times or to missing significant incidents. To keep these two crucial needs from becoming a dilemma, we are excited to introduce a set of new features: Auto Restart and Alert Count Based Notification Policies.

The Story Behind the Features

Before expanding on these two wonderful features, let's start with the "why" part:

  • Alert fatigue sucks! At the end of the day, it results in longer response times. Besides, desensitization is the foregone conclusion of alert fatigue, so you can easily miss important incidents.
  • Ignorance is bliss. Thomas Gray may not have meant it, but it's true. A low-priority alert should not disturb you in the middle of the night...
  • Ignorance is not bliss, either. Frequency matters. If a low-priority alert occurs more frequently than an acceptable threshold, it may overshadow a critical problem.
  • We should never miss a critical alert. And they shouldn't be forgotten, either. If a critical alert is not resolved within an acceptable period, it may mean that recipients have somehow missed their notifications. An alert may even be forgotten after someone has acknowledged it. Wouldn't it be better to restart the notification flow in these cases?

We at OpsGenie are also incident responders, and we always ask for more and better tools to address the concerns above. These concerns became our motto and code word, so we gathered suggestions from many of our customers and analyzed different approaches. In the end, we decided to add these new members to the family of Alert & Notification Policies, which are the key to a sophisticated incident workflow and smarter alerts.

Hey! The Problem Is Not Resolved, Yet!

It's true that OpsGenie's escalations are all-powerful and even repeatable after a while. However, you may want an alternative form of incident resolution insurance, so that you can restart the notification flow (even conditionally). Auto Restart Notification Policies will step in at that point from now on!

As you may remember from the Alert Notifications Flow, when an alert recipient becomes aware of the alert by viewing its content via any of our apps, OpsGenie stops notifying that recipient to avoid spamming. Furthermore, if someone acknowledges the alert, the notification flow is stopped for all users in most cases. However, when an Auto Restart Notification Policy is triggered, the notification flow starts from scratch, and recipients start getting notifications again even if the alert has been acknowledged or the recipients had already become aware of it.
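The behavior described above can be pictured with a small Python sketch. This is only an illustration under our own assumptions, not OpsGenie's implementation: the `AlertFlow` class and its `mark_aware`, `acknowledge`, `auto_restart`, and `pending_recipients` methods are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class AlertFlow:
    """Hypothetical model of one alert's notification state (illustration only)."""
    recipients: List[str]
    acknowledged: bool = False
    flow_stopped: bool = False                    # True once notifications stop for everyone
    aware: Set[str] = field(default_factory=set)  # recipients who already viewed the alert

    def mark_aware(self, recipient: str) -> None:
        # A recipient who has viewed the alert is no longer notified.
        self.aware.add(recipient)

    def acknowledge(self) -> None:
        # Assumption: acknowledging stops the flow for all users
        # (the post says this holds "in most cases").
        self.acknowledged = True
        self.flow_stopped = True

    def auto_restart(self) -> None:
        # The flow starts from scratch: everyone becomes eligible again,
        # even though the alert itself stays acknowledged.
        self.flow_stopped = False
        self.aware.clear()

    def pending_recipients(self) -> List[str]:
        # Recipients who would still receive notifications right now.
        if self.flow_stopped:
            return []
        return [r for r in self.recipients if r not in self.aware]
```

In this sketch, calling `auto_restart()` after `acknowledge()` makes every recipient pending again while leaving `acknowledged` set, which mirrors the "even if the alert is acknowledged" part of the description.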

A Still Tongue Makes a Wise Head

Do you have some events (or tickets) that are not urgent, but worth an analysis? Do you think that these should notify your responders if they occur more often than an acceptable threshold? Count Based Notification Policies are your savior this time.

Alert deduplication and delayed notifications are two essential tools in preventing alert fatigue. However, low-priority alerts may still occur more frequently than an acceptable threshold (regardless of count or frequency).

Using an alert count based notification policy, you can delay notifications for the matched alerts until the specified count condition is met. You have two alternatives for defining a count based condition:

  • Delay until alert count is X. Notifications can be delayed until the alert is received X times, that is, until the alert count reaches the specified value.
  • Delay until alert is received X times in Y. Notifications can be delayed until the count increases by the specified value within the specified time frame (sliding window).

To put it another way, when an alert is created, OpsGenie will not start sending notifications until the condition specified in the policy becomes true. For example, let’s say we have a policy to notify only after the alert count reaches 3. When the alert is created, the count is 1, so the condition is not met and no one is notified for this alert. Once alerts with the same alias are received 2 more times and the count reaches 3, the system starts sending notifications. From that point on, the behavior is the same as for a newly created alert, in terms of notifications.
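Here is a minimal Python sketch of the "delay until alert count is X" rule from the example above. It is an illustration under our own assumptions, not OpsGenie's code: the `DelayedAlert` class, its `receive` method, and the `disk-usage` alias are hypothetical.

```python
class DelayedAlert:
    """Hypothetical deduplicated alert whose notifications wait for a count threshold."""

    def __init__(self, alias: str, threshold: int) -> None:
        self.alias = alias          # alerts sharing an alias are deduplicated into one
        self.threshold = threshold  # the X in "delay until alert count is X"
        self.count = 0
        self.notifying = False

    def receive(self) -> bool:
        """Record one occurrence; return True only when notifications start."""
        self.count += 1
        if not self.notifying and self.count >= self.threshold:
            self.notifying = True
            return True
        return False


alert = DelayedAlert(alias="disk-usage", threshold=3)
# The first two occurrences stay silent; the third one starts notifications.
started = [alert.receive() for _ in range(3)]
```

With a threshold of 3, `started` ends up as `[False, False, True]`: nobody is notified at creation or on the second occurrence, and the flow begins on the third.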

Similarly, if the policy is configured to notify only when the alert is received 3 times within 5 minutes, the system starts notifications only if the alert is received 3 times within 5 minutes. The system uses a sliding window of 5 minutes when determining the number of alerts within that time frame.
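The sliding window idea can be sketched in Python as well. Again, this is a rough illustration rather than OpsGenie's implementation: `SlidingWindowCondition` and `receive` are hypothetical names, timestamps are plain seconds, and treating an occurrence exactly one window old as expired is our own assumption.

```python
from collections import deque
from typing import Deque


class SlidingWindowCondition:
    """Hypothetical "received X times in Y" check using a sliding time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold            # X occurrences
        self.window_seconds = window_seconds  # within Y seconds
        self._times: Deque[float] = deque()   # timestamps still inside the window

    def receive(self, timestamp: float) -> bool:
        """Record an occurrence; return True if the condition holds at this moment."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop occurrences that have slid out of the window
        # (assumption: an occurrence exactly window_seconds old counts as expired).
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) >= self.threshold


# 3 times within 5 minutes, with occurrences at 0 s, 200 s, 400 s, and 450 s.
condition = SlidingWindowCondition(threshold=3, window_seconds=300)
results = [condition.receive(t) for t in (0, 200, 400, 450)]
```

In this run the third occurrence does not trigger notifications, because the first one has already left the 5-minute window by then; the fourth occurrence does, since three occurrences (200 s, 400 s, 450 s) fall within the same window.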

Short and Sweet: Everyone Deserves Smart Alerts

We are thrilled to introduce these new capabilities to reduce alert noise and help improve your incident response process. We will continue to strive to improve in this area, and we’re looking forward to hearing from you! Please continue to share your use cases, feedback, questions, and feature/improvement requests! You can ping us via chat on our web app or by email.

These features are available for the Enterprise plan only. If you haven’t yet, you can sign up for a free trial now.
