You’ve implemented service desk functionality, managing incidents and problems with SLA management, but you’re still dependent on your customers, internal or external, to tell you that your systems don’t work or your business processes failed.
Leon Swarts, managing director of Taulite, says: “What’s required is monitoring, consisting of well-structured dashboards, enriched alarms, notification rules with correlation and root cause analysis. It must be noted that core to a well-structured monitoring tool is the use of alarms instead of repeating events.”
Basic monitoring is usually implemented using system-specific checks that generate automated e-mails. However, all too often, this results in an overload of notifications and missed failures. “To implement effective monitoring,” says Swarts, “you need to present failure conditions on clear dashboards including geo-coding, if your infrastructure is spread across various locations. Dashboards driven by alarms showcase the health of your systems in real-time and not only during early morning health checks. It’s also essential to have structured escalation and notification functions using voice, e-mail or SMS, or even WhatsApp as communication channels.”
Escalations are important for notifying support staff who are on 24-hour standby when issues require their immediate attention. An automated call that requires acknowledgement of the notification ensures accountability for the resolution and allows support engineers to get a good night’s sleep, knowing that they will be phoned if necessary. No more checking e-mails late at night or early in the morning to ensure everything is up and running.
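As a rough illustration of the acknowledged escalation described above, the sketch below walks a standby chain until someone confirms the call. It is not taken from any particular product: the `escalate` function, the `place_call` callback and the contact names are all hypothetical, and a real deployment would plug in an actual voice, SMS or messaging gateway.

```python
from typing import Callable, Optional, Sequence


def escalate(
    alarm_text: str,
    standby_chain: Sequence[str],
    place_call: Callable[[str, str], bool],
    attempts_per_contact: int = 2,
) -> Optional[str]:
    """Call each contact in order until one acknowledges the alarm.

    `place_call(contact, alarm_text)` is a hypothetical callback that
    returns True only when the contact acknowledges the notification.
    Returns the contact who acknowledged, or None if nobody did.
    """
    for contact in standby_chain:
        for _ in range(attempts_per_contact):
            if place_call(contact, alarm_text):
                return contact  # this person is now accountable
    return None


# Stand-in for a real telephony gateway: only the secondary engineer answers.
call_log = []


def fake_call(contact: str, alarm_text: str) -> bool:
    call_log.append(contact)
    return contact == "secondary engineer"


owner = escalate(
    "Disk failure on db-01",
    ["primary engineer", "secondary engineer", "team lead"],
    fake_call,
)
print(owner)     # secondary engineer
print(call_log)  # ['primary engineer', 'primary engineer', 'secondary engineer']
```

Returning the name of whoever acknowledged is one simple way to record accountability; returning `None` signals that the whole chain was exhausted and a further fallback is needed.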
Integration is also needed to create the necessary incidents on your service management platform. “Feeding into your presentation layer, automation, correlation, root cause analysis, integration and notifications require alarms. Alarms are different from events, such as automated e-mail notifications repeatedly notifying about a single failure, in that alarms have a state, whether this is cleared or active. For an alarm, a state change from cleared to active or from active to cleared triggers automated notifications, escalations or presentation on dashboards. Integrity of the alarm is very important for support and for stakeholders to trust the state presented of an alarm.”
The benefit of using alarms rather than events is that you have only one alarm for a disk failure, compared with hundreds of continuous events reporting the same failure. This makes it easy to initiate automation, notify support staff or escalate to the relevant areas, as a message or action can be triggered by the state change.
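To make the difference concrete, here is a minimal sketch of how repeating events might be folded into a single stateful alarm, with a notification sent only when the state changes. The `Alarm` and `AlarmTable` names, and the idea of keying an alarm on a source and a check name, are assumptions made for illustration rather than a description of any specific monitoring tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Alarm:
    source: str            # e.g. a host or service name
    check: str             # e.g. "disk"
    active: bool = False   # an alarm is either active or cleared
    event_count: int = 0   # how many raw events were folded into it


class AlarmTable:
    """Folds repeating events into one alarm per (source, check)."""

    def __init__(self, on_change: Callable[[Alarm], None]) -> None:
        self._alarms: Dict[Tuple[str, str], Alarm] = {}
        self._on_change = on_change

    def process_event(self, source: str, check: str, failed: bool) -> bool:
        """Record one event; return True if the alarm state changed."""
        alarm = self._alarms.setdefault((source, check), Alarm(source, check))
        alarm.event_count += 1
        if alarm.active == failed:
            return False          # repeat of a known state: stay silent
        alarm.active = failed     # cleared -> active or active -> cleared
        self._on_change(alarm)
        return True


notifications = []


def notify(alarm: Alarm) -> None:
    state = "ACTIVE" if alarm.active else "CLEARED"
    notifications.append(f"{alarm.source}/{alarm.check} is now {state}")


table = AlarmTable(on_change=notify)
for _ in range(300):                      # 300 repeated failure events
    table.process_event("db-01", "disk", failed=True)
table.process_event("db-01", "disk", failed=False)  # the disk recovers

print(notifications)
# ['db-01/disk is now ACTIVE', 'db-01/disk is now CLEARED']
```

Three hundred and one events produce only two notifications, one for each state change, which is the behaviour the paragraph above describes.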
Alarms can be sourced by intelligently converting repeating events into a single alarm or by synchronising with an existing list of active alarms. They can also be generated from automated checks that query data or run scripts. Linking alarms to the infrastructure and object types commonly stored in a configuration management database (CMDB) is essential for later correlation and root cause analysis.
Discipline is required to define alarm descriptions, alarm causes and alarm repair actions linked to the object types of your infrastructure or CMDB. As alarms are far fewer in number than events, and are well structured, it is easy to enrich them with relevant information from the rest of the organisation.
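One possible way to picture this enrichment is a lookup that joins an alarm to its CMDB record and then to a catalogue of descriptions, causes and repair actions defined per object type. The sketch below uses plain dictionaries with made-up sample entries; the `CMDB` and `CATALOGUE` structures, their field names and the `enrich` function are hypothetical and stand in for whatever stores an organisation actually uses.

```python
from typing import Any, Dict

# Hypothetical CMDB extract: configuration item -> attributes.
CMDB: Dict[str, Dict[str, Any]] = {
    "db-01": {"object_type": "database_server", "site": "site-a"},
}

# Hypothetical alarm catalogue keyed by (object type, alarm type).
CATALOGUE: Dict[tuple, Dict[str, str]] = {
    ("database_server", "disk_failure"): {
        "description": "A disk in the database server has failed",
        "cause": "Hardware fault reported by the storage controller",
        "repair_action": "Replace the failed disk and rebuild the array",
    },
}


def enrich(alarm: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the alarm with CMDB and catalogue fields added.

    Alarms whose source is not in the CMDB, or whose type has no
    catalogue entry, are returned with whatever could be found.
    """
    ci = CMDB.get(alarm["source"], {})
    entry = CATALOGUE.get((ci.get("object_type"), alarm["alarm_type"]), {})
    return {**alarm, **ci, **entry}


enriched = enrich({"source": "db-01", "alarm_type": "disk_failure", "active": True})
print(enriched["site"])           # site-a
print(enriched["repair_action"])  # Replace the failed disk and rebuild the array
```

Because the catalogue is keyed by object type rather than by individual device, one well-written entry covers every configuration item of that type, which is where the discipline of defining it pays off.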
“With effective monitoring in place, you will be in control of your systems and processes, managing by exception only, creating more time to build your business and empower you to be proactive and know about issues before customer experience is affected,” concludes Swarts.