X-ISS Customizes Monitoring System for Faster, More Focused Alerting

When a long-time client added nodes to its HPC cluster, it asked X-ISS to customize the open-source monitoring system already in place to provide faster alerts at the first sign of a critical failure. The X-ISS team streamlined the monitoring system and aggregated alerts so operations personnel could quickly pinpoint trouble without being confused by dozens of simultaneous notifications. As part of the project, X-ISS also integrated temperature and power sensors with the DecisionHPC® platform to provide greater insight into cluster operations.

“Monitoring systems can get cluttered as new servers are added over time, and that’s what happened with this client,” said X-ISS President Deepak Khosla. “By customizing their existing monitoring system, we shortened the alert time for critical failures from 15 minutes to just three minutes, and we aggregated up to 70 alerts into one.”

The client, a leader in providing advanced seismic data processing and visualization services to oil and gas clients, has thousands of compute nodes spread out over several datacenters. The Zabbix open-source monitoring solution was already in place at the time of the most recent cluster upgrade. Although Zabbix is an excellent alert system, the default set of system checks it performs does not scale to thousands of nodes, as was needed in this case. If not correctly configured, the monitoring system can become overwhelmed, which slows the notification process. In addition, the open-source solution comes with pre-configured alert settings, or templates, which may or may not satisfy the needs of all users.

Such was the case for this client. Zabbix performs a specified number of checks on each server and switch on a periodic basis. To avoid false alarms, these critical infrastructure elements have to record multiple failures during the check cycle before an alert is generated. With the default monitoring settings, the client found that by the time an alert reached operations personnel by email, the failure had progressed too far for a successful recovery.

The X-ISS team wrote custom Zabbix scripts that shortened both the time between critical system checks and, more importantly, the overall time that elapsed before an alert was sent via email to the operations staff. In many cases, this means the staff can address an issue before it impacts applications and end users.
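The relationship between check interval, required failure count and alert delay can be illustrated with a short Python sketch. It is not the client's configuration: the function names, the five-minute and one-minute intervals and the three-failure rule are assumptions chosen only because they reproduce the 15-minute and three-minute figures quoted above.

```python
def worst_case_latency_s(check_interval_s: int, failures_required: int) -> int:
    """Worst-case seconds from failure onset to the alert firing.

    Assumes the failure starts just after a passing check, so the alert
    fires on the Nth consecutive failed check. Email delivery time is ignored.
    """
    return check_interval_s * failures_required


def should_alert(results: list[bool], failures_required: int) -> bool:
    """Return True if the most recent `failures_required` checks all failed.

    `results` holds check outcomes in time order; True means the check passed.
    """
    recent = results[-failures_required:]
    return len(recent) == failures_required and not any(recent)


# Hypothetical settings: 5-minute checks vs. 1-minute checks, 3 failures each.
default_delay = worst_case_latency_s(300, 3)   # 900 seconds
tuned_delay = worst_case_latency_s(60, 3)      # 180 seconds
```

Keeping the failure count fixed while shortening the interval preserves the protection against false alarms and still cuts the delay, which is one plausible way to reach the improvement described.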

“For our clients, keeping their applications running without interruption is of crucial importance,” said Khosla. “The default settings in Zabbix had to be shortened and then made consistent across all the nodes.”

X-ISS also created Zabbix scripts that send specific warning emails and texts to designated individuals when a major failure occurs, such as 33 nodes going down in one group. Rather than 33 separate alerts, as is the Zabbix default, a single alert goes out with the message, “33 nodes in group 5 have failed.” A similar email goes to X-ISS as part of the ongoing ManagedHPC® service provided to clients.
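A minimal Python sketch of this kind of group-level aggregation follows. It is an illustration only, not the X-ISS scripts: the function name, the (node, group) input format and the `min_group_failures` cutoff are all hypothetical.

```python
from collections import defaultdict


def aggregate_node_failures(failed_nodes, min_group_failures=2):
    """Collapse per-node failures into one message per group.

    `failed_nodes` is an iterable of (node_name, group_id) pairs.
    Groups with fewer than `min_group_failures` failed nodes (a hypothetical
    cutoff) still get one message per node.
    """
    by_group = defaultdict(list)
    for node, group in failed_nodes:
        by_group[group].append(node)

    messages = []
    for group in sorted(by_group):
        nodes = by_group[group]
        if len(nodes) >= min_group_failures:
            messages.append(f"{len(nodes)} nodes in group {group} have failed.")
        else:
            messages.extend(
                f"Node {node} in group {group} has failed." for node in nodes
            )
    return messages


# 33 failed nodes in group 5 produce a single message instead of 33.
failed = [(f"node{i:02d}", 5) for i in range(33)]
summary = aggregate_node_failures(failed)
```

Here `summary` contains one string, matching the single-alert message quoted in the text.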

Next on the client’s customization wish list was aggregation of file system alerts. Each compute node could have 50 to 80 file systems mounted at one time, depending on the type of workload. Zabbix sends an alert each time a single file system gets full. When multiple alerts arrive at once, operations personnel have to search the summary to find every file system that is full, which gets confusing.

X-ISS again rewrote the scripts so that only one alert is sent when a file system on any given node exceeds the threshold. A list of file system capacities is still generated, but the report highlights which file systems are full and which others are nearing their thresholds. This lets operators stay one step ahead of their users, adding new storage capacity before users complain.
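The per-node report might look something like the following Python sketch. The function name, the 95% and 85% thresholds and the report layout are assumptions made for illustration; the case study does not state the actual values or format.

```python
def filesystem_report(node, usage, full_pct=95.0, warn_pct=85.0):
    """Build one alert per node from per-file-system usage percentages.

    `usage` maps mount point to percent used. Returns None when nothing
    is over `full_pct`, so no alert is sent. Thresholds are hypothetical.
    """
    full = sorted(m for m, pct in usage.items() if pct >= full_pct)
    near = sorted(m for m, pct in usage.items() if warn_pct <= pct < full_pct)
    if not full:
        return None

    lines = [f"{node}: {len(full)} file system(s) over {full_pct:g}% full"]
    lines += [f"  FULL  {m} ({usage[m]:g}%)" for m in full]
    lines += [f"  NEAR  {m} ({usage[m]:g}%)" for m in near]
    return "\n".join(lines)


# Hypothetical node with three mounts: one full, one nearing the threshold.
usage = {"/scratch1": 97.0, "/data2": 88.5, "/home": 40.0}
report = filesystem_report("node017", usage)
```

A single message then lists the full file system first and the one approaching its threshold second, while the healthy mount is left out of the highlights.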

“Operations personnel like to know about issues – and avoid a failure – before their end users know something is wrong,” said Khosla.

As part of this customization, X-ISS also wrote a special script to notify a specific individual in operations whenever a file is added to or deleted from a file system. In addition, the team prepared scripts to gather environmental temperature and humidity data from a facility management system in the data center and server rooms. An email alert is sent via Zabbix when certain thresholds are exceeded.

For reporting and analytics, the client also uses X-ISS DecisionHPC® software, a high-performance package that delivers business insights into cluster usage via a single dashboard so managers can keep operations running efficiently. During the monitoring upgrade, X-ISS also integrated temperature, power and other data into the DecisionHPC dashboard.

This was accomplished by writing scripts that query sensors in each node to determine its internal temperature and power usage. This data is first sent to a Ganglia distributed monitoring system, from which DecisionHPC pulls it and presents it as a color-coded map on the dashboard. Operations personnel simply glance at the map to see if all colors are the same. If one or more nodes show a different color, the operator can zoom in on that node to determine why the temperature or power usage is out of the normal range.
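The idea behind a color-coded map can be sketched in a few lines of Python. This is not how DecisionHPC computes its colors, which the case study does not describe; the function names, the three colors, the 10% margin and the temperature range are hypothetical.

```python
def node_color(value, normal_low, normal_high, margin=0.1):
    """Classify a sensor reading against a normal range.

    Returns "green" inside the range, "yellow" within `margin` (a fraction
    of the range width) outside it, and "red" beyond that.
    """
    if normal_low <= value <= normal_high:
        return "green"
    slack = margin * (normal_high - normal_low)
    if normal_low - slack <= value <= normal_high + slack:
        return "yellow"
    return "red"


def out_of_range_nodes(readings, normal_low, normal_high):
    """Return {node: color} for nodes whose reading is not 'green'."""
    colors = {n: node_color(v, normal_low, normal_high) for n, v in readings.items()}
    return {n: c for n, c in colors.items() if c != "green"}


# Hypothetical node temperatures in degrees Celsius, normal range 40 to 70.
temps = {"node001": 55.0, "node002": 72.0, "node003": 80.0}
flagged = out_of_range_nodes(temps, 40.0, 70.0)
```

Only the nodes that stand out are returned, which mirrors an operator scanning the map for a tile that differs from its neighbors.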

“The bottom line advantage of customizing and aggregating alerts is that the client now has deeper insight into its HPC cluster operations,” said Khosla. “And the operations personnel know about issues before they spiral into full-blown failures that disrupt the work of their end users.”
