System Monitoring and Alerts

Introduction to Alerts

A Cloudmin system alert monitors one or more managed systems, and checks if variables such as free memory, CPU load or free disk space are above or below some threshold. If so, it can send email to the master administrator, system owners or other addresses.

Alerts can be used to detect overloaded systems, lack of resources, or system crashes. They can be limited to a subset of the hosts managed by Cloudmin, either selected explicitly or by group membership, virtualization type or owner. This allows alert thresholds to be tailored to the systems being monitored.

Alerts can be found under the System Monitoring category on the left menu, on the System Alerts page. They are available only in Cloudmin versions 3.7 and later.

Creating a New Alert

Before you create a new alert, you need to decide which system variable(s) it is going to monitor, what thresholds it will trigger at, and how long each threshold must be crossed before it triggers. For example, you might want to check if CPU load exceeds 1.0 for 30 minutes, or free memory is under 64M for 1 hour.

To create an alert, follow these steps:

  1. Open the System Monitoring category on the left menu, and click on System Alerts.
  2. Click the Add a new system alert link to bring up the alert creation page.
  3. In the Systems to monitor for this alert section, choose the hosts that you want to monitor. They can be chosen by hostname, group, owner or type, or you can choose to monitor all systems. The Exclude down or un-managable systems box can be checked to ignore systems that have been intentionally shut down. However, it should not be checked if you are creating an alert that is meant to fire when a system is down.
  4. In the Conditions to trigger alert section, you can enter one or more rules that will be checked when this alert is evaluated. Each rule has the following columns:
    • Variable - The data point collected from each system that the rule will compare.
    • Comparison - Set this to Over to have the rule trigger if the variable is over some limit (e.g. CPU load), or Under to trigger if it is below (e.g. Memory free).
    • Value - The threshold at which the rule will match. You can use suffixes like MB and GB for rules matching memory or disk space.
    • Time period - How long the variable must be over or under the threshold value for this rule to trigger. Don't enter anything less than the Cloudmin status collection interval of 5 minutes, or the rule will never match.
    • Minimum systems - The number of hosts on which this rule must match for the alert to fire.
  5. By default all conditions must be true for the alert to trigger. However, you can select Trigger alert when any condition is true to change this.
  6. In the Alert notifications section, select who will receive email notifications when the alert fires. The default is to only email the Cloudmin master administrator, whose address is set on the Email Settings page.
  7. To limit the rate at which email is sent if the alert continues to fire, fill in the Interval between messages field. The Also email when conditions stop matching box can be checked if you want Cloudmin to notify you when the monitored systems return to a healthy state.
  8. Click the Create button.
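
How the rule columns in step 4 and the all/any option in step 5 combine can be sketched in Python. This is only an illustrative model, not Cloudmin's actual implementation: the Rule class, the variable names "load" and "memfree", the sample layout and all function names are hypothetical, and Over and Under are assumed to be strict comparisons.

```python
from dataclasses import dataclass

# Status collection interval stated in the text, in minutes.
COLLECTION_INTERVAL_MIN = 5


@dataclass
class Rule:
    variable: str         # hypothetical variable name, e.g. "load"
    comparison: str       # "over" or "under"
    value: float          # threshold
    period_min: int       # how long the threshold must be crossed, in minutes
    min_systems: int = 1  # hosts on which the rule must match


def breached(rule, sample):
    """True if one sample is on the wrong side of the threshold."""
    if rule.comparison == "over":
        return sample > rule.value
    return sample < rule.value


def rule_matches(rule, history):
    """history maps hostname -> samples, oldest first, one per interval."""
    needed = rule.period_min // COLLECTION_INTERVAL_MIN
    if needed < 1:
        # A period shorter than the collection interval never matches.
        return False
    hosts = 0
    for samples in history.values():
        recent = samples[-needed:]
        if len(recent) == needed and all(breached(rule, s) for s in recent):
            hosts += 1
    return hosts >= rule.min_systems


def alert_fires(rules, histories, any_condition=False):
    """histories maps variable name -> {hostname: samples}."""
    results = [rule_matches(r, histories.get(r.variable, {})) for r in rules]
    return any(results) if any_condition else all(results)


# The two examples from the text: load over 1.0 for 30 minutes,
# and free memory (here in MB) under 64 for 1 hour.
rules = [Rule("load", "over", 1.0, 30), Rule("memfree", "under", 64, 60)]
histories = {
    "load": {"host1": [1.5] * 6},       # 6 samples = 30 minutes over 1.0
    "memfree": {"host1": [128] * 12},   # 12 samples = 1 hour, never under 64
}
print(alert_fires(rules, histories))                      # False
print(alert_fires(rules, histories, any_condition=True))  # True
```

In this model only the load rule matches, so the alert fires only when the any-condition option is selected.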

Once the alert is created, it will appear in the list of alerts, with its current status shown in the right-hand column.

When creating an alert rule that matches if a system is down (using the System up variable), the comparison should be set to Under and the value to 1. Cloudmin uses the value 0 to indicate a system is completely down, 0.5 when it is up but not contactable via SSH, and 1 for a fully operational host.
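
A short Python sketch shows why Under 1 is the suggested setting. It assumes that Under is a strict less-than comparison, which the text does not state explicitly; the dictionary and function names are hypothetical.

```python
# System up values and their meanings, as described in the text.
SYSTEM_UP_STATES = {
    0: "completely down",
    0.5: "up but not contactable via SSH",
    1: "fully operational",
}


def matches_under(value, threshold):
    """Assumed semantics of the Under comparison: strictly less than."""
    return value < threshold


def states_matched(threshold):
    """Return the states that an Under rule with this threshold would catch."""
    return [name for value, name in SYSTEM_UP_STATES.items()
            if matches_under(value, threshold)]


print(states_matched(1))
print(states_matched(0.5))
```

Under this assumption, a threshold of 1 catches both the completely down and the not-contactable states, while a threshold of 0.5 would catch only systems that are completely down.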

Managing Alerts

To edit an alert, just click on it in the System Alerts page. All the same settings that you entered when creating it originally can be modified.

To remove one or more alerts, check the boxes next to them on the System Alerts page, and click the Delete Selected Alerts button.

When any rule on any alert matches at least one system, the alert will be displayed in the Firing alerts section of the System Alerts page - even if not enough systems or rules match to cause the alert to send email. You can view the history of the variable that caused it to fire by clicking the Graph link in the right-most column.
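
The difference between an alert being listed as firing and an alert actually sending email can be modelled as follows. This is a hedged sketch rather than Cloudmin's own logic; the function name and its arguments are hypothetical.

```python
def alert_status(match_counts, min_systems, any_condition=False):
    """Return (displayed, sends_email) for one alert.

    match_counts[i] is the number of systems on which rule i matches;
    min_systems[i] is that rule's Minimum systems setting.
    """
    # Listed under Firing alerts as soon as any rule matches any system.
    displayed = any(count >= 1 for count in match_counts)
    # Email requires each rule (or any rule) to reach its minimum.
    satisfied = [count >= need for count, need in zip(match_counts, min_systems)]
    sends_email = any(satisfied) if any_condition else all(satisfied)
    return displayed, sends_email


# Rule 1 matches one system but needs two; rule 2 matches none.
print(alert_status([1, 0], [2, 1]))  # (True, False)
```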

Silencing Alerts

To stop an alert from sending email without actually deleting it, check the box next to it on the System Alerts page, and click the Silence Alerts button. Similarly, to remove the silence use the Un-Silence Alerts button. This can be useful when you are creating or debugging a new alert and don't want it sending out spurious email notifications.
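
The combined effect of silencing and the Interval between messages field on outgoing email can be pictured with a small Python sketch. It is an assumption-laden model, not Cloudmin code: the function name, its parameters and the exact interval comparison are hypothetical.

```python
from datetime import datetime, timedelta


def should_send_email(firing, silenced, last_sent, now, interval):
    """Decide whether a notification would go out at time `now`."""
    if not firing or silenced:
        # A silenced alert is still evaluated but sends nothing.
        return False
    if last_sent is None:
        # First notification since the alert started firing.
        return True
    # Otherwise wait until the configured interval has passed.
    return now - last_sent >= interval


now = datetime(2024, 1, 1, 12, 0)
hour = timedelta(hours=1)
print(should_send_email(True, True, None, now, hour))    # False
print(should_send_email(True, False, None, now, hour))   # True
print(should_send_email(True, False, now - timedelta(minutes=10), now, hour))  # False
```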