System Monitoring and Alerts
================================

Introduction to Alerts
--------------------------------

A Cloudmin system alert monitors one or more managed systems, and checks if
variables such as free memory, CPU load or free disk space are above or below
some threshold. If so, it can send email to the master administrator, system
owners or other addresses.

Alerts can be used to detect overloaded systems, lack of resources, or system crashes. They can be limited to a subset of the hosts managed by Cloudmin, either selected explicitly or by group membership, virtualization type or owner. This allows alert thresholds to be tailored to the systems being monitored.

Alerts can be found under the **System Monitoring** category on the left menu, on the **System Alerts** page. They are only available in Cloudmin versions 3.7 and later.

Creating a New Alert
-------------------------------

Before you create a new alert, you need to decide which system variable(s) it is going to monitor, what thresholds it will trigger at, and for how long each threshold must be exceeded. For example, you might want to check if CPU load exceeds 1.0 for 30 minutes, or free memory is under 64M for 1 hour.
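
The idea of a threshold that must be exceeded for some period can be pictured with a short sketch. This is not Cloudmin's implementation; the function name `exceeded_for` and the sample layout are hypothetical, and the sketch simply assumes one collected value every 5 minutes.

```python
from datetime import datetime, timedelta

def exceeded_for(samples, threshold, period, over=True):
    """Return True if every sample in the trailing `period` breaches the threshold.

    `samples` is a list of (timestamp, value) pairs, oldest first.
    (Hypothetical helper for illustration only.)
    """
    if not samples:
        return False
    window_start = samples[-1][0] - period
    # Without history reaching back to the start of the window,
    # we cannot say the condition held for the whole period.
    if samples[0][0] > window_start:
        return False
    recent = [value for when, value in samples if when >= window_start]
    if over:
        return all(value > threshold for value in recent)
    return all(value < threshold for value in recent)

start = datetime(2024, 1, 1, 12, 0)
# Seven samples, five minutes apart, covering 30 minutes of CPU load
load = [(start + timedelta(minutes=5 * i), 1.4) for i in range(7)]
print(exceeded_for(load, 1.0, timedelta(minutes=30)))  # True
```

A single sample that dips back under the limit inside the window makes the check fail, which is why short spikes do not trigger this kind of rule.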

To create an alert, follow these steps:

1. Open the **System Monitoring** category on the left menu, and click on **System Alerts**.
1. Click the **Add a new system alert** link to bring up the alert creation page.
1. In the **Systems to monitor for this alert** section, choose the hosts that you want to monitor. They can be chosen by hostname, group, owner or type, or you can choose to monitor all systems.
1. The **Exclude down or un-managable systems** box can be checked to ignore systems that have been intentionally shut down. Do not check it if you are creating an alert that is meant to fire when a system is down.
1. In the **Conditions to trigger alert** section, you can enter one or more rules that will be checked when this alert is evaluated. Each rule has the following columns:
   * Variable - The data point collected from each system that the rule will compare.
   * Comparison - Set this to **Over** to have the rule trigger if the variable is over some limit (e.g. CPU load), or **Under** to trigger if it is below (e.g. Memory free).
   * Value - The threshold at which the rule will match. You can use suffixes like MB and GB for rules matching memory or disk space.
   * Time period - How long the variable must be over or under the threshold value for this rule to trigger. Don't enter anything less than the Cloudmin status collection interval of 5 minutes, or the rule will never match.
   * Minimum systems - The number of hosts on which this rule must match for the alert to fire.
1. By default all conditions must be true for the alert to trigger. However, you can select **Trigger alert when any condition is true** to change this.
1. In the **Alert notifications** section, select who will receive email notifications when the alert fires. The default is to only email the Cloudmin master administrator, whose address is set on the **Email Settings** page.
1. To limit the rate at which email is sent if the alert continues to fire, fill in the **Interval between messages** field. The **Also email when conditions stop matching** box can be checked if you want Cloudmin to notify you when the monitored systems return to a healthy state.
1. Click the **Create** button.

Once the alert is created, it will appear in the list of alerts, with its current status shown in the right-hand column.
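
As a rough illustration of how the **Comparison** and **Value** columns of a rule fit together, the sketch below parses a threshold with an optional size suffix and compares a collected value against it. The helper names `parse_value` and `rule_matches` are hypothetical, and the use of binary multiples (1 MB = 1024 * 1024 bytes) is an assumption, not something confirmed for Cloudmin.

```python
import re

# Assumption: binary multiples; Cloudmin's own interpretation may differ.
UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

def parse_value(text):
    """Turn a threshold such as '1.0', '64MB' or '2 GB' into a plain number."""
    match = re.fullmatch(
        r"\s*([0-9]*\.?[0-9]+)\s*([KMGT]?)B?\s*", text, re.IGNORECASE
    )
    if not match:
        raise ValueError(f"unrecognised threshold: {text!r}")
    number, unit = match.groups()
    return float(number) * UNITS[unit.upper()]

def rule_matches(value, comparison, threshold_text):
    """Apply an Over or Under comparison to one collected value."""
    threshold = parse_value(threshold_text)
    if comparison == "Over":
        return value > threshold
    if comparison == "Under":
        return value < threshold
    raise ValueError(f"unknown comparison: {comparison!r}")

# 48 MB of free memory is under a 64MB threshold
print(rule_matches(48 * 1024 ** 2, "Under", "64MB"))  # True
# A CPU load of 1.4 is over a threshold of 1.0
print(rule_matches(1.4, "Over", "1.0"))               # True
```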

When creating an alert rule that matches if a system is down (using the **System up** variable), the comparison should be set to **Under** and the value to **1**. Cloudmin uses the value 0 to indicate a system is completely down, 0.5 when it is up but not contactable via SSH, and 1 for a fully operational host.
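
The effect of choosing **Under** and **1** can be seen by checking each of the three **System up** values against that threshold. This is only a sketch of the comparison; the names `SYSTEM_UP_STATES` and `down_rule_matches` are hypothetical.

```python
# The three System up values described above, with their meanings
SYSTEM_UP_STATES = {
    0: "completely down",
    0.5: "up, but not contactable via SSH",
    1: "fully operational",
}

def down_rule_matches(system_up, threshold=1):
    """Mimic a rule with comparison Under and value 1."""
    return system_up < threshold

for value, meaning in SYSTEM_UP_STATES.items():
    print(f"{value}: {meaning} -> rule matches: {down_rule_matches(value)}")
```

Both 0 and 0.5 are below 1, so the rule matches a host that is fully down as well as one that is running but unreachable over SSH.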

Managing Alerts
------------------------

To edit an alert, click on it on the **System Alerts** page. All of the settings that you entered when creating it can be modified.

To remove one or more alerts, check the boxes next to them on the **System Alerts** page, and click the **Delete Selected Alerts** button.

When any rule on any alert matches at least one system, the alert will be displayed in the **Firing alerts** section of the **System Alerts** page, even if not enough systems or rules match to cause it to send email. You can view the history of the variable that caused it to fire by clicking the **Graph** link in the right-most column.
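
The difference between an alert being listed as firing and an alert actually sending email can be sketched as follows. The function `evaluate_alert` and its arguments are hypothetical, and the sketch assumes that each rule carries its own minimum systems count and that rules are combined with either "all" or "any", as described in the creation steps.

```python
def evaluate_alert(matched_hosts, min_systems, any_condition=False):
    """Return (displayed, emails) for one alert.

    matched_hosts: one set of hostnames per rule, listing where it matched.
    min_systems:   the minimum systems count for each rule, in the same order.
    any_condition: True if any single satisfied rule is enough to send email.
    """
    # Listed under Firing alerts as soon as any rule matches any system
    displayed = any(len(hosts) >= 1 for hosts in matched_hosts)
    # A rule is satisfied only once it matches enough systems
    satisfied = [
        len(hosts) >= minimum
        for hosts, minimum in zip(matched_hosts, min_systems)
    ]
    combine = any if any_condition else all
    emails = bool(satisfied) and combine(satisfied)
    return displayed, emails

# CPU rule matches one host but needs two; memory rule matches nothing
print(evaluate_alert([{"web1"}, set()], [2, 1]))  # (True, False)
```

In this example the alert would appear in the list of firing alerts, but no email would be sent until the CPU rule matches a second host and the memory rule matches as well.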

Silencing Alerts
-----------------------

To stop an alert from sending email without actually deleting it, check the box next to it on the **System Alerts** page, and click the **Silence Alerts** button. Similarly, to remove the silence, use the **Un-Silence Alerts** button. This can be useful when you are creating or debugging a new alert and don't want it sending out spurious email notifications.