ClustrixDB constantly self-monitors to ensure your cluster is healthy and operating optimally. When it detects conditions that require attention, ClustrixDB will send alerts via email using its Alerter. Alerts are of different severities (INFO, WARNING, ERROR, and CRITICAL) and ClustrixDB is preconfigured with default thresholds for each.
The contacts and communication details that control how alerts are sent must be configured for your cluster.
Use the following steps to configure the alerts for your system:
Set these identifying global variables for your database. These are especially important to aid Clustrix Support in troubleshooting.
sql> SET GLOBAL customer_name = 'customer name';
sql> SET GLOBAL cluster_name = 'cluster identifier';
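One way to confirm the values afterwards is sketched below. It assumes the MySQL-style `SHOW GLOBAL VARIABLES` statement is available on your cluster; that assumption is not stated in this guide.

```sql
-- Assumption: MySQL-compatible SHOW VARIABLES syntax is supported.
SHOW GLOBAL VARIABLES LIKE 'customer_name';
SHOW GLOBAL VARIABLES LIKE 'cluster_name';
```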
The parameters defined in the system.alerts_parameters table control how alerts are formatted and sent.
ClustrixDB requires an SMTP server to send the alert messages. These instructions presume that an SMTP server has already been set up correctly for your environment. For specifics on establishing an SMTP server in AWS, see Setting up an SMTP Server.
Set the following SMTP parameters as they apply to your cluster.
Parameter Name | What's Needed? | Required? |
---|---|---|
smtp_server | hostname for SMTP server | Yes |
smtp_port | SMTP port for your environment, if different from the default TCP port 25. | Yes |
smtp_username | SMTP username | No |
smtp_password | SMTP password | No |
smtp_security | SMTP security type. Must be SMTPS or TLS. | No |
Follow this syntax to update the parameters shown:
sql> UPDATE system.alerts_parameters SET value = 'your SMTP-specific value' WHERE name = 'parameter name';
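As an illustration, the statements below apply that syntax to a hypothetical environment. The host name `smtp.example.com`, the port `587`, and the choice of `TLS` are placeholder assumptions, not values from this guide; substitute the settings for your own SMTP server. The final query relies only on the `name` and `value` columns used in the syntax above and is one way to review the result.

```sql
-- Hypothetical values: replace with the details of your own SMTP server.
UPDATE system.alerts_parameters SET value = 'smtp.example.com' WHERE name = 'smtp_server';
UPDATE system.alerts_parameters SET value = '587' WHERE name = 'smtp_port';
UPDATE system.alerts_parameters SET value = 'TLS' WHERE name = 'smtp_security';

-- Review the SMTP-related settings now stored in the table.
SELECT name, value FROM system.alerts_parameters WHERE name LIKE 'smtp%';
```

As with any change to this table, the new values take effect only after the Alerter is reset.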
Add email addresses of the individual(s) or group(s) who are to receive the alerts to the system.alerts_subscriptions table. You can insert, update, and delete from this table using standard SQL.
To see the current list of alert subscriptions:
sql> SELECT * FROM system.alerts_subscriptions;
To add a new email address:
sql> INSERT INTO system.alerts_subscriptions VALUES ('user@domain_name.com');
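Removing a subscriber follows the same standard SQL pattern. The sketch below is hedged: this guide does not name the columns of system.alerts_subscriptions, so the column name `email` is a hypothetical placeholder, and the `DESCRIBE` statement (assumed to be available as in MySQL) is shown as a way to find the real column name first.

```sql
-- Assumption: MySQL-style DESCRIBE is supported; use it to find the actual column name.
DESCRIBE system.alerts_subscriptions;

-- Hypothetical column name "email" and placeholder address: adjust both before running.
DELETE FROM system.alerts_subscriptions WHERE email = 'former_user@domain_name.com';
```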
Any time that changes are made to the system.alerts_parameters or system.alerts_subscriptions table(s), the Alerter must be RESET. Your changes will not take effect until this is done.
To reset the Alerter:
sql> ALTER CLUSTER RESET ALERTER;
This will not cause a group change on your cluster.
If invalid information is provided, you may encounter the following error:
sql> ALTER CLUSTER RESET ALERTER;
ERROR 1 (HY000): [64512] Bad configuration for alerts:
Check clustrix.log for more information. Here is an example where the smtp_server parameter was not specified:
2018-10-11 21:07:51.068524 UTC karma068.colo.sproutsys.com clxnode: ERROR cluster/alerter.ct:219 prepare_write(): Couldn't write alerter config: Bad configuration for alerts: No smtp_server specified
To verify that the configuration works properly, execute this SQL to send a test alert:
sql> SELECT alert(severity, 'alert text');
If you do not receive the expected email alert, please re-review your configuration.
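For instance, a test alert at the lowest severity might look like the following. This sketch assumes that `severity` is passed as the numeric level (3 for Informational, per the severity levels listed in this document); the message text is arbitrary.

```sql
-- Assumption: severity is given as a numeric level (0 = Critical ... 3 = Informational).
SELECT alert(3, 'Test alert: verifying Alerter configuration');
```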
Here are some sample emailed alert messages that may be similar to some you could encounter on your cluster. These alerts will also appear in the query.log.
This alert is a WARNING for a cluster with a device1 file that is at least 80% full. If you receive a similar warning, see “Issue Resolution” in Managing File Space and Database Capacity.
Severity: WARNING
Date: 2018-10-02 18:49:24.177250 UTC
Host: clxdb003
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
Message: Database space is 80% used. Soon user queries will fail. path=/data/clustrix/device1 device_total=4,247,830,372,352 wal_total=1,073,741,824 device_free=327,733,190,656 temp_total_space=161,061,273,600 system_avail=758,480,666,624 system_total=3,757,962,166,272 total_used=2,999,481,499,648 %=80 user_avail=382,684,449,996 user_total=3,382,165,949,644 cont_type=USER trx_type=USER
This INFO alert shows that the backup has failed. If you receive similar errors during backup processing, please see List of Errors for Backup and Restore. This particular sample shows additional information that is available from clusters deployed in AWS.
Severity: INFO
Date: 2018-09-25 23:42:59.798249 UTC
Host: clxdb005
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
EC2 Region: us-west-2a
EC2 Instance ID: i-0882894eb6aa887ac
Message: [SQL] backup-25-09-2018 ERROR 2018-09-25 22:52:02
This ERROR alert indicates that your system’s disk is experiencing hardware failures. Contact Clustrix Support for suggestions.
Severity: ERROR
Date: 2018-09-09 13:18:25.769801 UTC
Host: clxdb001
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
Message: Error reading 32768 bytes at offset 0x1d7367d0000 of "/data/clustrix/device1": Input/output error
Alert severity levels are numbered as follows:
Severity | Level |
---|---|
0 | Critical |
1 | Error |
2 | Warning |
3 | Informational |
ClustrixDB monitors the conditions listed below and issues an alert when one occurs. These alerts are predefined within the database (system.alerts_messages) and cannot be changed. Their severity ranges from critical to informational.
Name | Summary | Message |
---|---|---|
ACTIVATION_FAILED | Activation Failed | Activation of device &device1 failed |
AUTO_RESIZE_FAILED | Failed to automatically resize devices | Not enough room to extend device: node &node_id vdev &number only has &number bytes free, maximum resize is &number |
DATABASE_SPACE_CRITICAL | Database space critical | Database space is &percent used. User queries will fail, and soon system queries will fail. |
DATABASE_SPACE_EXHAUSTED | Database space exhausted | Database space is &percent used. User queries and system queries will now fail. |
DATABASE_SPACE_EXTREME | Database space extreme | Database space is &percent used. User queries will now fail. |
DATABASE_SPACE_LOW | Database space low | Database space is &percent used. Soon user queries will fail. |
DBSTART_SPACE_PAUSE | Pausing dbstart due to space exhaustion | No space left for system transactions; not resulting continuation, awaiting cp command |
DDL_TOO_LONG | DDL lock has been held for too long | The DDL lock has been held for too long. While it is held, all new DDL transactions will block. |
DEVICE_DEACTIVATED | Device Deactivated | Deactivating device &device1 |
DM_READ_ERROR | Device Manager Read Error | Error reading &bytes bytes at offset &offset |
EXCESSIVE_CLOCK_SKEW | Excessive Clock Skew | Clock skew from nid &node_id to &node_id is &seconds seconds. Is NTP set up and working? |
HOST_FILE_ERROR | Error writing host files | &error |
INACCESSIBLE_TABLES | Inaccessible Tables | The following is/are not fully accessible in this cluster: &table_name, &table_name... |
INSUFFICIENT_REPROTECT_MEMORY | Insufficient memory for reprotection | Not enough memory to reprotect if another node is lost: &percent memory table usage (without softfailed nodes) is greater than max &percent |
INSUFFICIENT_REPROTECT_NODES | Insufficient nodes for reprotection | Not enough nodes to reprotect if another node is lost |
INSUFFICIENT_REPROTECT_SPACE | Insufficient space for reprotection | Not enough space to reprotect if another node is lost: &percent usage (without softfailed nodes) is greater than max &percent |
LICENSE_INVALID | License is invalid | Invalid license installed |
LICENSE_NEAR_EXPIRATION | License is nearing expiration | License will expire at: (&expiration) |
LOST_QUORUM | Lost Quorum | Node &node_id lost quorum for group &group_id |
MEMORY_TABLE_SPACE_CRITICAL | Memory table space critical | Memory table space is &percent used. User queries will fail, and soon system queries will fail. |
MEMORY_TABLE_SPACE_EXHAUSTED | Memory table space exhausted | Memory table space is &percent used. User queries will now fail. |
MEMORY_TABLE_SPACE_EXTREME | Memory table space extreme | Memory table space is &percent used. User queries will now fail. |
MEMORY_TABLE_SPACE_LOW | Memory table space low | Memory table space is &percent used. Soon user queries may fail. |
NEW_GROUP | New Group | Node &node_id has new group &group_id |
PARTIAL_WRITE_RECOVERED | Partial write recovered | A partial write was detected and recovered. Some space will be unusable unless the node is softfailed, reformatted, and re-added. No immediate action is necessary. |
PROTECTION_LOST | Protection Lost | Full protection lost for some data; queueing writes for down node; reprotection will begin in &seconds seconds if node has not recovered |
PROTECTION_RESTORED | Protection Restored | Full protection restored for all data after &seconds seconds |
SLAVE_RESTART | Slave Restart | Restarting mysqlslave &slave_name |
SLAVE_STOP | Slave Stopped | Stopped mysqlslave &slave_name on non-transient error: &Error |
USER | User Invoked From SQL | &SQL_error |
ZONES_UNSPECIFIED | Node zone unspecified | Zones are configured for some, but not all nodes in this cluster. A zone must be specified for node &node_id |
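The same definitions can be read directly from the database. The query below is a minimal sketch: the `summary` and `message` columns are referenced elsewhere in this document, while the `name` column is an assumption based on the table heading above.

```sql
-- Assumption: the alert identifier is stored in a column called "name".
SELECT name, summary, message
  FROM system.alerts_messages
 WHERE name LIKE 'DATABASE_SPACE%'
 ORDER BY name;
```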
These additional entries from the system.alerts_parameters table are pre-configured and shown here for information only.
Some of these parameters include “meta tags” to denote that metadata contents will be substituted in the alert content when that parameter is used. The meta tags are explained in the next section.
Parameter Name | Value |
---|---|
body_max_chars | 50000 |
email_body | Severity: ${severity} Date: ${date} ${tz} Host: ${host} Cluster: ${cluster_name} Version: ${version} OS Version: ${OS_version} Message: ${message} |
email_encoding | quoted-printable |
email_subject | ${alerts_name} [${severity}] ${summary} |
smtp_sender | ${alerts_name} CLX Log Alert |
subject_max_chars | 100 |
The alert parameters sometimes contain metadata that is identified by “meta tags”. These meta tags cause real-time information to be substituted within a generated alert.
The following chart shows how each meta tag will be resolved whenever it is used.
Parameter (meta tag) | Description |
---|---|
{alerts_name} | Concatenation of cluster name and customer name. |
{cluster_name} | Name for the cluster from the global “cluster_name”. |
{customer_name} | Name of the customer as identified in the global “customer_name”. |
{date} | The system’s current_timestamp. |
{group} | ID of the current cluster group. |
{host} | Name of host sending the alert. |
{message} | Text of the error message from system.alerts_messages.message |
{OS_version} | Operating system version. |
{severity} | Severity level of the alert, as follows: 0 - CRITICAL, 1 - ERROR, 2 - WARNING, 3 - INFO |
{summary} | Short form of the error message from system.alerts_messages.summary |
{tz} | System time zone from global variable "system_time_zone". |
{version} | Software version from global variable "version". |
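To make the substitution concrete, the query below mimics it with nested `REPLACE()` calls. This is an illustration only, not how the Alerter is implemented: the template is a shortened form of the email_subject parameter, and the substituted values are sample data taken from the DATABASE_SPACE_LOW alert.

```sql
-- Illustration only: resolve two meta tags in a sample template using plain string replacement.
SELECT REPLACE(
         REPLACE('[${severity}] ${summary}', '${severity}', 'WARNING'),
         '${summary}', 'Database space low'
       ) AS resolved_subject;
-- resolved_subject: [WARNING] Database space low
```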