Comment: Published by Scroll Versions from space ML1 and version 5.3

Xpand constantly self-monitors to ensure your cluster is healthy and operating optimally. When it detects conditions that require attention, Xpand will send alerts via email using its Alerter. Alerts are of different severities (INFO, WARNING, ERROR, and CRITICAL) and Xpand is preconfigured with default thresholds for each.

The contacts and communication details that control how alerts are sent must be configured for your cluster.


Configuring Alerts

Use the following steps to configure the alerts for your system:

Step 1. Set Identifying Global Variables

Set these identifying global variables for your database. These are especially important to aid Xpand Support in troubleshooting.

sql> SET GLOBAL customer_name = 'customer name';
sql> SET GLOBAL cluster_name = 'cluster identifier';
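
To confirm that the values took effect, you can read the variables back. This is a minimal sketch that assumes your Xpand version accepts the MySQL-style SHOW GLOBAL VARIABLES statement; it is not part of the documented steps.

```sql
-- Read back the identifying variables set in Step 1.
-- Assumption: MySQL-compatible SHOW syntax is available on your cluster.
SHOW GLOBAL VARIABLES LIKE 'customer_name';
SHOW GLOBAL VARIABLES LIKE 'cluster_name';
```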

Step 2. Configure alerts_parameters for SMTP Server 

The parameters defined in the system.alerts_parameters table control how alerts are formatted and sent. 

Tip

Xpand requires an SMTP server to send the alert messages. These instructions presume that an SMTP server has already been set up correctly for your environment. For specifics on establishing an SMTP server in AWS, see Setting up an SMTP Server.

Set the following SMTP parameters as they apply to your cluster.

Parameter Name | What's Needed? | Required?
smtp_server | Hostname for SMTP server | Yes
smtp_port | SMTP port for your environment, if different from the default TCP port 25. | Yes
smtp_username | SMTP username | No
smtp_password | SMTP password | No
smtp_security | SMTP security type. Must be SMTPS or TLS. | No

Follow this syntax to update the parameters shown:  

sql> UPDATE  system.alerts_parameters
     SET     value = 'your smtp-specific value'
     WHERE   name = 'parameter name';
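
As an illustration, the statements below fill in that template for the two required parameters. The hostname smtp.example.com and port 587 are hypothetical placeholders, not defaults; substitute the values for your own SMTP server.

```sql
-- Hypothetical values: replace with your SMTP server's details.
UPDATE system.alerts_parameters
   SET value = 'smtp.example.com'
 WHERE name = 'smtp_server';

UPDATE system.alerts_parameters
   SET value = '587'
 WHERE name = 'smtp_port';

-- Review the SMTP-related rows after the update.
SELECT name, value
  FROM system.alerts_parameters
 WHERE name LIKE 'smtp%';
```

As described in Step 4, these changes take effect only after the Alerter is reset.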
        

Step 3. Configure alerts_subscriptions

Add email addresses of the individual(s) or group(s) who are to receive the alerts to the system.alerts_subscriptions table. You can insert, update, and delete from this table using standard SQL.

To see the current list of alert subscriptions:

sql> SELECT * FROM system.alerts_subscriptions;

To add a new email address:

sql> INSERT INTO system.alerts_subscriptions VALUES ('[email protected]_name.com');
        

Step 4. RESET Alerter

Tip

Any time that changes are made to the system.alerts_parameters or system.alerts_subscriptions table(s), the Alerter must be RESET. Your changes will not take effect until this is done.

To reset the Alerter:

sql> ALTER CLUSTER RESET ALERTER;

This will not cause a group change on your cluster.

If invalid information is provided, you may encounter the following error:

sql> ALTER CLUSTER RESET ALERTER; 
ERROR 1 (HY000): [64512] Bad configuration for alerts:

Check clustrix.log for more information. Here is an example where the smtp_server parameter was not specified:

2018-10-11 21:07:51.068524 UTC karma068.colo.sproutsys.com clxnode: ERROR cluster/alerter.ct:219 prepare_write(): Couldn't write alerter config: Bad configuration for alerts: No smtp_server specified

Step 5. Request Alert


To verify that the configuration works properly, execute this SQL to send a test alert:

sql> SELECT alert(severity, 'alert text');
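
For example, the following call uses severity code 3 (INFO, per the Alert Severity Codes listed under Additional Information). The alert text is arbitrary and chosen here only for illustration.

```sql
-- Send an informational (severity 3) test alert to the configured subscribers.
SELECT alert(3, 'Testing alert configuration');
```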

If you do not receive the expected email alert, review your configuration again, starting with Step 1.

Sample Emailed Alerts

The following sample alert emails resemble those you may receive from your cluster. These alerts also appear in the query.log.

Sample 1: Database Space WARNING

This alert is a WARNING for a cluster with a device1 file that is at least 80% full. If you receive a similar warning, see “Issue Resolution” in Managing File Space and Database Capacity.

Severity: WARNING
Date: 2018-10-02 18:49:24.177250 UTC
Host: clxdb003
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
Message: Database space is 80% used. Soon user queries will fail. path=/data/clustrix/device1 device_total=4,247,830,372,352 wal_total=1,073,741,824 device_free=327,733,190,656 temp_total_space=161,061,273,600 system_avail=758,480,666,624 system_total=3,757,962,166,272 total_used=2,999,481,499,648 %=80 user_avail=382,684,449,996 user_total=3,382,165,949,644 cont_type=USER trx_type=USER

Sample 2: Backup INFO

This INFO alert shows that the backup has failed. If you receive similar errors during backup processing, please see List of Errors for Backup and Restore. This particular sample shows additional information that is available from clusters deployed in AWS.

Severity: INFO
Date: 2018-09-25 23:42:59.798249 UTC
Host: clxdb005
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
EC2 Region: us-west-2a
EC2 Instance ID: i-0882894eb6aa887ac
Message: [SQL] backup-25-09-2018 ERROR 2018-09-25 22:52:02

Sample 3: Read ERROR

This ERROR alert indicates that your system’s disk is experiencing hardware failures. Contact Xpand Support for suggestions.

Severity: ERROR
Date: 2018-09-09 13:18:25.769801 UTC
Host: clxdb001
Cluster: Dogfood7
Version: clustrix-9.1.3
OS Version: CentOS Linux release 7.4.1708 (Core)
Message: Error reading 32768 bytes at offset 0x1d7367d0000 of "/data/clustrix/device1": Input/output error

Additional Information

Alert Severity Codes 

0 - Critical
1 - Error
2 - Warning
3 - Informational

Alerting Conditions

These are the conditions that Xpand monitors and for which alerts are issued. These alerts are predefined within the database (system.alerts_messages) and may not be changed. The severity of these alerts ranges from critical to simply informational.

Name | Summary | Message
ACTIVATION_FAILED | Activation Failed | Activation of device &device1 failed
AUTO_RESIZE_FAILED | Failed to automatically resize devices | Not enough room to extend device: node &node_id vdev &number only has &number bytes free, maximum resize is &number
DATABASE_SPACE_CRITICAL | Database space critical | Database space is &percent used. User queries will fail, and soon system queries will fail.
DATABASE_SPACE_EXHAUSTED | Database space exhausted | Database space is &percent used. User queries and system queries will now fail.
DATABASE_SPACE_EXTREME | Database space extreme | Database space is &percent used. User queries will now fail.
DATABASE_SPACE_LOW | Database space low | Database space is &percent used. Soon user queries will fail.
DBSTART_SPACE_PAUSE | Pausing dbstart due to space exhaustion | No space left for system transactions; not resulting continuation, awaiting cp command
DDL_TOO_LONG | DDL lock has been held for too long | The DDL lock has been held for too long. While it is held, all new DDL transactions will block.
DEVICE_DEACTIVATED | Device Deactivated | Deactivating device &device1
DM_READ_ERROR | Device Manager Read Error | Error reading &bytes bytes at offset &offset
EXCESSIVE_CLOCK_SKEW | Excessive Clock Skew | Clock skew from nid &node_id to &node_id is &seconds seconds. Is NTP set up and working?
HOST_FILE_ERROR | Error writing host files | &error
INACCESSIBLE_TABLES | Inaccessible Tables | The following is/are not fully accessible in this cluster: &table_name, &table_name...
INSUFFICIENT_REPROTECT_MEMORY | Insufficient memory for reprotection | Not enough memory to reprotect if another node is lost: &percent memory table usage (without softfailed nodes) is greater than max &percent
INSUFFICIENT_REPROTECT_NODES | Insufficient nodes for reprotection | Not enough nodes to reprotect if another node is lost
INSUFFICIENT_REPROTECT_SPACE | Insufficient space for reprotection | Not enough space to reprotect if another node is lost: &percent usage (without softfailed nodes) is greater than max &percent
LICENSE_INVALID | License is invalid | Invalid license installed
LICENSE_NEAR_EXPIRATION | License is nearing expiration | License will expire at: (&expiration)
LOST_QUORUM | Lost Quorum | Node &node_id lost quorum for group &group_id
MEMORY_TABLE_SPACE_CRITICAL | Memory table space critical | Memory table space is &percent used. User queries will fail, and soon system queries will fail.
MEMORY_TABLE_SPACE_EXHAUSTED | Memory table space exhausted | Memory table space is &percent used. User queries will now fail.
MEMORY_TABLE_SPACE_EXTREME | Memory table space extreme | Memory table space is &percent used. User queries will now fail.
MEMORY_TABLE_SPACE_LOW | Memory table space low | Memory table space is &percent used. Soon user queries may fail.
NEW_GROUP | New Group | Node &node_id has new group &group_id
PARTIAL_WRITE_RECOVERED | Partial write recovered | A partial write was detected and recovered. Some space will be unusable unless the node is softfailed, reformatted, and re-added. No immediate action is necessary.
PROTECTION_LOST | Protection Lost | Full protection lost for some data; queueing writes for down node; reprotection will begin in &seconds seconds if node has not recovered
PROTECTION_RESTORED | Protection Restored | Full protection restored for all data after &seconds seconds
SLAVE_RESTART | Slave Restart | Restarting mysqlslave &slave_name
SLAVE_STOP | Slave Stopped | Stopped mysqlslave &slave_name on non-transient error: &Error
USER | User Invoked From SQL | &SQL_error
ZONES_UNSPECIFIED | Node zone unspecified | Zones are configured for some, but not all nodes in this cluster. A zone must be specified for node &node_id
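
To review these definitions on your own cluster, you can query the table directly. This sketch relies only on the summary and message columns that this page attributes to system.alerts_messages in its meta tag descriptions; no other column names are assumed.

```sql
-- List the predefined alert summaries and message templates (read-only).
SELECT summary, message
  FROM system.alerts_messages;
```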

Preconfigured alerts_parameters

These additional entries from the system.alerts_parameters table are pre-configured and shown here for information only.

Some of these parameters include “meta tags” to denote that metadata contents will be substituted in the alert content when that parameter is used. The meta tags are explained in the next section.

parameter_name | Value
body_max_chars | 50000
email_body     | Severity: ${severity}
               | Date: ${date} ${tz}
               | Host: ${host}
               | Cluster: ${cluster_name}
               | Version: ${version}
               | OS Version: ${OS_version}
               | Message: ${message}
email_encoding | quoted-printable
email_subject  | ${alerts_name} [${severity}] ${summary}
smtp_sender    | ${alerts_name} CLX Log Alert
subject_max_chars | 100
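
To view the current values on your cluster, a query along these lines should work. It assumes only the name and value columns used by the UPDATE syntax in Step 2, and it reads the rows without modifying them.

```sql
-- Inspect the preconfigured formatting parameters (read-only).
SELECT name, value
  FROM system.alerts_parameters
 WHERE name IN ('body_max_chars', 'email_body', 'email_encoding',
                'email_subject', 'smtp_sender', 'subject_max_chars');
```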

Metadata used in alerts_parameters  

The alert parameters sometimes contain metadata that is identified by “meta tags”. These meta tags cause real-time information to be substituted within a generated alert.

The following chart shows how each meta tag will be resolved whenever it is used.

Parameter (meta tag) | Description
{alerts_name} | Concatenation of cluster name and customer name.
{cluster_name} | Name for the cluster from the global “cluster_name”.
{customer_name} | Name of the customer as identified in the global “customer_name”.
{date} | The system’s current_timestamp.
{group} | ID of the current cluster group.
{host} | Name of host sending the alert.
{message} | Text of the error message from system.alerts_messages.message.
{OS_version} | Operating system version.
{severity} | Severity level of the alert as follows: 0 - CRITICAL, 1 - ERROR, 2 - WARNING, 3 - INFO.
{summary} | Short form of the error message from system.alerts_messages.summary.
{tz} | System time zone from global variable "system_time_zone".
{version} | Software version from global variable "version".


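ClustrixDB continuously monitors itself to confirm that your cluster is healthy and operating optimally. When it detects a condition that requires attention, ClustrixDB sends a notification by email using its Alerter. Alerts have different severities (INFO, WARNING, ERROR, and CRITICAL), and ClustrixDB is preconfigured with default thresholds for each.

The contacts and communication details that control how alerts are sent must be configured for your cluster.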


Configuring Alerts

Each cluster must be configured to control how, and to whom, ClustrixDB sends alerts. Use the following steps to configure alerts.

Tip

These steps are required for ClustrixDB to send email alerts correctly.

Step 1. Set Identifying Global Variables

Set the identifying global variables for your database. These variables are especially important to aid Clustrix Support in troubleshooting.

sql> SET GLOBAL customer_name = 'customer name';
sql> SET GLOBAL cluster_name = 'cluster identifier';

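Step 2. Configure alerts_parameters for SMTP Server

The parameters defined in the system.alerts_parameters table control how alerts are formatted and sent.

Tip

ClustrixDB requires an SMTP server to send alert messages. These steps assume that an SMTP server has already been configured correctly.

Set the following SMTP parameters as they apply to your cluster. Clustrix Support can assist if needed.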

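Parameter Name | What's Needed? | Required?
smtp_server | Identifying information for the SMTP server. |
smtp_port | SMTP port for your environment, if different from the default TCP port 25. |
smtp_username | Username for the SMTP server. | No
smtp_password | Password for the SMTP server. | No
smtp_security | Security code for the SMTP server. | No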

Use the following syntax to update the parameters listed above:

UPDATE  system.alerts_parameters
  SET   value = 'your smtp-specific value'
  WHERE name = 'parameter name';

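Step 3. Configure alerts_subscriptions

Add the email addresses of the individual or group recipients who are to receive alerts to the system.alerts_subscriptions table. You can insert, update, and delete from this table using standard SQL commands.

Use the following SQL to view the current contents of system.alerts_subscriptions: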

sql> SELECT * FROM system.alerts_subscriptions;

Add a new email address as in the following example:

sql> INSERT INTO system.alerts_subscriptions VALUES ('[email protected]_name.com');

Step 4. Reset the Alerter

Tip

Any time the system.alerts_parameters or system.alerts_subscriptions table is changed, the Alerter must be reset.

Your changes will not take effect until the reset is complete.

To reset the Alerter, execute the following SQL:

sql> ALTER CLUSTER RESET ALERTER;

Unlike other ALTER CLUSTER commands, this command does not cause a group change on your cluster.

If invalid information is provided, you may encounter the following error:

sql> ALTER CLUSTER RESET ALERTER;
ERROR 1 (HY000): [64512] Bad configuration for alerts:

Check clustrix.log for more information. The following is an example of the error logged when the smtp_server parameter was not specified:

2016-10-11 21:07:51.068524 UTC karma068.colo.sproutsys.com clxnode: ERROR cluster/alerter.ct:219 prepare_write(): Couldn't write alerter config: Bad configuration for alerts: No smtp_server specified

Step 5. Test Alerts

To verify that the configuration works properly, send a test alert using the following syntax:

SELECT alert(severity, 'alert text')

This example uses severity code 3, which is the "INFO" level. The alert text can be anything you choose.

sql> SELECT alert(3,'Testing alert configuration');
     +----------------------------------------+
     | alert(3,'Testing alert configuration') |
     +----------------------------------------+
     |                                      0 |
     +----------------------------------------+
     1 row in set (0.00 sec)

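This SQL statement tests the cluster's configuration by sending an informational alert. If you do not receive the alert email, the configuration is incorrect. Review the configuration starting from Step 1.

Sample Emailed Alerts

The following are some examples of email alerts that may occur on your cluster. These alert messages also appear in query.log.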

Sample 1: Database Space WARNING

This alert is a WARNING for a cluster with a device1 file that is at least 80% full. If you receive a similar warning, see "Issue Resolution" in Managing File Space and Database Capacity.

Severity: WARNING
Date: 2016-10-02 18:49:24.177250 UTC
Host: clxdb003
HWID: b8:ca:3a:6b:7b:d0
Cluster: Clustrix-Dogfood
Version: 5.0.45-clustrix-7.5.1
Image Version: CentOS release 6.7 (Final)
Message: Database space is 80% used. Soon user queries will fail. path=/data/clustrix/device1 device_total=4,247,830,372,352 wal_total=1,073,741,824 device_free=327,733,190,656 temp_total_space=161,061,273,600 system_avail=758,480,666,624 system_total=3,757,962,166,272 total_used=2,999,481,499,648 %=80 user_avail=382,684,449,996 user_total=3,382,165,949,644 cont_type=USER trx_type=USER

Sample 2: Backup INFO

This INFO alert indicates that the backup has failed. If you receive similar error emails during backup processing, see List of Errors for Backup and Restore.

Severity: INFO
Date: 2016-09-25 23:42:59.798249 UTC
Host: clxdb005
HWID: 00:25:90:8e:e3:0a
Cluster: Clustrix-Dogfood
Version: 5.0.45-clustrix-7.5.1
Image Version: CentOS release 6.7 (Final)
Message: [SQL] backup-25-09-2016 ERROR 2016-09-25 22:52:02

Sample 3: Read ERROR

This ERROR alert indicates that a hardware failure has occurred on the system's HD/SSD. Contact Clustrix Support.

Severity: ERROR
Date: 2016-09-09 13:18:25.769801 UTC
Host: clxdb001
HWID: b8:ca:3a:6b:7b:d0
Cluster: Clustrix-Dogfood
Version: 5.0.45-clustrix-7.5.1
Image Version: CentOS release 6.7 (Final)
Message: Error reading 32768 bytes at offset 0x1d7367d0000 of "/data/clustrix/device1": Input/output error

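Additional Information

Alerting Conditions

These are the conditions that ClustrixDB monitors and for which alerts are issued. These alerts are predefined within the database (system.alerts_messages) and cannot be changed. The severity of these alerts ranges from critical to simply informational.

Contact Clustrix Support if you need help resolving an alert.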

Alert | Summary | Message
ACTIVATION_FAILED | Activation Failed | Activation of device &device1 failed
DATABASE_SPACE_CRITICAL | Database space critical | Database space is &percent used. User queries will fail, and soon system queries will fail.
DATABASE_SPACE_EXHAUSTED | Database space exhausted | Database space is &percent used. User queries and system queries will now fail.
DATABASE_SPACE_EXTREME | Database space extreme | Database space is &percent used. User queries will now fail.
DATABASE_SPACE_LOW | Database space low | Database space is &percent used. Soon user queries will fail.
DATABASE_SPACE_OKAY | Database space okay | Database space is &percent used.
DBSTART_SPACE_PAUSE | Pausing dbstart due to space exhaustion | No space left for system transactions; not resulting continuation, awaiting cp command
DDL_TOO_LONG | DDL lock has been held for too long | The DDL lock has been held for too long. While it is held, all new DDL transactions will block.
DEVICE_DEACTIVATED | Device Deactivated | Deactivating device &device1
DM_READ_ERROR | Device Manager Read Error | Error reading &bytes bytes at offset &offset
EXCESSIVE_CLOCK_SKEW | Excessive Clock Skew | Clock skew from nid &node_id to &node_id is &seconds seconds. Is NTP set up and working?
HOST_FILE_ERROR | Error writing host files | &error
INACCESSIBLE_TABLES | Inaccessible Tables | The following is/are not fully accessible in this cluster: &table_name, &table_name...
INSUFFICIENT_REPROTECT_NODES | Insufficient nodes for reprotection | Not enough nodes to reprotect if another node is lost
INSUFFICIENT_REPROTECT_SPACE | Insufficient space for reprotection | Not enough space to reprotect if another node is lost: &percent usage (without softfailed nodes) is greater than max &percent
LICENSE_INVALID | License is invalid | Invalid license installed
LICENSE_NEAR_EXPIRATION | License is nearing expiration | License will expire at: (&expiration)
LOST_QUORUM | Lost Quorum | Node &node_id lost quorum for group &group_id
MONITORED_WAL_SYNC_EXCESSIVE_TIME | Slow sync | Node &node_id is slow to sync (took &synch_miliseconds ms, cluster avg &avg_miliseconds ms, hard threshold &threshold_miliseconds ms)
NEW_GROUP | New Group | Node &node_id has new group &group_id
PROTECTION_LOST | Protection Lost | Full protection lost for some data; queueing writes for down node; reprotection will begin in &seconds seconds if node has not recovered
PROTECTION_RESTORED | Protection Restored | Full protection restored for all data after &seconds seconds
SLAVE_RESTART | Slave Restart | Restarting mysqlslave &slave_name
SLAVE_STOP | Slave Stopped | Stopped mysqlslave &slave_name on non-transient error: &Error
USER | User Invoked From SQL | &SQL_error

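Preconfigured alerts_parameters

These additional entries from the system.alerts_parameters table are preconfigured and shown here for information only.

Some of these parameters include “meta tags” to denote that metadata contents will be substituted in the alert content when that parameter is used. The meta tags are explained in the next section.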

parameter_name | Value
body_max_chars | 50000
email_body     | Severity: ${severity}
               | Date: ${date} ${tz}
               | Host: ${host}
               | HWID: ${hwid}
               | Cluster: ${cluster_name}
               | Version: ${version}
               | Image Version: ${image_version}
               | Message: ${message}
email_encoding | quoted-printable
email_subject  | ${alerts_name} [${severity}] ${summary}
smtp_sender    | ${alerts_name} CLX Log Alert
subject_max_chars | 100

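Metadata used in alerts_parameters

Alert parameters can contain metadata that is identified by “meta tags”. When a meta tag is used, it is replaced with real-time information in the generated alert.

The following chart shows how each meta tag is resolved when it is used.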

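Parameter (meta tag) | Description
{alerts_name} | Combination of cluster name and customer name.
{cluster_name} | Cluster name set in the global variable "cluster_name".
{customer_name} | Customer name identified in the global variable "customer_name".
{date} | The system's current_timestamp.
{group} | ID of the current cluster group.
{host} | Name of the host sending the alert.
{hwid} | Cluster hardware ID.
{image_version} | Operating system version.
{message} | Text of the error message from system.alerts_messages.message.
{severity} | Alert level: 0 - CRITICAL, 1 - ERROR, 2 - WARNING, 3 - INFO.
{summary} | Short form of the error message from system.alerts_messages.summary.
{tz} | System time zone set in the global variable "system_time_zone".
{version} | Software version set in the global variable "version".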