ClustrixDB provides fault tolerance by maintaining multiple copies of data throughout the cluster. This allows a cluster to lose one or more nodes (or zones) without data loss and to automatically resume operations.

Built-in Fault Tolerance

By default, ClustrixDB is configured to accommodate a single node failure and automatically maintain 2 copies (replicas) of all data. As long as the cluster has sufficient replicas and a quorum of nodes is available, a cluster can lose a node without experiencing any data loss. Clusters with zones configured can lose a single zone. 
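For example, the number of replicas can be set per table when the table is created. The REPLICAS table option shown below is an assumption based on ClustrixDB's MySQL-compatible syntax; verify the exact option name against your version's CREATE TABLE documentation.

```sql
-- Hypothetical sketch: create a table with 3 replicas instead of the
-- default 2, so its data can survive the simultaneous loss of two nodes.
-- The REPLICAS table option is an assumption, not confirmed syntax.
CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT,
    total DECIMAL(10,2)
) REPLICAS = 3;
```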

The default settings for fault tolerance are generally acceptable for most clusters. 

Deploying Across Zones

ClustrixDB can be configured to be zone aware so that replicas (and acceptors) are placed across different zones (AWS Availability Zones within the same Region, different server racks, different network switches, or different power sources). When zones are configured, a cluster can lose an entire zone and automatically recover without loss of data. See Zones for more information. 
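As a sketch, zone membership is declared per node. The ALTER CLUSTER ... ZONE form below is an assumption modeled on ClustrixDB's cluster-management syntax; see Zones for the authoritative commands.

```sql
-- Hypothetical sketch, assuming an ALTER CLUSTER ... ZONE form:
-- assign nodes 1 and 2 to different zones so that replicas of each
-- slice are placed across zones rather than within one.
ALTER CLUSTER 1 ZONE 1;
ALTER CLUSTER 2 ZONE 2;
```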

Using Replication

Setting up a disaster recovery (DR) site for your ClustrixDB cluster allows you to recover from catastrophic failures. A secondary ClustrixDB cluster configured for DR also allows for an easier transition to new releases. For information regarding the various replication configurations supported by ClustrixDB, see Configuring Replication.
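Because ClustrixDB is MySQL-compatible, a DR replica is typically pointed at the primary with MySQL-style replication statements. The statements below are an illustrative sketch only (host, credentials, and binlog coordinates are placeholders); consult Configuring Replication for the exact ClustrixDB syntax.

```sql
-- Illustrative MySQL-style sketch; all values are placeholders.
-- Run on the DR cluster to begin replicating from the primary.
CHANGE MASTER TO
    MASTER_HOST = 'primary.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'repl_password',
    MASTER_LOG_FILE = 'binlog.000001',
    MASTER_LOG_POS = 4;
START SLAVE;
```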


Configuring Additional Fault Tolerance

ClustrixDB can be configured to survive more than one node (or zone) failure by changing the value of MAX_FAILURES, ensuring that all tables have additional replicas (and sufficient disk space), and that the cluster has a sufficient number of nodes (and zones). See ALTER CLUSTER SET MAX_FAILURES for more information. 
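For example, to configure the cluster to tolerate two simultaneous failures (the statement name comes from ALTER CLUSTER SET MAX_FAILURES; the replica math in the comment follows from maintaining one more copy than the number of failures to survive):

```sql
-- Allow the cluster to lose up to 2 nodes (or zones) without data loss.
-- Requires MAX_FAILURES + 1 = 3 replicas of every slice, a sufficient
-- number of nodes (and zones), and enough disk space for the extra copies.
ALTER CLUSTER SET MAX_FAILURES = 2;
```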

Maintaining additional replicas incurs a performance overhead, so increasing MAX_FAILURES increases latency for your cluster.

What Happens When a Node or Zone Fails?

When ClustrixDB experiences a node or zone failure, the nodes that remain in quorum form a new group (a group change) and the cluster automatically resumes operations.

What Happens to Processes That Were Running?

Processes that were running when the heartbeat check failed will be impacted as shown:

Queries (DML and DDL)
If the global autoretry is true and a transaction is interrupted by a group change or encounters a retriable error, the database automatically retries some in-process transactions. Only transactions that were submitted with autocommit = 1, or the first statement of an explicit transaction, are retried. Stored procedure and function calls are never retried. If a retried statement does not execute successfully, the application receives an error.
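The retry behavior is governed by the autoretry global named above. The statements below are a sketch assuming MySQL-style variable syntax applies to this global:

```sql
-- Check whether automatic retry is enabled, then enable it.
-- Assumes MySQL-style SHOW VARIABLES / SET GLOBAL syntax.
SHOW VARIABLES LIKE 'autoretry';
SET GLOBAL autoretry = true;
```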

ClustrixDB reads data exclusively from ranked replicas. If the failed node(s) contained the ranked replica for a slice, ClustrixDB assigns that role to another replica of that slice elsewhere in the cluster.

Replication
Replication processes will automatically restart at the proper binlog position following the group change.

Other Connections
Connections to nodes that are still in quorum will be reestablished and a new group will be formed with the available nodes.

Connections to non-communicative nodes will be lost.