
This page describes how ClustrixDB architecture was designed for Consistency, Fault Tolerance, and Availability.

See also the System Administration Guide on Fault Tolerance and High Availability

Consistency

Many distributed databases have embraced eventual consistency over strong consistency to achieve scalability. However, this tradeoff comes with the added burden of requiring the application to deal with anomalies that may arise with inconsistent data. Eventual consistency comes with a cost of increased complexity for the application developer.

ClustrixDB provides a consistency model that can scale using a combination of intelligent data distribution, multi-version concurrency control (MVCC), and Paxos. Our approach enables ClustrixDB to scale writes, scale reads in the presence of write workloads, and provide strong ACID semantics.

For an in-depth explanation of how ClustrixDB scales reads and writes, see Concurrency Control.

ClustrixDB takes the following approach to consistency:

  • Synchronous replication within the cluster. All nodes participating in a write must provide an acknowledgment before a write is complete. Writes are performed in parallel.
  • The Paxos protocol is used for distributed transaction resolution.
  • ClustrixDB supports the Read Committed and Repeatable Read (Snapshot) isolation levels, with limited support for Serializable (see the example following this list).
  • Multi-Version Concurrency Control (MVCC) allows for lockless reads and ensures that writes will not block reads.
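
As a minimal sketch of choosing an isolation level, the following uses standard MySQL-compatible syntax; the table and column names are hypothetical and shown only for illustration.

  -- Select the isolation level for this session (MySQL-compatible syntax).
  SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

  START TRANSACTION;
  -- Under MVCC, this read works from a consistent snapshot and does not block writers.
  SELECT balance FROM accounts WHERE id = 42;
  UPDATE accounts SET balance = balance - 10 WHERE id = 42;
  COMMIT;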

Fault Tolerance

ClustrixDB provides fault tolerance by maintaining multiple copies of data across the cluster. The degree of fault tolerance depends on:

  • the specified number of copies (with a minimum of REPLICAS=2)
  • the value set for MAX_FAILURES, which is the number of nodes that can fail simultaneously without data loss

In order to fully understand how ClustrixDB handles fault tolerance, you must first familiarize yourself with our data distribution model, in which data is divided into logical chunks called slices and distributed across the cluster. 

The following sections cover the possible failure cases and explain how ClustrixDB handles each situation. For this series of examples we will assume the following:

  • A 5-node cluster with separate front end and back end networks.
  • A total of five slices of data with two replicas each. The primary replica is labeled with a letter (e.g. A), while the secondary replica is labeled with an apostrophe (e.g. A'). 

When a Node becomes Unavailable

A node can become unavailable for a brief period of time, for example, during a kernel panic or some sort of power event. In those cases, the node will become available to the cluster (return to quorum) much more quickly than it would take to reprotect the data from the lost node. To handle this optimally, ClustrixDB maintains a log of changes for the data on the unavailable node that is replayed when the node re-joins the cluster.

When a node goes away, the system waits 10 minutes before beginning reprotect actions, allowing the node time to come back online. This default interval can be changed by modifying the global variable rebalancer_reprotect_queue_interval_s.
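
For example, assuming ClustrixDB accepts the standard MySQL-style SET GLOBAL syntax for this variable, the delay could be raised from the default 10 minutes to 20 minutes (the value is in seconds); treat this as a sketch and confirm against the documentation for your version.

  -- Wait 20 minutes (1200 seconds) before beginning reprotect actions.
  SET GLOBAL rebalancer_reprotect_queue_interval_s = 1200;

  -- Confirm the current setting.
  SHOW GLOBAL VARIABLES LIKE 'rebalancer_reprotect_queue_interval_s';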

Nodes exchange heartbeats every second to verify that their peers are online. If a node stops responding, the cluster forms a new group without the unavailable node and the rebalancer_reprotect_queue_interval_s timer begins. Once that timer expires, the cluster begins reprotecting around the failed node. For more on how to monitor the reprotect process, see Managing the Rebalancer.

Once the cluster has fully reprotected the data that was on the failed node, a Database Alert will be sent indicating Full Protection Restored. After that, the cluster can safely accommodate another node failure.

Effects of a node loss on queries in process

To an application, a node loss has the following effect: since every slice has two or more copies, each replica of a slice is written during every write, and the ranked replica is read during reads. If a replica involved in a write is located on a failed node, the query must be retried. If the ranked replica is located on a failed node, one of the remaining replicas becomes the ranked replica until a new replica is created; a query that targeted the failed replica must be retried. For single-statement transactions, an automatic internal retry mechanism handles these cases transparently to the client. During a node failure, the ensuing group change requires that all queries in process be retried.

Permanent Loss of a Node

In this example, Node #2 becomes unavailable due to a permanent hardware failure. Data on the failed node can no longer be accessed by the rest of the cluster; however, other nodes within the system have copies of that data. The database automatically forms a group with the remaining nodes, and in the background the Rebalancer detects that some data is underprotected and works to create sufficient copies of the data. The diagram below demonstrates how ClustrixDB recovers from the node failure.

  • Multiple nodes participate in the recovery of a single failed node. 
    • In ClustrixDB data reprotection is a many-to-many operation. Clusters with more nodes can recover faster from a node failure.
  • Once the cluster makes new copies of slices A and C, the system is again fully protected with two replicas of each slice.
    • The cluster can now handle another failure without having to replace the failed node.

ClustrixDB handles a complete node failure. The failed state (left) and recovered state (right).

Note that once the Rebalancer has completed the reprotect process, there are now two copies of A and C. During this process, the database remains online.

Configuring Additional Replicas

By default, every table is created with REPLICAS = 2, which ensures that a single node can be lost permanently without any data loss. By setting the MAX_FAILURES level for the cluster and configuring additional replicas for tables, you can ensure that a cluster can sustain multiple node failures while remaining available with all of its data. See more on how to set MAX_FAILURES.
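
As a hedged sketch (the exact REPLICAS and MAX_FAILURES syntax may vary by version; see the MAX_FAILURES documentation referenced above), configuring three replicas for a table and a cluster-wide tolerance of two simultaneous failures might look like the following. The table itself is hypothetical.

  -- Keep three copies of every slice of this illustrative table.
  CREATE TABLE orders (
      id    BIGINT PRIMARY KEY,
      total DECIMAL(10,2)
  ) REPLICAS = 3;

  -- Allow the cluster to tolerate two simultaneous node failures.
  ALTER CLUSTER SET MAX_FAILURES = 2;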

Temporary Unavailability of a Node

A node can become unavailable for a brief period of time, for example, during a kernel panic or some sort of power event. In those cases, the node will become available to the cluster (return to quorum) much more quickly than it would take to reprotect the data from the lost node. ClustrixDB provides special handling for this scenario by setting up a log of changes for the data on the unavailable node that is re-played when the node re-joins the cluster.

In the example below, node #4 temporarily leaves the cluster. Nodes #2 and #5 set up logs (called Queues) to track changes to data on node #4. Once node #4 returns to the cluster, it re-plays the logged changes. All of these operations are handled automatically by the database.

ClustrixDB handles a transient node failure. The failed state (left) and recovered state (right).

Node 4 was unavailable for a brief period of time and re-joins the cluster. Before Node 4 comes fully online, transactions affecting data on its node are played back.

 

Front End Network Failure

ClustrixDB recommends configuring a redundant front end network, which greatly reduces the chance of a complete front end network failure. In the event that such a failure does occur, the failed node will continue to participate in query resolution. However, it will not be able to accept incoming connections from the application.

Back End Network Failure

ClustrixDB does not distinguish between a back end network failure and a node failure. If a node cannot communicate with its peers on the back end network it will fail a heartbeat check and the system will consider the node offline. 

Availability

In order to understand ClustrixDB's availability modes and failure cases, it is necessary to understand our group membership protocol.

Group Membership and Quorum

ClustrixDB uses a distributed group membership protocol. The protocol maintains two fundamental sets:

  1. The static set of all nodes known to the cluster.
  2. The set of nodes that can currently communicate with each other.

The cluster cannot form unless more than half the nodes in the static membership are able to communicate with each other. 

For example, if a six-node cluster experiences a network partition resulting in two sets of three nodes, ClustrixDB will be unable to form a cluster.

However, if more than half the nodes are able to communicate, ClustrixDB will form a cluster with the majority set.

Due to the quorum requirement, a majority of nodes (at least N/2 + 1) must remain available to form a cluster, which limits the maximum value that MAX_FAILURES can take.
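
As a worked sketch of that arithmetic (assuming quorum is a strict majority of the static membership):

  quorum(N)       = floor(N/2) + 1     -- minimum nodes that must be able to communicate
  max failures(N) = N - quorum(N)      -- nodes that can be lost while keeping quorum

  N = 5:  quorum = 3, so at most 2 nodes can be lost
  N = 6:  quorum = 4, so at most 2 nodes can be lost (a 3/3 split cannot form a cluster)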

Partial Availability

In the above example, ClustrixDB formed a cluster because it could form a quorum. However, such a cluster could offer only partial availability because the cluster may not have access to the complete data set.

In the following example, ClustrixDB was configured to maintain two replicas. However, both nodes holding replicas for A are unable to participate in the cluster (due to some failure). When a transaction attempts to access data on slice A, the database will generate an error that will surface to the application.

Availability Requirements

ClustrixDB can provide availability even in the face of failure. In order to provide full availability, ClustrixDB requires that:

  • A majority of nodes are able to form a cluster (i.e. quorum requirement).
  • The available nodes hold at least one replica for each set of data.