This page describes how ClustrixDB architecture was designed for Consistency, Fault Tolerance, and Availability.
Many distributed databases have embraced eventual consistency over strong consistency because it makes such systems easier to implement for scale. However, such systems place the burden of dealing with anomalies that may arise on the application. In other words, eventual consistency comes at a cost of increased programming model complexity.ClustrixDB provides a consistency model that can scale using a combination of intelligent data distribution, multi-version concurrency control (MVCC), and Paxos. Our approach enables ClustrixDB to scale writes, scale reads in the presence of write workloads, and provide strong ACID semantics.
For an in-depth explanation of how ClustrixDB scales reads and writes, refer to our concurrency model architecture documentation.
In brief ClustrixDB takes the following approach to consistency:
- Synchronous replication within the cluster. All participating nodes must acknowledge writes before the write can complete. Writes performed in parallel.
- Paxos protocols for distributed transaction resolution.
- Support for Read Committed, Repeatable Read (really Snapshot) isolation levels. Limited support for Serializeable.
- MVCC allows for lockless reads; writes will not block reads.
ClustrixDB provides fault tolerance by maintaining multiple copies of data across the cluster. The degree of fault tolerance depends on the specified number of copies (with a minimum of two). In order to fully understand how ClustrixDB handles fault tolerance, you must first familiarize yourself with our data distribution model.
The following sections cover the possible failure cases and explain how ClustrixDB handles each situation. For this series of examples we will assume the following:
- A five-node cluster with separate front end and back end networks.
- A total of five slices of data with two replicas each. The primary replica is labeled with a letter (e.g. A), while the secondary replica is labeled with an apostrophe (e.g. A').
Permanent Loss of a Node
Consider what happens when the cluster loses node #2 to some permanent hardware failure. Data on the failed node can no longer be accessed by the rest of the cluster. However, other nodes within the system have copies of that data.
The diagram below demonstrates how ClustrixDB recovers from the node failure. Because our data distribution model is peer to peer and not master-slave:
- Multiple nodes participate in the recovery of a single failed node. In ClustrixDB data reprotection is a many-to-many operation. This means that clusters with more nodes can recover faster from a node failure.
- Once the cluster makes new copies of slices A and C, we now have a fully protected system with two replicas of each slice. The cluster can now handle another failure without having to replace the failed node.
Figure: Handling a complete node failure. The failed state (left) and recovered state (right).
Temporary Unavailability of a Node
In some circumstances, a node can exit quorum for a brief period of time. A kernel panic or some sort of power event can cause such conditions. In those cases, we expect the node to return to quorum much faster than it would take to reprotect the entire data set. ClustrixDB provides special handling for this failure scenario by setting up a log of changes for the data on the unavailable node that is replayed when the node re-joins the cluster.
In the example below, node #4 temporarily leaves the cluster. Nodes #2 and #5 set up logs (called Queues) to track changes to data on node #4. Once node #4 returns to the cluster, it re-plays the logged changes. All of these operations are handled automatically by the database.
Figure: Handling a transient node failure. The failed state (left) and recovered state (right).
Front End Network Failure
ClustrixDB supports configuring a redundant front end network, which greatly reduces the chance of a complete front end network failure. In the event that such a failure does occur, the failed node will continue to participate in query resolution. However, it will not be able to accept incoming connections from the application.
Back End Network Failure
ClustrixDB does not distinguish between a back end network failure and a node failure. If a node cannot communicate with its peers on the back end network it will fail a heartbeat check and the system will consider the node offline.
ClustrixDB can provide availability even in the face of failure. In order to provide full availability, ClustrixDB requires that:
- A majority of nodes are able to form a cluster (i.e. quorum requirement).
- The available nodes hold at least one replica for each set of data.
In order to understand ClustrixDB's availability modes and failure cases, it is necessary to understand our group membership protocol.
Group Membership and Quorum
ClustrixDB uses a distributed group membership protocol. The protocol maintains two fundamental sets: (a) the static set of all nodes known to the cluster, and (b) the set of nodes which can currently communicate with each other.
The cluster cannot form unless more than half the nodes in the static membership are able to communicate with each other. In other words, ClustrixDB will not tolerate a split-brain scenario.
For example, if a six-node cluster experiences a network partition resulting in two sets of three nodes, ClustrixDB will refuse to form a cluster.
However, if more than half the nodes are able to communicate, ClustrixDB will form a cluster with the majority set.
In the above example, ClustrixDB formed a cluster because it could form a quorum. However, it is possible that such a cluster could offer only partial availability. In other words, the cluster may not have access to the complete data set.
In the following example, ClustrixDB maintains two replicas. However, both nodes holding replicas for A are not able to participate in the cluster (due to some failure or other condition rendering them unavailable). When a transaction attempts to access data on slice A, the database will generate an error that will surface to the application.