Global checkpoint provides low-latency write transactions with a tradeoff of minimal loss of transaction durability in the rare event of hardware or operating system failure. This section will explain the background and purpose of the feature, describe how it works generally, and then walk through specific failure modes to illustrate when and how global checkpoint recovery is performed.
A transactional database is generally expected to be durable (this is the 'D' in ACID). Specifically this means that a transaction which has been acknowledged to the client as committed cannot subsequently be lost or forgotten; the effects of that transaction are expected to persist even in the event of a recoverable failure. Examples of recoverable failures include database software crash, operating system kernel panic, or hardware reboot (e.g. power loss event).
In order to provide such a guarantee of durability, databases typically employ some sort of Write-Ahead Log (WAL). The WAL is used to record the details of a transaction before the transaction can be acknowledged as committed, but also typically before all associated data structures have been updated on-disk (hence write-ahead). The writes to this WAL must be done in such a fashion that they will persist in the face of any conceivable failure, e.g. power loss or crash; in practice this means either that writes to disk must be synchronous (flushed immediately), or the WAL may be stored in some non-volatile memory device (NVRAM). A properly implemented WAL allows the system to recover all committed transactions following an unexpected restart of the database. It does so by replaying the WAL entries, applying changes to database structures (b-trees) which may not have completed before the event, thereby bringing the system into a consistent state so the database can be resumed.
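The write-ahead, flush, then acknowledge ordering can be sketched in a few lines. This is a minimal illustration of the general technique, not ClustrixDB's actual WAL; the class and record format are hypothetical:

```python
import os

class WriteAheadLog:
    """Minimal WAL sketch: a record is durable once fsync returns."""

    def __init__(self, path):
        self.path = path
        # O_APPEND ensures each record lands at the end of the log.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def commit(self, record: bytes) -> None:
        # Write the log record first (write-ahead), then force it to
        # stable storage; only after fsync returns may the transaction
        # be acknowledged to the client as committed.
        os.write(self.fd, record + b"\n")
        os.fsync(self.fd)  # synchronous flush: survives crash or power loss

    def replay(self):
        # On restart, read back every record so in-progress structure
        # updates can be completed.
        with open(self.path, "rb") as f:
            return [line.rstrip(b"\n") for line in f]
```

The key design point is the ordering: the `fsync` happens before the acknowledgment, so a crash can lose at most transactions that were never acknowledged.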
In addition to the challenge of recovering a single node to a consistent state, a consistent distributed database like ClustrixDB faces an additional requirement: all nodes must agree on the state of the system after recovery. That is, a transaction which touched data on three nodes must be fully applied to those three nodes on recovery, or fully undone. If a transaction were somehow applied on some but not all nodes, there would be an inconsistency, which could manifest in users getting different answers depending on the node to which they were connected.
A distributed system thus has a second level of recovery. After the local recovery of individual nodes by WAL replay, the nodes coordinate to determine which transactions should be re-applied or undone in order to bring the nodes into sync. As we will detail further below, if the WAL of each node is fully durable, then committed transactions can always be fully reapplied across the cluster. Conversely, if the WAL is not durable, it may be necessary to undo committed transactions on some nodes in order to bring them into agreement with a recovering node that has lost those transactions; this is the crux of global checkpoint, introduced later.
Global checkpoint allows a node suffering kernel panic or power failure to rejoin the cluster upon recovery. It also handles site-wide power failure scenarios. In order to do so, the cluster may roll back transactions which were previously acknowledged to clients as committed, in order to bring the cluster to an internally consistent state.
| Failure scenario | Global Checkpoint Behavior |
| --- | --- |
| Database Software Crash | Group change |
| Kernel Panic | Possible rollback of committed transactions |
| Single Node Power Cycle | Possible rollback of committed transactions |
| Multiple Node Power Cycle | Possible rollback of committed transactions |
During a Group change: In-flight transactions are cancelled, but committed transactions are durable.
When a node is lost and the cluster reprotects: Committed transactions are durable on surviving nodes. The failed node cannot rejoin the cluster, and the cluster must make new copies of all data on that node. The node may be formatted and added back to the cluster.
Data loss: If multiple nodes simultaneously suffer power failure, those nodes are marked as unavailable. Since this generally means loss of all copies of some data, that data would no longer be available. In the case where half of the nodes are lost, the cluster cannot form quorum and thus becomes completely unavailable.
Possible rollback of committed transactions: Transactions committed since the last global checkpoint (typically less than 1 second) are rolled back in order to allow the failed node(s) to rejoin the cluster, while preserving a consistent state amongst nodes in the cluster.
This subsection will describe the operation of the global checkpoint mechanism. First, we will define the components involved in global checkpoint. Then we will describe normal operation of global checkpoint. Finally, we will describe recovery operations.
The nanny is a simple process management mechanism which ensures that all processes required for proper operation of a ClustrixDB node are running. Itself invoked by initd, it is the parent of both ClustrixDB-specific processes and several standard Unix services crucial to a healthy cluster, such as ntpd. Its logic for process control is very simple: if a child process dies for any reason, nanny restarts it.
As part of the global checkpoint, nanny includes three mechanisms: cleanup tasks, safety state, and integrity state.
A cleanup task is an additional command which can be assigned to a nanny job, to be run whenever the main command terminates. When the main command exits for whatever reason, the cleanup command is run before the main command is run again to restart the job. The initial usage of this functionality is flushing the WAL to disk when clxnode exits.
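The restart-with-cleanup loop can be sketched as follows. This is an illustrative model of the job logic, not nanny's actual implementation; `run_main` and `run_cleanup` stand in for, e.g., clxnode and a WAL-flushing cleanup command:

```python
def supervise(run_main, run_cleanup, restarts):
    """Sketch of a nanny job with a cleanup task: each time the main
    command exits, the cleanup command runs before the main command
    is restarted."""
    log = []
    for _ in range(restarts):
        log.append(("main", run_main()))        # blocks until the child exits
        log.append(("cleanup", run_cleanup()))  # e.g. flush the WAL to disk
    return log
```

The important property is the ordering: cleanup always runs between an exit and the subsequent restart, so the WAL flush cannot be skipped.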
Safety state is a mechanism to keep track of whether a node is in a durable state, and communicate this to other nodes. Specifically, it currently tracks whether the shared memory WAL segment has been persisted to disk.
Integrity state is used by nodes in the cluster to determine when one of the nodes has been restarted (OS reboot or power cycle), in which case a global checkpoint rollback will be required. The primary mechanism for this is a unique integrity value generated each time nanny starts; other nodes track this value, and confirm that the value remains the same each time they talk to the node, in order to recognize a rebooted node by the changed integrity value.
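The detection logic reduces to a simple comparison. This sketch is illustrative; the actual generation scheme and RPC format are not specified here:

```python
import secrets

def new_integrity_value() -> int:
    """A fresh random 64-bit value, generated each time nanny starts
    (sketch; the real generation scheme may differ)."""
    return secrets.randbits(64)

def node_was_restarted(last_seen: int, current: int) -> bool:
    # A changed integrity value means nanny (and hence, in practice,
    # the node) restarted since we last talked to it, so its in-memory
    # WAL state cannot be trusted.
    return last_seen != current
```

Because the value is regenerated only when nanny starts, a mere clxnode crash (nanny survives) leaves it unchanged, while a reboot or power cycle changes it.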
clxnode is the core database process which runs on each ClustrixDB node. It is started by nanny, and responds on the MySQL port (3306) when the database is operational.
A section of shared memory is used for the WAL. This allows the memory to persist in the face of a clxnode crash, so it can be written to disk by an external process.
clxwalflush is a utility which writes the contents of the shared-memory WAL to disk in the event of a clxnode crash. This renders the WAL durable for cases other than OS crash or power failure.
The globc periodic task runs every half-second, and is responsible for ensuring that every node has durably committed all transactions up to an agreed-upon transaction ID. One node is responsible for running the globc task (the responsible node is selected on each new group formation).
During normal, steady-state operation, clxnode is running on each node, writing transactions to the shared memory WAL on that node. nanny has recorded a 64-bit integrity value, which is uniquely generated each time nanny starts (allowing us to determine whether there's been a kernel panic, power cycle, etc.).

Every 500 milliseconds, the globc periodic task performs the following set of operations:

1. Determine the latest commit ID (cid) that each node is aware of
2. Durably record the agreed-upon cid as the new globc value, via a global checkpoint transaction
Upon completion of the global checkpoint transaction, we can guarantee that every node has durably committed transactions with commit ID less than or equal to the globc value, even in the face of a kernel panic or power failure. The global checkpoint provides a point in the transaction timeline to which all nodes are able to revert, while still remaining consistent with respect to each other.
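The core of the guarantee can be expressed as a one-line computation. This sketch assumes a hypothetical report format (a mapping from node name to the latest commit ID that node has durably committed):

```python
def compute_globc(durable_cids):
    """Sketch of the globc computation: the global checkpoint value is
    the minimum of the latest durably committed commit IDs reported by
    each node, so every node is guaranteed to have durably committed
    all transactions with commit ID <= globc."""
    return min(durable_cids.values())
```

Taking the minimum is what makes the value safe as a rollback target: no node can be missing any transaction at or below it.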
Recovery consists of the following steps:

1. clxnode failure on a node,
2. clxwalflush durably writes the shared memory WAL to disk, and
3. nanny updates safety state to indicate the WAL is durable
These steps could be interrupted at any time, for instance if the failed node tries to re-join the cluster while in the midst of global checkpoint rollback. Accommodations are made to handle all such scenarios properly, with the least amount of redundant work.
When clxnode crashes (but the hardware and operating system of the node are OK), nanny immediately runs clxwalflush, which durably writes out to disk the portion of the WAL in the shared memory segment. Once clxwalflush has completed, nanny sets its safety state to indicate that the WAL has been persisted. It will also restart the clxnode process once clxwalflush has finished.
The contents of the WAL allow the restarted clxnode process to redo all operations which may not have completed when it exited abnormally. This includes completing writes to btrees, which are always accompanied by corresponding writes to the undo log; this is crucial because the contents of the undo log can be used to then perform transaction rollback, either as part of normal GTM recovery, or global checkpoint rollback. Note that this does mean that we "go forward in order to go back": we are redoing local operations from the WAL, in order that we might subsequently undo global transactions per GTM or global checkpoint.
The recovery of the surviving nodes depends on the above behavior only insofar as they will wait up to 10 seconds for the failed node to indicate that its WAL has been persisted. After that, surviving nodes proceed with recovery independent of, and in parallel with, the failed node's local recovery.
A standard part of cluster startup, GTM recovery involves the resolution of any transactions which were in-flight when the last group change occurred. This process follows the Paxos protocol, is unchanged from prior versions of ClustrixDB (i.e. prior to global checkpoint), and is thus beyond the scope of this discussion. Suffice it to say that it involves rolling back transactions which have not been acknowledged to clients as committed, while rolling forward transactions which have been acknowledged as committed.
In order to determine whether global checkpoint rollback will be required, it is necessary to determine whether any node has lost committed transaction state. In other words, we confirm whether the WAL was durably committed on all nodes. This is done by one arbitrarily selected node, called the leader, who will contact nanny on each node to validate that:

- nanny has not restarted, which would indicate a node restart and consequent loss of WAL state, and
- either clxnode is running normally, or
- clxwalflush has durably committed the WAL shared memory segment to disk (indicated by safety state)
The leader allows up to 10 seconds to validate node integrity; if after 10 seconds the node has not responded or has not yet indicated that WAL is flushed, the node is considered failed, requiring global checkpoint rollback. Thus a node reboot (which takes minutes) or a network partition lasting more than 10 seconds will result in global checkpoint rollback.
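The leader's per-node check amounts to polling with a deadline. The following is an illustrative model, not the actual implementation; `poll_safety` stands in for the real RPC to nanny, and the injectable `clock`/`sleep` parameters are only there to make the sketch testable:

```python
import time

def validate_node(poll_safety, deadline_s=10.0, interval_s=0.1,
                  clock=time.monotonic, sleep=time.sleep):
    """Sketch of the leader's integrity/safety check for one node:
    poll the node's nanny for up to deadline_s seconds, looking for
    confirmation that the WAL is durable (clxnode running normally,
    or clxwalflush finished, per safety state)."""
    start = clock()
    while clock() - start < deadline_s:
        if poll_safety():
            return "ok"        # WAL confirmed durable; no rollback needed
        sleep(interval_s)
    return "failed"            # node considered failed; rollback required
```

A rebooting node (minutes) or a partition lasting longer than the deadline thus lands in the "failed" branch, matching the behavior described above.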
In the case where only the clxnode database process restarts, node integrity is maintained, and no global checkpoint rollback is performed. GTM recovery is sufficient to bring the cluster into a consistent state; once that is complete, a cluster is formed with the surviving nodes, and user transactions can again be processed.
If node integrity cannot be confirmed, global checkpoint rollback occurs. Recall from the discussion above that the globc periodic task ensures the existence of a globally consistent cluster state, recorded in each node's WAL, within roughly the last half second. The commit ID associated with this state is our globc value, and global checkpoint recovery consists of undoing operations in order to rewind the transactional timeline back to this globc value. We do this by walking the undo log for each device backwards until we encounter globc, reverting each action. Note that this occurs independently on each node (vs. GTM resolution, which is coordinated amongst nodes in the cluster). Within a node, however, the rollback is coordinated between devices (when multiple devices exist).
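The backward undo-log walk can be sketched as follows. The entry format here, `(cid, key, previous_value)`, is purely illustrative (not the on-disk format), with `previous_value` of `None` denoting an insert to revert:

```python
def rollback_to_globc(undo_log, state, globc):
    """Sketch of per-device global checkpoint rollback: walk the undo
    log backwards, reverting each action, until every remaining entry
    has a commit ID at or below globc."""
    while undo_log and undo_log[-1][0] > globc:
        cid, key, prev = undo_log.pop()
        if prev is None:
            del state[key]       # revert an insert
        else:
            state[key] = prev    # revert an update
    return state
```

Because every node rewinds to the same globc value, the nodes end up mutually consistent even though each performs its rollback independently.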
Once each node has completed global checkpoint rollback, the cluster can proceed to form and begin servicing user transactions as usual.
This subsection will walk through various failure mode scenarios, discussing whether and how global checkpoint is involved.
In the case of a node being rebooted or shut down from the command line (or SQL SHUTDOWN FULL), logic exists to ensure that the WAL shared memory segment is persisted to disk before the node is actually reset or powered off. Thus a graceful reboot will not result in any global checkpoint rollback.
As detailed above, when clxnode crashes, nanny immediately runs clxwalflush before restarting clxnode. The failed node will perform local recovery (replaying the WAL) before trying to rejoin the cluster. Meanwhile, the surviving nodes perform GTM recovery, then note the valid integrity check from the failed node, and so can form a new group (less the failed node) without any global checkpoint rollback. Once the failed node completes local recovery and rejoins the cluster, GTM recovery occurs again (cleaning up any global transaction state from the prior group on the failed node), and the cluster is once again fully operational.
In the event of multiple simultaneous crashes of clxnode, the situation is not appreciably different, with respect to global checkpoint, from that described above. As long as nanny on each node is able to flush the WAL to disk so that the restarted clxnode retains committed transaction state, no global checkpoint rollback is required. GTM recovery may be more complex, and intermediate groups may form in which some tables are unavailable, but provided that the nodes eventually recover, no committed transactions will be lost.
Node restart in this context refers to the node's operating system restarting abnormally, due to an OS kernel panic or power cycling. See above for discussion of a user-invoked reboot or shutdown, or of a clxnode process restart.
To consider a node reboot, let's break the event down into two parts: the initial loss of the node on kernel panic or power loss, and its reappearance minutes (or more) later after it has booted back up.
When the failed node boots back up:

- nanny will generate a new integrity value.
- clxnode starts and performs local recovery with the WAL it has on disk.
In the event of multiple nodes restarting at once, the situation is much the same as the single node failure case. If we have sufficient surviving nodes to achieve quorum, they will be unable to contact nanny on the failed nodes, thus initiating global checkpoint rollback. Note that with the default REPLICAS=2, a cluster missing two or more nodes will have some number of unavailable tables, which typically results in an unusable database. When the failed nodes recover, they will also be rolled back to the same global checkpoint as the surviving nodes, and can then rejoin the cluster.
For the case where all nodes simultaneously fail, such as a site-wide power failure, all nodes recognize that their integrity and safety state are compromised, resulting in rollback to the last global checkpoint before a cluster is formed.
A network partition, where one or more nodes become inaccessible to the rest of the cluster, can have one of three effects, depending on the duration of the partition (times are approximate and do not take into account GTM resolution):
When connectivity is lost to a node, the GMP (group messaging protocol) heartbeats fail, and after 5 seconds of this a group change is triggered. That group change kicks off a sequence of events like those discussed above: GTM resolution, and then integrity and safety checks. If the network failure is transient, and nanny becomes reachable within 10 seconds after the group change, no global checkpoint rollback is necessary. If the node(s) remain inaccessible, the integrity checks fail after 10 seconds, and this triggers global checkpoint rollback on the surviving nodes; when network connectivity is restored, the partitioned nodes will be rolled back to the same global checkpoint, and then allowed to rejoin the cluster.
Note that in the foregoing discussion, we are assuming that the network issue still allows a quorum (50% of the nodes in the cluster, plus 1) to communicate. If fewer than that number of nodes could talk to each other, no group would form at all, and thus neither GTM resolution nor global checkpoint rollback could occur. If network connectivity is then completely restored, the integrity and safety checks will pass, so only GTM resolution is necessary; no global checkpoint rollback is required.
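The quorum rule referenced throughout can be stated in one line. This is a sketch of the rule as described above, with a hypothetical function name:

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A group can form only if it contains at least half of the
    cluster's nodes, plus one (i.e. a strict majority)."""
    return reachable_nodes >= cluster_size // 2 + 1
```

For example, a 4-node cluster needs 3 reachable nodes to form a group, which is why losing half of the nodes (2 of 4) leaves the cluster completely unavailable.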
| Scenario | Global Checkpoint Rollback | Committed Transactions Lost |
| --- | --- | --- |
| Single Node clxnode Crash | N | N |
| Multiple Node clxnode Crash | N | N |
| Single Node Restart | Y | Y |
| Multiple Node Restart | Y | Y |