Global Checkpoint

Global checkpoint provides low-latency write transactions, with the tradeoff that a small window of transaction durability may be lost in the rare event of hardware or operating system failure. This section explains the background and purpose of the feature, describes how it works in general, and then walks through specific failure modes to illustrate when and how global checkpoint recovery is performed.

Background

Durability of a Single-Instance Database

A transactional database is generally expected to be durable (this is the 'D' in ACID). Specifically, this means that a transaction which has been acknowledged to the client as committed cannot subsequently be lost or forgotten; the effects of that transaction are expected to persist even in the event of a recoverable failure. Examples of recoverable failures include a database software crash, an operating system kernel panic, and a hardware reboot (e.g. due to a power loss event).

In order to provide such a guarantee of durability, databases typically employ some sort of Write-Ahead Log (WAL). The WAL is used to record the details of a transaction before the transaction can be acknowledged as committed, but also typically before all associated data structures have been updated on-disk (hence write-ahead). The writes to this WAL must be done in such a fashion that they will persist in the face of any conceivable failure, e.g. power loss or crash; in practice this means either that writes to disk must be synchronous (flushed immediately), or that the WAL must be stored in some non-volatile memory device (NVRAM). A properly implemented WAL allows the system to recover all committed transactions following an unexpected restart of the database. It does so by replaying the WAL entries, applying changes to database structures (b-trees) which may not have completed before the event, thereby bringing the system into a consistent state so the database can be resumed.
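As a minimal sketch of the write-ahead idea (not ClustrixDB's actual log format), the essential property is that the log record is forced to stable storage before the commit is acknowledged, and that replaying the log on restart reconstructs any work that was still in flight:

    import json, os

    class SimpleWAL:
        """Toy write-ahead log: append and fsync before acknowledging a commit."""

        def __init__(self, path):
            self.f = open(path, "a+")

        def commit(self, txn_id, changes):
            # Record the transaction durably *before* acknowledging it as committed.
            self.f.write(json.dumps({"txn": txn_id, "changes": changes}) + "\n")
            self.f.flush()
            os.fsync(self.f.fileno())   # synchronous write: survives a crash or power loss
            return "committed"          # only now may the client be told the commit succeeded

        def replay(self):
            # On restart, re-apply every logged transaction to repair on-disk structures.
            self.f.seek(0)
            return [json.loads(line) for line in self.f if line.strip()]

    wal = SimpleWAL("/tmp/demo.wal")    # hypothetical path, for illustration only
    wal.commit(7, {"row": 1, "new_value": 42})
    print(wal.replay())                 # after a crash, replay brings the database back up to date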

Durability of a Distributed Database

In addition to the challenge of recovering a single node to a consistent state, a consistent distributed database like ClustrixDB faces a further requirement: all nodes must agree on the state of the system after recovery. That is, a transaction which touched data on three nodes must be fully applied to those three nodes on recovery, or fully undone. If a transaction were somehow applied on some but not all nodes, there would be an inconsistency, which could manifest in users getting different answers depending on the node to which they were connected.

A distributed system thus has a second level of recovery. After the local recovery of individual nodes by WAL replay, the nodes coordinate to determine which transactions should be re-applied or undone in order to bring the nodes into sync. As we will detail further below, if the WAL of each node is fully durable, then committed transactions can always be fully reapplied across the cluster. Conversely, if the WAL is not durable, it may be necessary to undo committed transactions on some nodes in order to bring them into agreement with a recovering node that has lost those transactions; this is the crux of global checkpoint, introduced later.

ClustrixDB Software Durability

Global checkpoint allows a node suffering kernel panic or power failure to rejoin the cluster upon recovery. It also handles site-wide power failure scenarios. In order to do so, the cluster may roll back transactions which were previously acknowledged to clients as committed, in order to bring the cluster to an internally consistent state.

Failure scenario              Global Checkpoint Behavior
---------------------------   -------------------------------------------
Database Software Crash       Group change
Kernel Panic                  Possible rollback of committed transactions
Single Node Power Cycle       Possible rollback of committed transactions
Multiple Node Power Cycle     Possible rollback of committed transactions

During a Group change: In-flight transactions are cancelled, but committed transactions are durable.

When a node is lost and the cluster reprotects: Committed transactions are durable on surviving nodes. The failed node cannot rejoin the cluster, and the cluster must make new copies of all data on that node. The node may be formatted and added back to the cluster.

Data loss: If multiple nodes simultaneously suffer power failure, those nodes are marked as unavailable. Since this generally means loss of all copies of some data, that data would no longer be available. In the case where half or more of the nodes are lost, the cluster cannot form a quorum and thus becomes completely unavailable.

Possible rollback of committed transactions: Transactions committed since the last global checkpoint (typically less than 1 second) are rolled back in order to allow the failed node(s) to rejoin the cluster, while preserving a consistent state amongst nodes in the cluster. 

Global Checkpoint Operation

This subsection describes the operation of the global checkpoint mechanism: first the components involved, then normal steady-state operation, and finally recovery operations.

Components

nanny

The nanny is a simple process management mechanism which ensures that all processes required for proper operation of a ClustrixDB node are running. Itself invoked by initd, it is the parent of both ClustrixDB-specific processes and several standard Unix services crucial to a healthy cluster, such as ntpd. Its logic for process control is very simple: if a child process dies for any reason, it restarts it.

As part of the global checkpoint feature, nanny provides three mechanisms: cleanup tasks, safety state, and integrity state.

Cleanup Tasks

A cleanup task is an additional command which can be assigned to a nanny job, to be run whenever the main command terminates. When the main command exits for whatever reason, the cleanup command is run before the main command is run again to restart the job. The initial usage of this functionality is flushing the WAL to disk when clxnode exits. 
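The following is a minimal sketch of this supervision pattern, with hypothetical command paths shown only in comments; the actual nanny job configuration is not described in this document:

    import subprocess, time

    def supervise(main_cmd, cleanup_cmd):
        """Run main_cmd forever; whenever it exits, run cleanup_cmd before restarting the job."""
        while True:
            subprocess.call(main_cmd)     # the job's main command (e.g. the database process)
            subprocess.call(cleanup_cmd)  # the cleanup task, run before the next restart
            time.sleep(1)                 # brief pause before restarting the job

    # Hypothetical invocation, for illustration only:
    # supervise(["/opt/clustrix/bin/clxnode"], ["/opt/clustrix/bin/clxwalflush"])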

Safety State

Safety state is a mechanism to keep track of whether a node is in a durable state, and communicate this to other nodes. Specifically, it currently tracks whether the shared memory WAL segment has been persisted to disk. 

Integrity State

Integrity state is used by nodes in the cluster to determine when one of the nodes has been restarted (OS reboot or power cycle), in which case a global checkpoint rollback will be required. The primary mechanism for this is a unique integrity value generated each time nanny starts; other nodes track this value, and confirm that the value remains the same each time they talk to the node, in order to recognize a rebooted node by the changed integrity value. 
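As a rough sketch of the idea (not the actual protocol between nodes), the integrity value can be thought of as a random 64-bit token minted once per nanny start; a peer caches the token it last saw and treats any change as evidence that the node was rebooted:

    import secrets

    class IntegrityState:
        def __init__(self):
            # Minted once per nanny start; it survives clxnode restarts but not a reboot.
            self.value = secrets.token_hex(8)   # 64 bits of randomness

    class PeerTracker:
        def __init__(self):
            self.last_seen = {}                 # node id -> integrity value last observed

        def node_was_rebooted(self, node_id, reported_value):
            previous = self.last_seen.get(node_id)
            self.last_seen[node_id] = reported_value
            return previous is not None and previous != reported_value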

clxnode

clxnode is the core database process which runs on each ClustrixDB node. It is started by nanny, and responds on the MySQL port (3306) when the database is operational. 

Shared-memory WAL

A section of shared memory is used for the WAL. This allows for the memory to persist in the face of a clxnode crash, so it can be written to disk by an external process.  

clxwalflush

clxwalflush is a utility which writes the contents of the shared-memory WAL to disk in the event of a clxnode crash. This renders the WAL durable for cases other than OS crash or power failure. 
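A rough sketch of what such a flush amounts to, assuming a POSIX shared memory segment with a hypothetical name: copy the live WAL bytes out of shared memory and fsync them to a file so they survive the clxnode restart. The real clxwalflush utility and segment layout are not documented here.

    import os
    from multiprocessing import shared_memory

    def flush_shared_wal(segment_name, out_path):
        """Copy a shared-memory WAL segment to disk and fsync it."""
        shm = shared_memory.SharedMemory(name=segment_name)   # attach to the existing segment
        try:
            with open(out_path, "wb") as f:
                f.write(bytes(shm.buf))   # persist the WAL bytes
                f.flush()
                os.fsync(f.fileno())      # make the on-disk copy durable before reporting success
        finally:
            shm.close()                   # detach without destroying the segment

    # flush_shared_wal("clx_wal", "/data/clustrix/wal.flush")   # hypothetical names, for illustration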

globc Periodic Task

The globc periodic task runs every half-second, and is responsible for ensuring that every node has durably committed all transactions up to an agreed-upon commit ID. One node is responsible for running the globc task (the responsible node is selected on each new group formation).

Normal Operation

During normal, steady-state operation, clxnode is running on each node, writing transactions to the shared memory WAL on that node. nanny has recorded a 64-bit integrity value, which is uniquely generated each time nanny starts (allowing us to determine whether there's been a kernel panic, power cycle, etc.).

Every 500 milliseconds, the globc periodic task performs the following set of operations:

  1. Determines the latest commit ID (cid) that each node is aware of
  2. Selects the least of these commit IDs to be the globc value
  3. Executes a global checkpoint transaction with the selected globc value, making durable this and all prior committed transactions

Upon completion of the global checkpoint transaction, we can guarantee that every node has durably committed transactions with commit ID less than or equal to the globc value, even in the face of a kernel panic or power failure. The global checkpoint provides a point in the transaction timeline to which all nodes are able to revert, while still remaining consistent with respect to each other. 
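The selection step reduces to taking the minimum of the per-node latest commit IDs; a minimal sketch, with the checkpoint transaction and group messaging omitted:

    def choose_globc(latest_cid_by_node):
        """Pick the highest commit ID that every node has reached: the minimum of the per-node maxima."""
        return min(latest_cid_by_node.values())

    # Example: three nodes report the newest commit ID each is aware of.
    latest = {"node1": 1041, "node2": 1039, "node3": 1040}
    globc = choose_globc(latest)   # 1039: once checkpointed, commits <= 1039 are durable everywhere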

Recovery Operation

Recovery consists of the following steps:

  1. In case of clxnode failure on a node, clxwalflush durably writes shared memory WAL to disk, and nanny updates safety state to indicate WAL is durable
    1. When the failed node recovers, the durable WAL contents allow it to redo btree operations (and the corresponding undo log operations) in local recovery
  2. Surviving nodes resolve all in-flight transactions, committing or rolling back as appropriate (not new to global checkpoint)
  3. Leader checks integrity and safety state of each node in the last group by communicating with the node's nanny
    1. Leader will wait up to 10 seconds for a node's nanny to be reachable
    2. Leader checks integrity and safety state of the nodes
  4. Global checkpoint rollback decision is made:
    1. Rollback is necessary if any of the following are true on any node:
      1. Nanny cannot be reached
      2. Integrity state has changed (i.e. nanny restarted, as in case of reboot)
      3. Safety state does not indicate WAL is durable within 10 seconds (inclusive of time to reach nanny)
    2. Otherwise, if nanny on each node indicates it has not restarted and WAL is durable within 10 seconds, no rollback is required
  5. If rollback is required, each node will walk backwards through the undo log on each device, undoing each transaction found until it has reached the globc value
  6. Now that all nodes have consistent state, the cluster forms and resumes processing user transactions

These steps could be interrupted at any time, for instance if the failed node tries to re-join the cluster while in the midst of global checkpoint rollback. Accommodations are made to handle all such scenarios properly, with the least amount of redundant work.

clxnode Failure and clxwalflush

Whenever clxnode crashes (but the hardware and operating system of the node are OK), nanny immediately runs clxwalflush, which durably writes out to disk the portion of the WAL in the shared memory segment. Once clxwalflush has completed, nanny sets its safety state to indicate that the WAL has been persisted. It will also restart the clxnode process once clxwalflush has finished.

The contents of the WAL allow the restarted clxnode process to redo all operations which may not have completed when it exited abnormally. This includes completing writes to btrees, which are always accompanied by corresponding writes to the undo log; this is crucial because the contents of the undo log can be used to then perform transaction rollback, either as part of normal GTM recovery, or global checkpoint rollback. Note that this does mean that we "go forward in order to go back": we are redoing local operations from the WAL, in order that we might subsequently undo global transactions per GTM or global checkpoint.

The recovery of the surviving nodes depends on the above behavior only insofar as they will wait up to 10 seconds for the failed node to indicate that the WAL has been persisted. After that, surviving nodes proceed with recovery independently of, and in parallel with, local recovery by the failed node's clxnode.

Surviving Nodes Perform Standard GTM Recovery

A standard part of cluster startup, GTM recovery involves the resolution of any transactions which were in-flight when the last group change occurred. This process follows the Paxos protocol, is unchanged from prior versions of ClustrixDB (i.e. prior to global checkpoint), and is thus beyond the scope of this discussion. Suffice it to say that it involves rolling back transactions which have not been acknowledged to clients as committed, while rolling forward transactions which have been acknowledged as committed.

Integrity of Nodes is Validated

In order to determine whether global checkpoint rollback will be required, it is necessary to determine whether any node has lost committed transaction state. In other words, we confirm whether the WAL was durably committed on all nodes. This is done by one arbitrarily selected node, called the leader, which contacts nanny on each node to validate that:

  • nanny has not restarted, which would indicate a node restart, and consequent loss of WAL state
  • clxnode is running normally, or clxwalflush has durably committed WAL shared memory segment to disk (indicated by safety state)

The leader allows up to 10 seconds to validate node integrity; if after 10 seconds the node has not responded or has not yet indicated that WAL is flushed, the node is considered failed, requiring global checkpoint rollback.  Thus a node reboot (which takes minutes) or a network partition lasting more than 10 seconds will result in global checkpoint rollback. 
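A sketch of the decision the leader makes, assuming a hypothetical get_nanny_state(node) callable that returns the node's (integrity value, WAL-durable flag) pair, or None if nanny is unreachable; the 10-second budget covers both reaching nanny and waiting for the WAL-durable indication:

    import time

    def rollback_required(nodes, get_nanny_state, expected_integrity, timeout=10.0):
        """Return True if any node's state forces a global checkpoint rollback."""
        deadline = time.monotonic() + timeout
        pending = set(nodes)                       # nodes not yet confirmed safe
        while pending and time.monotonic() < deadline:
            for node in list(pending):
                state = get_nanny_state(node)      # hypothetical helper: contact the node's nanny
                if state is None:
                    continue                       # unreachable for now; keep retrying until the deadline
                integrity, wal_durable = state
                if integrity != expected_integrity[node]:
                    return True                    # integrity value changed: the node rebooted
                if wal_durable:
                    pending.discard(node)          # this node's WAL is safely on disk
            if pending:
                time.sleep(1.0)                    # poll roughly once per second
        return bool(pending)                       # still unreachable or not durable within the budget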

Normal Recovery

In the case where only the clxnode database process restarts, node integrity is maintained, and no global checkpoint rollback is performed. GTM recovery is sufficient to bring the cluster into a consistent state, and once that is complete a cluster is formed with surviving nodes, and user transactions can again be processed.

Global Checkpoint Rollback Recovery

If node integrity cannot be confirmed, global checkpoint rollback occurs. Recall from the discussion above that the globc periodic task ensures the existence of a globally consistent cluster state, recorded in each node's WAL, within roughly the last half second. The commit ID associated with this state is our globc value, and global checkpoint recovery consists of undoing operations in order to rewind the transactional timeline back to this globc value. We can do this by walking the undo log for each device backwards until we encounter globc, reverting each action. Note that this occurs independently on each node (vs. GTM resolution, which is coordinated amongst nodes in the cluster). Within a node, however, the rollback is coordinated between devices (when multiple devices exist).
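A minimal sketch of the rewind, assuming the undo log can be read newest-first and that each undo record carries the commit ID it belongs to; the per-node coordination across multiple devices is omitted:

    def rollback_to_globc(undo_log, globc, revert):
        """Walk the undo log backwards, reverting records until the globc commit ID is reached.

        undo_log: records ordered oldest-to-newest, each a dict with a 'cid' field.
        revert:   callable that applies a single undo record to the btrees."""
        for record in reversed(undo_log):
            if record["cid"] <= globc:
                break                  # everything at or before the global checkpoint stays committed
            revert(record)             # undo a transaction committed after the checkpoint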

Once each node has completed global checkpoint rollback, the cluster can proceed to form and begin servicing user transactions as usual. 

Failure Mode Scenarios

This subsection will walk through various failure mode scenarios, discussing whether and how global checkpoint is involved. 

Graceful Reboot

In the case of a node being rebooted or shut down from the command line (or SQL SHUTDOWN FULL), logic exists to ensure that the WAL shared memory segment is persisted to disk before the node is actually reset or powered off.  Thus a graceful reboot will not result in any global checkpoint rollback. 

Single clxnode Crash

As detailed above, when clxnode exits, nanny immediately runs clxwalflush before restarting clxnode. The failed node will perform local recovery (replay WAL) before trying to rejoin the cluster.  Meanwhile, the surviving nodes perform GTM recovery, then note the valid integrity check from the failed node, and so can form a new group (less the failed node) without any global checkpoint rollback.  Once the failed node completes local recovery and rejoins the cluster, GTM recovery occurs again (cleaning up any global transaction state from the prior group on the failed node), and the cluster once again is fully operational. 

Multiple clxnode Crash

In the event of multiple simultaneous crashes of clxnode, the situation is not appreciably different from that described above, with respect to global checkpoint.  As long as nanny on each node is able to flush WAL to disk so the restarted clxnode retains committed transaction state, no global checkpoint rollback is required.  GTM recovery may be more complex, and intermediate groups may form where some tables are unavailable, but provided that nodes eventually recover, no committed transactions will be lost.

Single Node Restart

Node restart in this context refers to the node operating system restarting abnormally, due to OS kernel panic or power cycling.  See above for discussion of a user-invoked reboot or shutdown, or clxnode process restart.  

To consider a node reboot, let's break the event down into two parts: the initial loss of the node on kernel panic or power loss, and its reappearance minutes (or more) later after it has booted back up.

Initial Node Loss

  1. The failed node is initially detected by other nodes in the cluster, triggering a group change
  2. Surviving nodes perform GTM recovery to resolve in-flight transactions
  3. Randomly selected leader checks integrity state of each node in the prior group
    1. Try to contact failed node once a second for 10 seconds
    2. No response from the failed node's nanny
  4. Global checkpoint rollback is initiated
    1. Each node executes rollback independently
    2. Any transactions committed since the last global checkpoint are reverted 
  5. A new group is formed with the surviving nodes, and processing of user transactions resumes

Node Reappearance

  1. When a node restarts, nanny will generate a new integrity value.
  2. Recovering node's clxnode starts and performs local recovery with WAL it has on disk.
  3. Recovering node joins cluster, causing group change
  4. GTM recovery will clean up any transaction state on recovering node from group prior to node crash
  5. Randomly selected leader checks integrity state of each node
    1. Leader discovers that recovering node has new integrity value, indicating node was rebooted, thus requiring rollback
    2. However, other nodes have already performed this rollback
  6. Global checkpoint rollback is initiated for recovering node
    1. Only recovering node rolls back transactions, other nodes do nothing
    2. Any transactions committed since recovering node's last global checkpoint are reverted
  7. A new group is formed with all nodes, and processing of user transactions resumes

Multiple Node Restart

In the event of multiple nodes restarting at once, the situation is much the same as the single node restart case. If we have sufficient surviving nodes to achieve quorum, they will be unable to contact nanny on the failed nodes, thus initiating global checkpoint rollback. Note that with the default REPLICAS=2, a cluster missing two or more nodes will have some number of unavailable tables, which typically results in an unusable database. When the failed nodes recover, they will also be rolled back to the same global checkpoint as the surviving nodes, and can then rejoin the cluster.

For the case where all nodes simultaneously fail, such as a site-wide power failure, all nodes recognize that their integrity and safety state are compromised, resulting in rollback to the last global checkpoint before a cluster is formed. 

Network Partition

A network partition, where one or more nodes become inaccessible to the rest of the cluster, can have one of three effects, depending on the duration of the partition (times are approximate and do not take into account GTM resolution):

  • For partitions lasting less than 5 seconds (default GMP ping timeout), transactions may stall, but no group change will occur
  • For partitions lasting between 5 and 15 seconds, a group change occurs, but committed transactions are not lost
  • For partitions lasting over 15 seconds, both a group change and a global checkpoint rollback occur

When connectivity is lost to a node, the GMP (group messaging protocol) heartbeats fail, and after 5 seconds of this a group change is triggered. That group change kicks off a sequence of events like those discussed above: GTM resolution, and then integrity and safety checks. If the network failure is transient, and nanny becomes reachable within 10 seconds after the group change, no global checkpoint rollback is necessary. If the node(s) remain inaccessible, the integrity checks fail after 10 seconds, and this triggers global checkpoint rollback on the surviving nodes; when network connectivity is restored, the partitioned nodes will be rolled back to the same global checkpoint, and then allowed to rejoin the cluster.
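The approximate thresholds above follow directly from these two timers: the 5-second GMP heartbeat timeout plus the leader's 10-second integrity/safety wait. A toy classification under those assumptions:

    GMP_PING_TIMEOUT = 5.0    # seconds of missed heartbeats before a group change (default)
    INTEGRITY_WAIT = 10.0     # seconds the leader waits for nanny after the group change

    def partition_outcome(duration_seconds):
        """Approximate effect of a partition of the given duration (GTM resolution time ignored)."""
        if duration_seconds < GMP_PING_TIMEOUT:
            return "transactions may stall, but no group change"
        if duration_seconds < GMP_PING_TIMEOUT + INTEGRITY_WAIT:
            return "group change, but committed transactions are not lost"
        return "group change plus global checkpoint rollback"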

Note that in the foregoing discussion, we are assuming that the network issue still allows a quorum (half of the nodes in the cluster, plus one) to communicate. If fewer than a quorum of the nodes can talk to each other, no group will form at all, so neither GTM resolution nor global checkpoint rollback can occur. If network connectivity is then completely restored, the integrity and safety checks will pass, so only GTM resolution is necessary; no global checkpoint rollback is required.

Summary of Failure Modes

 

Scenario                      Global Checkpoint Rollback    Committed Transactions Lost
---------------------------   --------------------------    ---------------------------
Graceful Reboot               N                             N
Single Node clxnode Crash     N                             N
Multiple Node clxnode Crash   N                             N
Single Node Restart           Y                             Y
Multiple Node Restart         Y                             Y
Network Partition             depends                       depends

 

 
