
This is documentation for a previous version of ClustrixDB.

The Rebalancer is managed primarily through a set of global variables and can be monitored through several system tables (or vrels).  
As described in the Rebalancer section, the Rebalancer applies a number of actions, such as copying replicas, moving replicas, and splitting or redistributing slices, in order to maintain an optimal distribution of data on the cluster.  It is designed to perform these operations in a manner that minimizes impact to user queries, and so generally requires little administrative action.  However, there may be circumstances where you wish to increase or decrease the aggressiveness of the Rebalancer, such as to quickly rebalance the cluster after adding a node, or to eliminate any possible interference with user queries during periods of heavy load.  

The sections below will discuss monitoring of rebalancer behavior, and specific use cases of rebalancer tuning.  For a complete reference of rebalancer tunable parameters, please see the settings that start with "rebalancer_" in Global Variables.  

Rebalancer Monitoring

The table rebalancer_activity_log maintains a record of current and past rebalancer work.  To see recent activity, order by started, as shown below.  You can also filter for currently executing rebalancer actions with WHERE finished IS NULL.  
 
Check recent rebalancer activity
mysql> select * from rebalancer_activity_log order by started desc limit 10;
+---------------------+-------------+-----------------------------+----------+---------------+------------------------------+------------+---------------------+---------------------+-------+
| id                  | op          | reason                      | database | relation      | representation               | bytes      | started             | finished            | error |
+---------------------+-------------+-----------------------------+----------+---------------+------------------------------+------------+---------------------+---------------------+-------+
| 5832803107035702273 | rerank      | distribution read imbalance | statd    | statd_history | __idx_statd_history__PRIMARY |  236879872 | 2013-01-13 05:35:01 | 2013-01-13 05:35:01 | NULL  | 
| 5832802677131749377 | rerank      | distribution read imbalance | statd    | statd_history | __idx_statd_history__PRIMARY |  478674944 | 2013-01-13 05:33:21 | 2013-01-13 05:33:21 | NULL  | 
| 5832802504311179267 | slice split | slice too big               | statd    | statd_history | __idx_statd_history__PRIMARY |  473628672 | 2013-01-13 05:32:41 | 2013-01-13 05:34:08 | NULL  | 
| 5832791312486337538 | rerank      | distribution read imbalance | statd    | statd_history | __idx_statd_history__PRIMARY |  475987968 | 2013-01-13 04:49:15 | 2013-01-13 04:49:15 | NULL  | 
| 5832791036763671553 | slice split | slice too big               | statd    | statd_history | __idx_statd_history__PRIMARY | 1195999232 | 2013-01-13 04:48:11 | 2013-01-13 04:49:15 | NULL  | 
| 5832788503671368706 | rerank      | distribution read imbalance | statd    | statd_history | __idx_statd_history__PRIMARY |  754778112 | 2013-01-13 04:38:21 | 2013-01-13 04:38:21 | NULL  | 
| 5832788202047166465 | slice split | slice too big               | statd    | statd_history | __idx_statd_history__PRIMARY |  471269376 | 2013-01-13 04:37:11 | 2013-01-13 04:38:29 | NULL  | 
| 5832674257801927682 | rerank      | distribution read imbalance | statd    | statd_history | __idx_statd_history__PRIMARY |  754778112 | 2013-01-12 21:15:01 | 2013-01-12 21:15:01 | NULL  | 
| 5832673827981474818 | rerank      | distribution read imbalance | statd    | statd_history | __idx_statd_history__PRIMARY |  471400448 | 2013-01-12 21:13:21 | 2013-01-12 21:13:21 | NULL  | 
| 5832673526398766083 | slice split | slice too big               | statd    | statd_history | __idx_statd_history__PRIMARY |  755400704 | 2013-01-12 21:12:11 | 2013-01-12 21:13:43 | NULL  | 
+---------------------+-------------+-----------------------------+----------+---------------+------------------------------+------------+---------------------+---------------------+-------+
10 rows in set (0.32 sec)

For details such as target/destination for in-progress rebalancer actions, JOIN (using id) to rebalancer_activity_targets, rebalancer_copy_activity, rebalancer_redistributes, or rebalancer_splits.  Note that these are vrels (virtual relations, as opposed to actual tables), and so are only populated for the duration of the activity.  
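For example, the following sketch lists in-progress actions together with their targets.  The join key (id) is taken from the text above; the remaining columns of rebalancer_activity_targets vary by version, so the query selects them all:

```sql
-- Sketch: list in-progress rebalancer actions with their targets.
-- rebalancer_activity_targets is a vrel, so its rows exist only
-- while the corresponding action is running.
select al.op, al.reason, al.relation, t.*
  from rebalancer_activity_log al
  join rebalancer_activity_targets t using (id)
 where al.finished is null;
```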

For a historical log of rebalancer actions, you can also set the global rebalancer_activity_log_level to 2.  This will cause all rebalancer actions to be logged to clustrix.log (or sprout.log on the appliance), on the node which is currently running the rebalancer tasks (you can determine this from system.periodic_tasks).
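For example (the exact schema of system.periodic_tasks may vary by version, so the second query simply selects all columns):

```sql
-- Enable historical logging of all rebalancer actions to clustrix.log.
set global rebalancer_activity_log_level = 2;

-- Identify which node is currently running the rebalancer tasks.
select * from system.periodic_tasks;
```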

Rebalancer Tuning

The aggressiveness of the rebalancer is controlled by several global variables, which differ depending on the particular rebalancer action (e.g. splits vs. rebalance):
  • Global limit on number of concurrent rebalancer actions (rebalancer_global_task_limit), applies to all types of rebalancer actions
  • Per action type limit (rebalancer_rebalance_task_limit, rebalancer_redistribute_task_limit, rebalancer_split_task_limit)
  • Limit on number of concurrent rebalancer actions touching a device (rebalancer_vdev_task_limit) 
  • Frequency of the tasks which can enqueue a particular rebalancer action (task_rebalancer_*_interval_ms)
  • Delay between rebalancer transaction and start of copy operation (rebalancer_copy_delay_s)
For a new rebalancer task to be started, it must not cause any of the limits to be exceeded.  Some tasks are usually limited by the per-task limit; rebalance moves, for instance, default to only 1 at a time.  Note that reprotect does not have any per-task limit, so it is limited by the global limit, but more importantly by the vdev limit.  The default value of 1 for vdev limit means that a device will only be the target of a single rebalancer action.  Particularly for software (non-appliance) installations, this is often the effective limit to rebalancer concurrency.  
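Before tuning, you can inspect the current values of these limits and intervals; for example:

```sql
-- Show the global, per-action, and per-device concurrency limits (read-only).
show global variables like 'rebalancer%task_limit';
-- Show the task intervals that control how often operations are enqueued.
show global variables like 'task_rebalancer%interval_ms';
```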

The frequency of the tasks determines how often operations, such as rebalancer moves, may be enqueued; in the case where there are many small containers, such that copies/moves take only a few seconds, a default frequency of 30 seconds may mean that the rebalancer enqueues operations much less frequently than it could.  Most rebalancer tasks enqueue a limited number of operations at a time, as the operations required to achieve ideal balance change over time.  The notable exception is soft-fail, which enqueues all work to be performed once a node or disk has been soft-failed.  

For operations other than reprotect, the rebalancer pauses for 5 seconds (default) after starting the transaction, before commencing the actual copy from source to target replica.  This is done to reduce the chances of an outstanding user transaction conflicting with the rebalancer operation, in which case the user transaction will be cancelled, with the error "MVCC serializable scheduler conflict."  Note that reprotect, having a higher priority, does not implement this delay.   
Following are some common use cases for tuning the rebalancer settings.  Please see Global Variables for a complete list of rebalancer tunable parameters, and consult with Clustrix Support if you find the need to change parameters not discussed below.  

Increasing Rebalance Aggressiveness

By design (as described in Rebalancer), the rebalancer takes a somewhat leisurely approach to rebalancing data across the cluster.  Since data imbalances between nodes typically take some time to manifest and rarely cause significant performance issues, this is generally acceptable.  However, in some situations it is desirable to rebalance much more quickly:

  • After expanding a cluster to more nodes, particularly where load is very low off-peak (or in an evaluation situation)
  • After replacing a failed node, where balanced workload is critical to meeting performance requirements

Following are recommended changes to increase rebalancer aggressiveness:


Increasing Rebalance Aggressiveness
set global rebalancer_vdev_task_limit=4;
set global rebalancer_rebalance_task_limit=8;
set global task_rebalancer_rebalance_distribution_interval_ms=5000;
set global task_rebalancer_rebalance_interval_ms=5000;

Additionally, removing the copy delay allows the rebalancer to move smaller replicas much more quickly, with the caveat that write transactions may encounter "MVCC serializable scheduler conflict" errors:

Removing Rebalancer Copy Delay
set global rebalancer_copy_delay_s=0;

If these settings cause too great a load, reduce the vdev or rebalance task limit.  

Once the rebalancer has finished rebalancing, reset these globals back to their defaults with set <global>=default;
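For example, to restore the defaults for the settings changed above:

```sql
-- Reset the rebalance tuning globals to their default values.
set global rebalancer_vdev_task_limit=default;
set global rebalancer_rebalance_task_limit=default;
set global task_rebalancer_rebalance_distribution_interval_ms=default;
set global task_rebalancer_rebalance_interval_ms=default;
set global rebalancer_copy_delay_s=default;
```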

Increasing Soft-Fail Aggressiveness

As described in Administering Failure and Recovery, soft-fail is a means of moving all data from a node (or disk) in preparation for decommissioning or replacing a node.  With proper use of soft-fail, the system maintains full protection of all data; if a node is removed without soft-fail, there is a window (until reprotect completes) where a failure could lead to data loss.

Soft-fail is treated as high priority by the rebalancer.  Soft-fail differs from rebalancing in that there is no per-task limit on soft-fail, and the task interval does not limit the frequency of moves, since all operations are queued at the time of the soft-fail (see system.rebalancer_queued_activity).  Thus soft-fail is limited only by the vdev limit and the global limit.  
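For example, to watch the queued soft-fail work drain:

```sql
-- All moves enqueued by the soft-fail; rows disappear as work completes.
select * from system.rebalancer_queued_activity;
```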

Following are recommended changes to increase soft-fail aggressiveness:

Increasing Soft-fail Aggressiveness
set global rebalancer_vdev_task_limit=16; -- necessary for software installations, where there is only one device per node
set global rebalancer_global_task_limit=32;

Additionally, removing the copy delay allows the rebalancer to move smaller replicas much more quickly, with the caveat that write transactions may encounter "MVCC serializable scheduler conflict" errors:

Removing Rebalancer Copy Delay
set global rebalancer_copy_delay_s=0;

If these settings cause too great a load, reduce the vdev or global task limit.  

Once the soft-fail has completed, reset these globals back to their defaults with set <global>=default;
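For example, to restore the defaults for the settings changed above:

```sql
-- Reset the soft-fail tuning globals to their default values.
set global rebalancer_vdev_task_limit=default;
set global rebalancer_global_task_limit=default;
set global rebalancer_copy_delay_s=default;
```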

Disabling Rebalancer Entirely

To disable the rebalancer, set each of the rebalancer task intervals to 0:

Disabling The Rebalancer
set global task_rebalancer_rebalance_interval_ms=0;
set global task_rebalancer_rebalance_distribution_interval_ms=0;
set global task_rebalancer_split_interval_ms=0;
set global task_rebalancer_redistribute_interval_ms=0;

Note that it is not necessary to disable the soft-fail task, as this task will do nothing unless a soft-fail is initiated.  Similarly, reprotect is not disabled, since it is only active in the event of a disk or node failure, and accidentally leaving it disabled increases the chance of data loss, should a double failure occur.  The other tasks (reap, rerank) have negligible effect on cluster performance, and so are left alone.

We do not recommend leaving the rebalancer disabled for long periods of time, as the rebalancer plays a crucial role in maintaining optimal database performance.  
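To re-enable the rebalancer, restore each of the task intervals to its default:

```sql
-- Re-enable the rebalancer by resetting the task intervals.
set global task_rebalancer_rebalance_interval_ms=default;
set global task_rebalancer_rebalance_distribution_interval_ms=default;
set global task_rebalancer_split_interval_ms=default;
set global task_rebalancer_redistribute_interval_ms=default;
```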
