This section describes some of the sophisticated controls that are available for managing data distribution. The default behavior for replicas and distribution is sufficient for many workloads.
By default, ClustrixDB uses a hash-based algorithm to distribute data. In addition, ClustrixDB enables you to control data placement in the cluster. The following sections describe the default approach and tell you how to configure data distribution.
This advice is general; always consider your application's specific workload when choosing an optimization strategy.
The examples in this topic are based on the following table:
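The table itself is not reproduced here. The following sketch is a hypothetical reconstruction, with table and column definitions inferred from the references to user_id, posted_on, and post_id elsewhere in this topic:

```sql
-- Hypothetical reconstruction of the example table: a bigint primary
-- key (post_id) and a two-column secondary key on (user_id, posted_on).
-- The table name, key name, and the body column are assumptions.
CREATE TABLE posts (
    post_id   BIGINT NOT NULL AUTO_INCREMENT,
    user_id   BIGINT NOT NULL,
    posted_on DATETIME NOT NULL,
    body      TEXT,
    PRIMARY KEY (post_id),
    KEY k_user_posted (user_id, posted_on)
);
```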
By default, ClustrixDB uses all the columns of an index to distribute data. The hash value is computed over (user_id, posted_on). For queries that specify exact constraints on both columns, the database determines which node in the cluster to use.
However, consider the following query:
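The query is not shown here; a query of the following shape matches the description that follows (an equality constraint on user_id but none on posted_on), assuming the hypothetical posts table:

```sql
-- Hypothetical query: only user_id has an equality constraint, so the
-- (user_id, posted_on) hash cannot pinpoint a single node.
SELECT *
FROM   posts
WHERE  user_id = 42
ORDER  BY posted_on DESC
LIMIT  10;
```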
There is no exact (equality) constraint on the posted_on column. The cluster must search all possible locations for such rows, which can result in poor query performance and scalability. The cluster can still take advantage of index ordering and apply the limit at each node, but the query might go to nodes that do not hold any data for the specified user.
However, evaluating a query on multiple nodes can be advantageous when the query must find posts that share a common property across multiple users. Such a query runs on multiple nodes in parallel, greatly speeding up computation. Even when a query must visit multiple nodes, the DBMS continues to optimize, pushing the limit down to each node to reduce the overall cost of query evaluation.
To control data distribution, you can define, on a per-index basis, how many leading columns of a multi-column key are used for data placement. To configure data distribution, specify the DISTRIBUTE clause. For example, to limit the index hash to the first column, issue the following command:
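The command is not shown here. A plausible form, assuming a hypothetical posts table with a named secondary key on (user_id, posted_on), is:

```sql
-- Hypothetical: hash the (user_id, posted_on) key on its first
-- column (user_id) only. The table and index names are assumptions.
ALTER TABLE posts INDEX k_user_posted DISTRIBUTE = 1;
```

The same effect can be declared at table-creation time by appending DISTRIBUTE = 1 to the KEY clause in CREATE TABLE.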
Instead of using both columns of the index, the cluster uses only the first column. This layout places all posts by a specific user on one node, so the cluster needs to consult only the node that holds the data of interest.
The columns hashed for data distribution must be a contiguous prefix of the index; you cannot skip columns.
For example, for the three-column index (a, b, c), you cannot distribute over (a, c), skipping b. The following are valid: (a), (a, b), or (a, b, c).
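The command the next passage describes is not shown. A plausible sketch, assuming the hypothetical posts table with key (user_id, posted_on) and primary key (post_id), is a DISTRIBUTE count that exceeds the key's own column count, extending the hash into the primary-key columns that are appended to every secondary index:

```sql
-- Hypothetical: DISTRIBUTE = 3 on a two-column key extends the
-- distribution hash into the appended primary-key column, so the
-- hash is computed over (user_id, posted_on, post_id).
ALTER TABLE posts INDEX k_user_posted DISTRIBUTE = 3;
```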
This approach causes the distribution for this key to be based on the key's own columns plus the first column of the primary key. If the primary key is a compound key, you can expand the distribution to include more of its columns. In this example, the hash value computed for distribution is based on (user_id, posted_on, post_id).
Instead of using the DISTRIBUTE clause as noted in the previous example to include the primary key in the distribution, you can use the global variable distribute_include_primary_key. That global is used as the default when DISTRIBUTE is not specified. Set the variable to true (default is false) to consistently include the primary key in the distribution.
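For example, the variable can be set with a standard SET GLOBAL statement (the variable name is taken from the text above; the syntax is the usual MySQL-compatible form):

```sql
-- Include primary-key columns in secondary-index distribution by
-- default whenever DISTRIBUTE is not specified (default is false).
SET GLOBAL distribute_include_primary_key = true;
```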
To modify the distribution or slices of the primary key after a table has been created, issue the ALTER TABLE command as follows:
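The command is not shown here. A plausible sketch, using a hypothetical table whose primary key has at least two columns (per the constraint noted below, DISTRIBUTE cannot exceed the number of primary-key columns):

```sql
-- Hypothetical: redistribute the primary key over its first two
-- columns and split it into 8 slices. Table name and values are
-- assumptions.
ALTER TABLE user_posts PRIMARY KEY DISTRIBUTE = 2 SLICES = 8;
```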
For a compound (multi-column) primary key, the distribution count cannot exceed the number of columns that compose the primary key.
To modify the distribution of indexes after a table has been created, issue the ALTER TABLE command as follows:
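The command is not shown here. A plausible sketch, assuming a hypothetical posts table with a named secondary key:

```sql
-- Hypothetical: change the distribution and slice count of a named
-- secondary index. Table name, index name, and values are assumptions.
ALTER TABLE posts INDEX k_user_posted DISTRIBUTE = 1 SLICES = 4;
```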
By default, the system might create numerous small slices when the user changes the distribution of an index. The SLICES argument overrides the system's default behavior.
To determine the optimal number of slices, inspect system.table_sizes and system.index_sizes as follows:
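The queries are not shown here; a minimal sketch against the system tables named in the text:

```sql
-- Inspect per-table and per-index sizes to inform slice counts.
-- Filter predicates are omitted because the exact column names of
-- these system tables are not given in this topic.
SELECT * FROM system.table_sizes;
SELECT * FROM system.index_sizes;
```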
The Size column is the size of the index in bytes. When calculating optimum slice counts for tables and indexes, strive for slices of approximately 1GB or less if you expect the table to grow significantly. Ensure that tables and indexes have a minimum of three slices.
When creating a table, you can explicitly specify the number of replicas:
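The statement is not shown here. A plausible sketch, using a hypothetical table definition:

```sql
-- Hypothetical: create a table with three replicas of each slice.
CREATE TABLE posts (
    post_id BIGINT NOT NULL AUTO_INCREMENT,
    user_id BIGINT NOT NULL,
    PRIMARY KEY (post_id)
) REPLICAS = 3;
```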
If no value is specified for the number of replicas, ClustrixDB creates tables with two replicas. The default number of replicas can be configured via the global variable default_replicas. Once a table has been created, you can use an ALTER TABLE statement to modify the number of replicas:
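The statement is not shown here; a sketch, assuming a hypothetical posts table:

```sql
-- Hypothetical: raise the replica count of an existing table.
ALTER TABLE posts REPLICAS = 3;
```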
This will automatically copy your data across slices and nodes so that the desired number of replicas are created as part of the DDL operation. You can view the number of replicas that a table has via SHOW CREATE TABLE.
You can also create replicas in an asynchronous fashion by using the ALTER TABLE command with the LAZY PROTECT option.
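A sketch of the statement, assuming a hypothetical posts table (the exact placement of the LAZY PROTECT keywords is an assumption):

```sql
-- Hypothetical: return immediately and let the Rebalancer create the
-- additional replicas in the background.
ALTER TABLE posts LAZY PROTECT REPLICAS = 3;
```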
Lazy protect allows the SQL command to return immediately while the Rebalancer performs the reprotection in the background. For larger tables, this can result in a longer reprotection time, but with less immediate impact on the database. If no value for the number of replicas is supplied, the default value is applied.
While the protection process is in progress, the logs may contain ALERT PROTECTION_LOST WARNING messages. These warnings stop once the reprotect process is complete.
To optimize performance of bulk data load:
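The specific steps are not listed here. One common pattern, offered purely as an assumption, is to load with a single replica and lazily restore protection afterwards:

```sql
-- Hypothetical pattern: keep one replica during the bulk load to
-- halve write volume, then restore the desired replica count in the
-- background afterwards.
ALTER TABLE posts LAZY PROTECT REPLICAS = 1;
-- ... perform the bulk load ...
ALTER TABLE posts LAZY PROTECT REPLICAS = 2;
```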
To track progress of a lazy protect operation, see Managing the Rebalancer.
Introduced in v6.0, Replicas = Allnodes offers the option to specify that every node should maintain a full copy of a table.
Replicas = Allnodes is best used for tables that are:

- small
- written infrequently
- frequently joined to larger tables

Since a copy of an Allnodes table is maintained on every node, writes are expensive: each write must be applied on every node. In return, ClustrixDB is able to use the local copy of an Allnodes table on each node to minimize the cost of joining it to larger tables.

Replicas = Allnodes should not be used for:

- large tables
- tables that are written frequently
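A sketch of the declaration, using a hypothetical small lookup table:

```sql
-- Hypothetical: maintain a full copy of a small, rarely written
-- lookup table on every node to make joins against it local.
CREATE TABLE countries (
    country_id INT NOT NULL,
    name       VARCHAR(64),
    PRIMARY KEY (country_id)
) REPLICAS = ALLNODES;
```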