Child pages
  • Distributed Aggregate Processing

This is documentation for a previous version of ClustrixDB. Documentation for the latest version can be found here

Skip to end of metadata
Go to start of metadata

The Clustrix database has a unique query evaluation model that attempts to parallelize complex query execution as much as possible. Our system performs query evaluation where the relevant data resides. In essence we bring the query to the data. As part of the v5.0 version of Clustrix, the evaluation model has been extended to distribute aggregation of data. This features is called Distributed Aggregates and is built into the Query Optimizer.

How it Works

When a query involves aggregate clauses (such as GROUP BY, DISTINCT, etc.), Clustrix can distribute computation of the aggregate across all nodes with relevant data. Each node computes a local aggregate, then forwards the results to the GTM node. Clustrix uses a cost-based query optimizer, and the relative cost of each possible plan is used to determine the lowest execution plan. Not all queries with aggregates with benefit from distribution, so a plan that leverages Distributed Aggregates will be considered by the optimizer, but may not be ultimately utilized.

Types of Queries that Benefit from Distributed Aggregates

The Clustrix Planner automatically chooses a plan based on cost of each possible execution path. Generally, distributed aggregates is helpful for queries with a large reduction factor, meaning that the query must examine a large number of rows, but returns only a few number of rows. Queries that summarize data and perform aggregation are good candidates for distributed aggregates, and the best way to determine whether a specific query will benefit is to test that query along with a representative query set. 

Configuring Distributed Aggregates

The following global variable controls whether the planner can use the distributed aggregate optimization. Please exercise caution whenever modifying global variables. 

mysql> show global variables like 'sierra_allow_distributed_hash_aggregates';
+------------------------------------------+-------+
| Variable_name                            | Value |
+------------------------------------------+-------+
| sierra_allow_distributed_hash_aggregates | true  |
+------------------------------------------+-------+
1 row in set (0.00 sec) 

The following example demonstares a plan which utilizes distributed aggregates. Notice the partial aggregate step.

mysql> explain select sum(id) from x;
+----------------------------------------------------+-----------+-----------+
| Operation                                          | Est. Cost | Est. Rows |
+----------------------------------------------------+-----------+-----------+
| hash_aggregate_combine expr0 := sum((0 . "expr0")) |     64.13 |     10.67 |
|   hash_aggregate_partial expr0 := sum((1 . "id"))  |     49.20 |     10.67 |
|     index_scan 1 := x.__base_x                     |     16.40 |     10.67 |
+----------------------------------------------------+-----------+-----------+
3 rows in set (0.01 sec)

The  hash_aggregate_partial instruction gets pushed down to each node, while the hash_aggregate_combine will perform the final aggregation step to produce the end result set.

Note: There is an additional global variable for distributed aggregates. The global variable for sierra_allow_distributed_stream_aggregates is disabled by default and should remain false.