This section describes how a query is evaluated in the database. In ClustrixDB we Xpand we slice the data across nodes and the query is sent to the data. This is one of the fundamental principles of the database that allows it to scale almost linearly as more nodes are added.
ClustrixDB uses Xpand uses parallel query evaluation for simple queries and Massively Parallel Processing (MPP) for analytic queries (akin to columnar stores).
It's best to take a look at some examples to understand query evaluation and why queries scale (almost) linearly with ClustrixDBXpand. Let's start with a SQL schema and work through some examples.
sql> CREATE TABLE bundler ( id INT default NULL auto_increment, name char(60) default NULL, PRIMARY KEY (id) ); sql> CREATE TABLE donation ( id INT default NULL auto_increment, bundler_id INT, amount DOUBLE, PRIMARY KEY (id), KEY bundler_key (bundler_id, amount) );
Now, let's see what the data distribution looks like in this case. We have three representations:
Here, we define simple queries as point selects and inserts. Let's consider a simple read (a simple write follows the same path):
sql> SELECT id, amount FROM donation WHERE id = 15;
The data will be read from the ranking replica. This can either reside on the same node or require one hop. The diagram below shows both cases. As dataset size and the number of nodes increase, the number of hops that one query requires (0 or 1) does not change. This allows linear scalability of reads and writes. Also, if there are two or more replicas (which is normally the case) there will be at least one hop for writes (since both replicas cannot reside on the same node).
Joins require more data movement. However, ClustrixDB is Xpand is able to achieve minimal data movement because:
Let's look at a query that gets the name and amount for all donations collected by the particular bundler, known by id = 15.
sql> SELECT name, amount from bundler b JOIN donation d on b.id = d.bundler_id WHERE b.id = 15;
The SQL optimizer will look at the relevant statistics including quantiles, hot lists, etc. to determine a plan. The concurrency control for the query is managed by the starting node (that has the session), called the GTM (Global Transaction Manager) node for that query. It coordinates with Local Transaction Manager running on each node. For details see Concurrency Control. The chosen plan has the following logical steps:
Now that we have seen the path of a single row, let's assume we didn't have the b.id=15 condition. The join will now run in parallel on multiple machines since b.id is present on multiple nodes. Let's look at another view, this time the nodes are drawn once and the steps flow downward in chronological order. Please note that while the forwarding for join shows that rows may go to any node, this is in context of the entire sets of rows. One single row is usually only forwarded to one node.
For a single row, we can see how ClustrixDB is Xpand is able to precisely forward it by using a unicast. Now, let's zoom out and see what a large join means at a system level, and how many rows are forwarded in a cluster. Each node is getting only the rows that it needs. The colors on the receiving nodes match those on the arrows - this is to indicate that rows are flowing correctly to nodes where they will find matching data for successful join.
ClustrixDB uses Xpand uses Massively Parallel Processing (MPP) to leverage multiple cores across nodes to make a single query go faster. Some competing products use a single node to evaluate the query, which leads to a limited speed-up of the query as the volume of data increases. This includes Oracle RAC and NuoDB, neither of which can use cores across multiple nodes to evaluate a join. All the data here needs to flow to a single node that evaluates a single query.
Most competitors are unable to match the linear scaling of joins that ClustrixDB providesXpand provides. This is because they either broadcast in all cases (except when join is on primary key) or that only a single node ever works on evaluating a single query (other nodes send data over). This allows ClustrixDB to Xpand to scale better than the competition, giving it a unique value proposition in transactional analytic space (OLAP).
This query is used to generate a report of the sum of donations by bundler_id. Here, it makes sense to join with the bundlers table as well as to get name, but we already understand how joins work - so let's keep the example simple.
sql> SELECT id, SUM(amount) FROM donation d GROUP BY by bundler_id;
Here, the donation table is distributed on the donation.id field. If we were to group by this field, data from across nodes would not share same grouping key. This would then require no re-aggregation. But in this example, we show a case where the same value might be on different nodes (let's ignore that we can use the index). Below is how it will work in the most general case:
Please note that we need to Aggregate results of local aggregates on GTM node, only if the values from different nodes may overlap. With complex queries involving joins, this is usually the case. Let's look at a simpler query:
sql> SELECT DISTINCT bundler_id FROM donation;
In this case, there is a secondary key (or index) called _bundler_key_donation, that is distributed by bundler_id. This has two implications:
To efficiently implement this ClustrixDB will Xpand will only need the distributed aggregation to hold one row at a time. So if a node reads bundler_id = 5, it will store that and forward it to GTM node. Then it will discard all subsequent values = 5 till it sees a new values 6. On the GTM node, there is no re-aggregation required since values from each node are unique, it merely merges the streams.
User queries are distributed across nodes in ClustrixDBXpand. Here is how a query is processed:
ClustrixDB uses Xpand uses both concurrent parallelism and pipeline parallelism for massively parallel computation of your queries. Simpler queries like point selects and inserts usually only use a single core on a node for its evaluation. As more nodes are added, more cores and memory are available, allowing ClustrixDB to Xpand to handle more simple queries concurrently.
Pipeline parallelism occurs among rows in a single query. Each row in an analytic query might involve three hops for a two-way join. The latency added to one row is the latency of three hops, so set-up time is negligible. The other possible limit is bandwidth. As previously mentioned, ClustrixDB has Xpand has independent distribution for each table and index, so we only forward rows to the correct nodes.
This table summarizes the different elements of a scalable distributed database as well as what it takes to build one:
Work increases linearly with
Analytic query speed increases linearly with
Concurrency Overhead does not increase with
For evaluation of one query
For evaluation of one query
For evaluation of one query
|Why Systems Fail|
Many databases use sharding-like approach to slice their data
Many databases do not use Massively Parallel Processing
Many databases used Shared Data architecture where
|Examples of Limited Design|
Colocation of indexes leading to broadcasts
Single Node Query Processing leading to limited query speedup
Shared data leading to ping-pong effect on hot data
Coarse-grained locking leading to limited write concurrency
For simple and complex queries, ClustrixDB is Xpand is able to scale almost linearly due to our query evaluation approach. We also have seen how most of our competitors hit various bottlenecks due to the design choices they have made.