Distributed database systems fall into two major categories of data storage architectures: (1) shared disk and (2) shared nothing.
Shared disk architecture (all nodes coordinate access to one central storage system) vs. shared nothing architecture (each node owns and manages its own local storage).
Shared disk approaches suffer from several architectural limitations inherent in coordinating access to a single central resource. In such systems, as the number of nodes in the cluster increases, so does the coordination overhead. While some workloads can scale well with shared disk (e.g. small working sets dominated by heavy reads), most workloads tend to scale very poorly -- especially workloads with significant write load.
ClustrixDB uses the shared nothing approach because it's the only known approach that allows for large-scale distributed systems.
In order to build a scalable shared-nothing database system, one must solve two fundamental problems:

1. Split a large data set across a number of individual nodes.
2. Evaluate queries efficiently against that distributed data set.

This section describes how ClustrixDB distributes data; query evaluation is covered in the Evaluation Model section.
Within shared nothing architectures, most databases fall into one of the following categories:

1. Table-level distribution. Each table is assigned in its entirety to a node (or to a replicated pair of nodes). This is the coarsest approach: the system can never scale beyond what a single node can hold for its largest table.
2. Single-key-per-table distribution (also known as sharding). Each table is split by a single distribution key, and all of the table's indexes are co-located with that key.
3. Independent index distribution. Each index has its own distribution. This is the approach ClustrixDB takes.
ClustrixDB has a fine-grained approach to data distribution. The following table summarizes the basic concepts and terminology used by our system. Notice that unlike many other systems, ClustrixDB uses a per-index distribution strategy.
ClustrixDB Distribution Concepts

Concept | Description
---|---
Representation | Each table contains one or more indexes. Internally, ClustrixDB refers to these indexes as representations of the table. Each representation has its own distribution key (a.k.a. a partition key or a shard key), meaning that ClustrixDB uses multiple independent keys to slice the data in one table. This is in contrast to most other distributed database systems, which use a single key to slice the data in one table. Each table must have a primary key. If the user does not define a primary key, ClustrixDB will automatically create a hidden primary key. The base representation contains all of the columns within the table, ordered by the primary key. Non-base representations contain a subset of the columns within the table. |
Slice | ClustrixDB breaks each representation into a collection of logical slices using consistent hashing. By using consistent hashing, ClustrixDB can split individual slices without having to rehash the entire representation. |
Replica | ClustrixDB maintains multiple copies of data for fault tolerance and availability. There are at least two physical replicas of each logical slice, stored on separate nodes. ClustrixDB supports configuring the number of replicas per representation. For example, a user may require three replicas for the base representation of a table, and only two replicas for the other representations of that table. |
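To make the relationship between these concepts concrete, here is a minimal sketch of the hierarchy they form: a table owns representations, a representation owns slices, and a slice owns replicas. The class and field names are illustrative assumptions, not ClustrixDB internals.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative object model only; these names are not ClustrixDB internals.

@dataclass
class Replica:
    node: int                           # node holding this physical copy of a slice

@dataclass
class Slice:
    hash_range: Tuple[int, int]         # range of the 64-bit hash space this slice owns
    replicas: List[Replica] = field(default_factory=list)

@dataclass
class Representation:
    name: str                           # e.g. "base", "k1", "k2"
    distribution_key: Tuple[str, ...]   # columns hashed to choose a slice
    slices: List[Slice] = field(default_factory=list)

@dataclass
class Table:
    name: str
    representations: List[Representation] = field(default_factory=list)

# A table like `example` below carries three representations,
# each sliced and replicated independently of the others.
```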
Consider the following example:
```sql
sql> CREATE TABLE example (
         id bigint PRIMARY KEY,
         col1 integer,
         col2 integer,
         col3 varchar(64),
         KEY k1 (col2),
         KEY k2 (col3, col1)
     );
```
We populate our table with the following data:
Table: example

id | col1 | col2 | col3
---|---|---|---
1 | 16 | 36 | january |
2 | 17 | 35 | february |
3 | 18 | 34 | march |
4 | 19 | 33 | april |
5 | 20 | 32 | may |
ClustrixDB will organize the above schema into three representations: the base representation for the main table, organized by the primary key, and two more representations, one for each secondary index, organized by their index keys.
The layouts below show each representation along with its ordering key. Note that the representations for the secondary indexes include the primary key columns.
Table: example

base representation, ordered by primary key (id): all columns of the table, identical to the rows shown above.

k1 representation, ordered by index (col2):

col2 | id
---|---
32 | 5
33 | 4
34 | 3
35 | 2
36 | 1

k2 representation, ordered by index (col3, col1):

col3 | col1 | id
---|---|---
april | 19 | 4
february | 17 | 2
january | 16 | 1
march | 18 | 3
may | 20 | 5
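As a rough sketch of the idea (not ClustrixDB code), the snippet below derives the k1 and k2 row sets from the base rows by projecting the indexed columns plus the primary key and ordering by the index columns; the helper name `representation` is made up for illustration.

```python
# Simplified illustration: derive secondary-index representations from base rows.
base_rows = [
    {"id": 1, "col1": 16, "col2": 36, "col3": "january"},
    {"id": 2, "col1": 17, "col2": 35, "col3": "february"},
    {"id": 3, "col1": 18, "col2": 34, "col3": "march"},
    {"id": 4, "col1": 19, "col2": 33, "col3": "april"},
    {"id": 5, "col1": 20, "col2": 32, "col3": "may"},
]

def representation(rows, index_cols, pk_cols=("id",)):
    """Project the indexed columns plus the primary key, ordered by the index columns."""
    cols = list(index_cols) + [c for c in pk_cols if c not in index_cols]
    return sorted(tuple(row[c] for c in cols) for row in rows)

k1 = representation(base_rows, ("col2",))           # rows of (col2, id)
k2 = representation(base_rows, ("col3", "col1"))    # rows of (col3, col1, id)

print(k1)  # [(32, 5), (33, 4), (34, 3), (35, 2), (36, 1)]
print(k2)  # [('april', 19, 4), ('february', 17, 2), ('january', 16, 1), ...]
```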
ClustrixDB will then split each representation into one or more logical slices. When slicing, ClustrixDB uses the following rules:

* Each slice owns a range of the hash space of its representation's distribution key (consistent hashing is described in more detail below).
* Each representation is sliced independently, so the representations of a single table can have different numbers of slices.
* Slices are split automatically as they grow: when a slice exceeds a maximum size, the rebalancer breaks it into smaller slices (see the re-slicing example below).
In this example, the base representation is split into three slices, the k1 representation into two slices, and the k2 representation into four slices. The rows of each representation are assigned to its slices by hashing that representation's distribution key.
To ensure fault tolerance and availability, ClustrixDB maintains multiple copies of data. ClustrixDB uses the following rules to place replicas (copies of slices) within the cluster:

* Each logical slice has at least two physical replicas; the replica count is configurable per representation.
* Replicas of the same slice are always placed on different nodes, so the failure of a single node never takes a slice offline.
* The rebalancer distributes replicas across the cluster to balance both data capacity and data access load.
Sample data distribution within a 4-node cluster: the replicas of the base, k1, and k2 slices are spread across nodes 1 through 4, with the replicas of each slice placed on different nodes.
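As a rough sketch of the placement rules above, the toy function below assigns two replicas per slice to distinct, least-loaded nodes so that capacity stays balanced. The function name, slice names, and sizes are assumptions for illustration; the real rebalancer also accounts for data access load, which is not modeled here.

```python
# Toy replica placement following the rules above: two replicas per slice,
# never on the same node, always preferring the least-loaded nodes so that
# data capacity stays balanced across the cluster.
def place_replicas(slice_sizes_mb, num_nodes, replicas_per_slice=2):
    load = {n: 0 for n in range(1, num_nodes + 1)}   # MB stored per node
    placement = {}
    for slice_name, size in slice_sizes_mb.items():
        # Picking distinct least-loaded nodes guarantees that the replicas
        # of one slice never land on the same node.
        chosen = sorted(load, key=load.get)[:replicas_per_slice]
        placement[slice_name] = chosen
        for n in chosen:
            load[n] += size
    return placement, load

slices = {"base-1": 768, "base-2": 700, "base-3": 650,
          "k1-1": 120, "k1-2": 110,
          "k2-1": 90, "k2-2": 95, "k2-3": 80, "k2-4": 85}
placement, load = place_replicas(slices, num_nodes=4)
print(placement)   # e.g. {'base-1': [1, 2], 'base-2': [3, 4], ...}
print(load)        # per-node totals stay roughly even
```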
ClustrixDB uses consistent hashing for data distribution. Consistent hashing allows ClustrixDB to dynamically redistribute data without having to rehash the entire data set.
ClustrixDB hashes each distribution key to a 64-bit number, then divides that hash space into ranges, each of which is owned by a specific slice. The table below illustrates how consistent hashing assigns specific keys to specific slices.
Slice | Hash Range | Key Values |
---|---|---|
1 | min-100 | H, Z, J |
2 | 101-200 | A, F |
3 | 201-max | X, K, R |
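The sketch below shows the general mechanism: hash the distribution key to a 64-bit value, then find the slice whose range covers it. The hash function, range boundaries, and helper names are placeholders, not the actual hash or ranges ClustrixDB uses (the small min-100/101-200 ranges in the table above are likewise just for readability).

```python
import bisect
import hashlib

MAX64 = 2**64 - 1

# Placeholder 64-bit hash of a distribution key (not the hash ClustrixDB uses).
def hash64(key):
    return int.from_bytes(hashlib.sha1(repr(key).encode()).digest()[:8], "big")

# Each slice owns a contiguous range of the 64-bit hash space.
# Entries are (inclusive_upper_bound, slice_id), sorted by upper bound.
slice_ranges = [
    (MAX64 // 3, 1),
    (2 * MAX64 // 3, 2),
    (MAX64, 3),
]

def slice_for(key):
    h = hash64(key)
    upper_bounds = [upper for upper, _ in slice_ranges]
    return slice_ranges[bisect.bisect_left(upper_bounds, h)][1]

print(slice_for(("H",)))   # the slice whose range covers hash64(("H",))
print(slice_for(("A",)))   # a different key may land in a different slice
```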
ClustrixDB then assigns slices to the available nodes in the cluster to balance data capacity and data access.
As the dataset grows, ClustrixDB will automatically and incrementally re-slice the dataset one or more slices at a time. We currently base our re-slicing thresholds on data set size. If a slice exceeds a maximum size, the system will automatically break it up into two or more smaller slices.
For example, imagine that one of our slices grew beyond the preset threshold:
Slice | Hash Range | Key Values | Size |
---|---|---|---|
1 | min-100 | H, Z, J | 768MB |
2 | 101-200 | A, F, U, O, S | 1354MB (too large) |
3 | 201-max | X, K, R, Y | 800MB |
Our rebalancer process will automatically detect the above condition and schedule a slice-split operation. The system will break up the hash range into two new slices:
Slice | Hash Range | Key Values | Size |
---|---|---|---|
1 | min-100 | H, Z, J | 768MB |
4 | 101-150 | A, F | 670MB |
5 | 151-200 | U, O, S | 684MB |
3 | 201-max | X, K, R, Y | 800MB |
Note that the system does not have to modify slices 1 and 3. Our technique allows very large data reorganizations to proceed in small chunks.
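A minimal sketch of that split decision, assuming a made-up size threshold and simplified hash ranges: an oversized slice's range is cut in half and handed to two new slices, while every other slice is left alone. This is an illustration of the technique, not the rebalancer's actual code.

```python
# Toy slice split: when a slice exceeds the size threshold, its hash range
# is cut in half and handed to two new slices; all other slices are untouched.
MAX_SLICE_MB = 1024   # illustrative threshold, not a ClustrixDB default

def split_oversized(slices, next_id):
    """slices: list of dicts with 'id', 'low', 'high', and 'size_mb'."""
    result = []
    for s in slices:
        if s["size_mb"] <= MAX_SLICE_MB:
            result.append(s)                      # no change, no data movement
            continue
        mid = (s["low"] + s["high"]) // 2
        # Sizes here are estimates; the real sizes are known once rows are rehashed.
        result.append({"id": next_id,     "low": s["low"], "high": mid,       "size_mb": s["size_mb"] // 2})
        result.append({"id": next_id + 1, "low": mid + 1,  "high": s["high"], "size_mb": s["size_mb"] // 2})
        next_id += 2
    return result

slices = [
    {"id": 1, "low": 0,   "high": 100, "size_mb": 768},
    {"id": 2, "low": 101, "high": 200, "size_mb": 1354},   # exceeds the threshold
    {"id": 3, "low": 201, "high": 255, "size_mb": 800},    # "max" shrunk for the toy example
]
print(split_oversized(slices, next_id=4))
# Slice 2 is replaced by slice 4 (101-150) and slice 5 (151-200); slices 1 and 3 are unchanged.
```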
It's easy to see why table-level distribution provides very limited scalability. Imagine a schema dominated by one or two very large tables (billions of rows). Adding nodes to the system does not help in such cases since a single node must be able to accommodate the entire table.
Why does ClustrixDB use independent index distribution rather than a single-key approach? The answer is two-fold:

1. Real workloads access a table through more than one key. With a single distribution key, any query that does not include that key must be broadcast to every node, which limits scalability.
2. Keeping every index inside the database, each with its own distribution, lets the system maintain the indexes automatically with full ACID guarantees, rather than pushing index maintenance onto the application.
Let's examine a specific use case to compare and contrast the two approaches. Imagine a bulletin board application where different topics are grouped by threads, and users are able to post into different topics. Our bulletin board service has become popular, and we now have billions of thread posts, hundreds of thousands of threads, and millions of users.
Let's also assume that the primary workload for our bulletin board consists of the following two access patterns:

1. Retrieve all posts for a particular thread, in post id order.
2. For a specific user, retrieve the last 10 posts by that user.
We could imagine a single large table that contains all of the posts in our application with the following simplified schema:
```sql
-- Example schema for the posts table.
sql> CREATE TABLE thread_posts (
         post_id bigint,
         thread_id bigint,
         user_id bigint,
         posted_on timestamp,
         contents text,
         primary key (thread_id, post_id),
         key (user_id, posted_on)
     );

-- Use case 1: Retrieve all posts for a particular thread in post id order.
-- Desired access path: primary key (thread_id, post_id)
sql> SELECT * FROM thread_posts WHERE thread_id = 314 ORDER BY post_id;

-- Use case 2: For a specific user, retrieve the last 10 posts by that user.
-- Desired access path: key (user_id, posted_on)
sql> SELECT * FROM thread_posts WHERE user_id = 546 ORDER BY posted_on DESC LIMIT 10;
```
With the single-key approach, we are faced with a dilemma: which key do we choose to distribute the posts table? As the table below shows, we cannot choose a single key that results in good scalability across both use cases.
Distribution Key | Use case 1: posts in a thread | Use case 2: top 10 posts by user |
---|---|---|
thread_id | Queries that include the thread_id will perform well. Requests for a specific thread get routed to a single node within the cluster. When the number of threads and posts increases, we simply add more nodes to the cluster to add capacity. | Queries that do not include the thread_id, like the query for last 10 posts by a specific user, must evaluate on all nodes that contain the thread_posts table. In other words, the system must broadcast the query request because the relevant post can reside on any node. |
user_id | Queries that do not include the user_id result in a broadcast. As with the thread_id key sample (use case 2), we lose system scalability when we have to broadcast. | Queries that include a user_id get routed to a single node. Each node will contain an ordered set of posts for a user. The system can scale by avoiding broadcasts. |
One possibility with such a system could be to maintain a separate table that includes the user_id and posted_on columns. We can then have the application manually maintain this index table.
However, that means the application must now issue multiple writes and accept responsibility for data consistency between the two tables. And what if we need to add more indexes? The approach simply doesn't scale; automatic index management is one of the core advantages of a database.
ClustrixDB will automatically create independent distributions that satisfy both use cases. The DBA can distribute the base representation (primary key) by thread_id and the secondary index by user_id, and the system will automatically manage both the table and its secondary indexes with full ACID guarantees.
For a more detailed explanation, consult our Evaluation Model section.
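To illustrate why per-index distribution avoids the broadcast problem, the sketch below routes each bulletin-board query using the representation whose distribution key the query supplies: the base representation (thread_id) for use case 1 and the user_id secondary representation for use case 2. The node count, hashing, and representation names are hypothetical.

```python
import hashlib

NODES = 4   # hypothetical cluster size

def node_for(key):
    # Stand-in for hashing a distribution key and resolving the owning
    # slice/node via the consistent-hash ranges described earlier.
    h = int.from_bytes(hashlib.sha1(repr(key).encode()).digest()[:8], "big")
    return h % NODES

def route(query_columns):
    """Pick the representation whose distribution key the query supplies."""
    if "thread_id" in query_columns:
        # Use case 1: base representation distributed by thread_id -> one node.
        return [node_for(("thread_posts.base", query_columns["thread_id"]))]
    if "user_id" in query_columns:
        # Use case 2: secondary representation distributed by user_id -> one node.
        return [node_for(("thread_posts.k_user", query_columns["user_id"]))]
    # No usable distribution key: the query must be broadcast to every node.
    return list(range(NODES))

print(route({"thread_id": 314}))     # a single node
print(route({"user_id": 546}))       # a single node
print(route({"posted_on": "2024"}))  # every node (broadcast)
```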
Unlike other systems that use master-slave pairs for data fault tolerance, ClustrixDB distributes the data in a more fine-grained manner as explained in the above sections. Our approach allows ClustrixDB to increase cache efficiency by not sending reads to secondary replicas.
Consider the following example. Assume a cluster of 2 nodes and 2 slices A and B, with secondary copies A' and B'.
Read from both copies | Read from primary copy only
---|---
If we allow reads from both primary and secondary replicas, then each node has to cache the contents of both A and B. Assuming 32GB of cache per node, the total effective cache of the system is 32GB. | By limiting reads to the primary replica only, node 1 is responsible for A and node 2 for B. Assuming 32GB of cache per node, the total effective cache footprint becomes 64GB, double that of the opposing model.
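Written out as simple arithmetic (using the same assumed 32GB per-node cache), the difference between the two read policies looks like this:

```python
# Effective cache comparison for the 2-node, 2-slice example above.
cache_per_node_gb = 32
nodes = 2

# Reads served by any replica: every node ends up caching the hot data of
# every slice it stores (both A and B), so the caches duplicate each other.
effective_cache_read_any_gb = cache_per_node_gb              # ~32 GB of distinct data

# Reads served by the primary replica only: node 1 caches A, node 2 caches B,
# so the caches hold disjoint data and their capacities add up.
effective_cache_primary_only_gb = cache_per_node_gb * nodes  # 64 GB of distinct data

print(effective_cache_read_any_gb, effective_cache_primary_only_gb)
```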