The most common methods of loading data onto Xpand are:

  1. Take a mysqldump from an existing MySQL instance, then import the dump with clustrix_import. This is the fastest way to get data onto Xpand.
  2. Use LOAD DATA INFILE to import CSV files. Xpand performs best when the data is pre-sliced and the import can be done in parallel.

This is often followed by setting up replication with Xpand as a slave of the MySQL instance. Once the Xpand replication slave has caught up, the application can be cut over to Xpand and the MySQL instance retired.

mysqldump and clustrix_import

Using mysqldump 

Ensure a consistent transactional backup suitable for setting up replication

When performing the mysqldump from your MySQL server, be sure to provide the proper arguments, particularly --single-transaction and --master-data. Note that since MyISAM tables are not transactional, data in those tables can continue to change while mysqldump runs. To get a consistent dump of MyISAM tables, it is necessary to disable writes entirely or lock all tables for the duration of the dump. Since this is generally not feasible on a production server, it may be necessary to dump from an existing slave instead, where replication can be stopped for the duration of the dump.
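For example, a dump suitable for later configuring replication might look like the following (host, user, and database names are placeholders):

```shell
# --single-transaction takes a consistent snapshot of transactional tables
# without locking; --master-data=2 records the binlog file and position as a
# comment in the dump, for use when later configuring Xpand as a slave.
mysqldump --single-transaction --master-data=2 \
    -h mysql-host -u dumpuser -p \
    --databases mydb > mydb_dump.sql
```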

Ensure the dump completes successfully

To avoid having mysqldump interrupted by a network connection reset or similar issue, Xpand Support recommends running mysqldump inside the screen utility. If you do any amount of serious work at the command line, screen is indispensable. Regardless of whether you use screen or some other method to invoke and monitor the dump, always check the tail of the dump file to make sure the dump completed successfully: a complete dump ends with statements that restore session variables. If you instead see the middle of a multi-row insert, your dump was interrupted or the file was otherwise truncated. Either way, you're unlikely to be pleased with the results of restoring such a file.
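A typical session might look like this (file and session names are illustrative); the dump runs inside screen so a dropped SSH connection does not kill it, and the tail of the file is checked afterward:

```shell
# start a detachable screen session and run the dump inside it
screen -S dump
mysqldump --single-transaction --master-data=2 -h mysql-host -u dumpuser -p \
    --databases mydb > mydb_dump.sql
# detach with Ctrl-a d; reattach later with: screen -r dump

# a healthy dump ends with session-variable restores,
# not the middle of a multi-row INSERT
tail -n 5 mydb_dump.sql
```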

Don't bother with the mysql database (users and permissions should be copied with clustrix_clone_users)

Avoid dumping internal MySQL databases such as mysql, which are of no use to Xpand. Xpand will dutifully create a mysql database and restore its contents, but unlike on a MySQL server, they will have no effect on the functioning of the system. In particular, users and permissions cannot be propagated this way. See Migrating User Permissions for information on how to use clustrix_clone_users.


Standard MySQL practice is to import a mysqldump by redirecting it to the mysql client on the shell command line, or by using the source command within the mysql client. Note that this method can result in very long import times, as it fails to take advantage of Xpand's parallel processing.
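For reference, the standard single-threaded methods look like this (host, database, and file names are placeholders):

```shell
# 1. redirect the dump into the mysql client from the shell:
mysql -h xpand-host mydb < mydb_dump.sql

# 2. or, from within the mysql client:
#    mysql> source /path/to/mydb_dump.sql
```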

clustrix_import is a Python script that reads mysqldump output and loads the data into an Xpand cluster in a multi-threaded fashion. It can be run directly on the cluster (in which case the dump should be staged in the /data/clustrix directory, which has plenty of space), or from any Linux client with Python 2.4 and MySQLdb (the MySQL driver for Python) 1.2.1.


For additional information, please see clustrix_import.

Loading Data Without clustrix_import

If clustrix_import cannot be used to import your data, you can take some proactive measures to ensure efficient data population. While the Rebalancer can ultimately rectify just about any problem created during the initial data load, poor slicing and distribution can result in a much longer import, and it may take quite some time for the Rebalancer to achieve optimal data distribution.

Pre-slicing Tables

number of slices = number of nodes

When populating large tables (i.e., 10GB or larger), it is advantageous to set the table's slice count when the table is created, before loading data. This avoids the problem of "racing the Rebalancer", wherein the Rebalancer recognizes that the table needs more slices and begins the splitting process while data is still being loaded, lengthening the load. If you can estimate the size of the data you are importing (potentially by importing some number of rows and checking the size in system.table_sizes), a good rule of thumb is a little more than 1 slice per 1GB. Generally, you want at least one slice per node for optimal load distribution; setting the global variable hash_dist_min_slices to the number of nodes will achieve the same result.

To set the slice count at table creation time, simply append SLICES=N to the end of your CREATE statement. You can also reslice the table with ALTER TABLE foo SLICES=N. Note that in both cases, the slicing for the base representation and all indexes is set to N.
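As a sketch (the table definition, slice counts, and host name are illustrative; the SLICES syntax is as described above):

```shell
mysql -h xpand-host mydb <<'SQL'
-- set the slice count up front, before loading data
CREATE TABLE events (
    id BIGINT PRIMARY KEY,
    payload VARCHAR(255)
) SLICES = 12;

-- or reslice an existing table (base representation and all indexes)
ALTER TABLE events SLICES = 16;

-- alternatively, default new tables to at least one slice per node,
-- e.g. on a 4-node cluster:
SET GLOBAL hash_dist_min_slices = 4;
SQL
```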

Pre-slicing for tables > 100GB

For very large tables (larger than 100GB), you may wish to independently set the slice count for the base (primary) representation, which contains all columns of each row, and for the indexes, which contain the columns included in the index as well as the column(s) of the primary key. Generally, indexes will require fewer slices than the base representation, since their tuples are much narrower; how many fewer depends on how wide the full table is (particularly how many varchar or blob columns it has), and whether an index includes such a wide column. Instead of estimating based on column count and size, you may also load a small but representative portion of your data into Xpand, and then use the system.index_sizes table to ascertain the relative sizes of the base representation and indexes.
You can set slicing for individual indexes by including SLICES=N within the index definition itself. Place the SLICES=N clause before the comma that separates multiple indexes, or before the closing parenthesis of the last index.
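For example, a large table whose indexes need fewer slices than the base representation might be declared like this (table, column, and index names and all slice counts are illustrative):

```shell
mysql -h xpand-host mydb <<'SQL'
CREATE TABLE clicks (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    url VARCHAR(1024),
    clicked_at DATETIME,
    KEY idx_user (user_id) SLICES = 32,    -- before the separating comma
    KEY idx_time (clicked_at) SLICES = 32  -- before the closing parenthesis
) SLICES = 128;
SQL
```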


See slices for further information.

Anticipating table growth

In addition to the guidelines above, consider also how much your table is expected to grow over time. You may wish to slice your tables into 0.5GB slices if you anticipate rapid table growth.

Parallelize Data Load

Besides slicing and distribution, the main factor in import speed is the degree of parallelism. A single-threaded import process fails to take advantage of Xpand's parallel architecture, and may run even more slowly than on a MySQL instance. Consider how the data load process could be divided to increase parallelism. For example:

  • For LOAD DATA INFILE, you can split files into smaller chunks and run them in parallel.
  • If your application loads data into the database directly, consider whether this load can be performed in a multi-threaded fashion, with each thread connecting as a separate session via a load balancer, distributing the front-end connections across the cluster's nodes.
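The file-splitting approach can be sketched as follows. The sample data, chunk size, and the host/table in the generated commands are illustrative; the LOAD DATA INFILE commands are printed rather than executed here, but in practice each would be launched in the background or from a separate client session:

```shell
# generate a sample CSV and split it into fixed-size chunks so that each
# chunk can be loaded by a separate session
mkdir -p /tmp/load_chunks
seq 1 1000 | awk '{print $1 ",val" $1}' > /tmp/load_chunks/data.csv

# 250 lines per chunk -> 4 files: chunk_aa .. chunk_ad
split -l 250 /tmp/load_chunks/data.csv /tmp/load_chunks/chunk_

# print one LOAD DATA INFILE command per chunk (host/table are assumptions)
for f in /tmp/load_chunks/chunk_*; do
    echo "mysql -h xpand-host -e \"LOAD DATA INFILE '$f' INTO TABLE mydb.t FIELDS TERMINATED BY ','\" &"
done
```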

Use Multi-Row Inserts

Where possible, aggregate single-row insert statements into larger multi-row statements. Xpand handles multi-row statements more efficiently, particularly since they reduce the per-row transactional overhead. Combining parallelism with multi-row inserts should provide optimal data load performance. (This is essentially what clustrix_import does.)
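As a sketch of the idea, this awk script batches rows from a CSV file into multi-row INSERT statements (the table name, columns, sample data, and batch size of 3 are all illustrative):

```shell
printf '1,a\n2,b\n3,c\n4,d\n5,e\n' > /tmp/rows.csv

# batch every 3 rows into one INSERT ... VALUES (...), (...), (...);
awk -F, -v q="'" '
    NR % 3 == 1 {
        if (NR > 1) print ";"
        printf "INSERT INTO t (id, v) VALUES (%s, %s%s%s)", $1, q, $2, q
        next
    }
    { printf ", (%s, %s%s%s)", $1, q, $2, q }
    END { print ";" }
' /tmp/rows.csv > /tmp/multirow.sql

cat /tmp/multirow.sql
```

With the sample input above, this emits two statements: one with three value tuples and one with the remaining two.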