
mysqldump and clustrix_import

The most common method of loading data onto ClustrixDB is to:

  1. Take a mysqldump from an existing MySQL instance
  2. Import the dump with clustrix_import

This is often followed by setting up replication with ClustrixDB as a slave of the MySQL instance. Once the ClustrixDB replication slave has caught up, the application can be cut over to ClustrixDB and the MySQL instance retired.

mysqldump Best Practices

Ensure a consistent transactional backup suitable for setting up replication

When performing the mysqldump from your MySQL server, ensure you provide the proper arguments, particularly --single-transaction and --master-data. Note that since MyISAM tables are not transactional, data in those tables can continue to change while a mysqldump runs. To get a consistent dump of MyISAM tables, it is necessary to disable writes entirely, or to lock all tables for the duration of the dump. Since this is generally not feasible on a production server, it may be necessary to dump from an existing slave instead, where replication can be stopped for the duration of the dump.
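A typical invocation might look like the following; host, credentials, and schema names are placeholders to adapt to your environment:

```shell
# --single-transaction takes a consistent snapshot of transactional (InnoDB) tables
# --master-data=2 records the binlog coordinates (as a comment) needed to
#   start ClustrixDB as a replication slave afterward
mysqldump -h mysql-host -u backup_user -p \
    --single-transaction \
    --master-data=2 \
    --databases mydb > mydb.sql
```

If the source contains MyISAM tables, remember that --single-transaction does not protect them; dump from a slave with replication stopped, as described above.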

Ensure the dump completes successfully

To avoid the mysqldump being interrupted by a network connection reset or similar issue, Clustrix Support recommends using the screen utility so that the session running the mysqldump survives a dropped connection. If you are unfamiliar with the screen utility, search for "GNU Screen" to read about it; if you do any amount of serious work at the command line, screen is indispensable. Regardless of whether you use screen or some other method to invoke and monitor the dump, always check the tail of the dump file to make sure the dump completed successfully. You should see several session-variable SET statements; if instead you see the middle of a multi-row insert, your dump was interrupted, or the file was otherwise truncated, and either way you are unlikely to be pleased with the results of restoring such a file.

Don't bother with the `mysql` database (users and permissions should be copied with clustrix_clone_users)

Avoid dumping internal MySQL databases such as `mysql`, which are of no use to ClustrixDB; ClustrixDB will dutifully create a `mysql` database and restore the contents, but they will have no effect on the functioning of the system, as you might expect were you to restore the dump on a MySQL server. In particular, users and permissions cannot be propagated this way. See the section on Migrating User Permissions on how to use clustrix_clone_users.

clustrix_import

Standard MySQL practice is to import a mysqldump by redirecting it to the mysql client on the shell command line, or by using the source command within the mysql client. Note that this method can result in very long import times, as it fails to take advantage of ClustrixDB's parallel processing.

clustrix_import is a Python script developed by ClustrixDB which reads mysqldump output and loads the data into a ClustrixDB cluster in a multi-threaded fashion. It can be run directly on the cluster (in which case the dump should be staged in the /clustrix directory, which has plenty of scratch space), or from any Linux client with Python 2.4 and MySQLdb (MySQL driver for Python) 1.2.1. 

clustrix_import first connects to the cluster at the given address and determines the number and IP addresses of all nodes. It then spins up 8 threads per node (by default) and distributes the inserts across these threads. To obtain optimal slicing, it creates each table with a large slice count (3x the number of nodes in the cluster), and then reslices smaller tables back down to a smaller slice count. It also checks for poor distribution, where an index is dominated by a single value, and fixes this as described in the Distribution Key Imbalances section of Data Distribution.

In addition to parallelism and optimal data distribution, clustrix_import has a few “handrails”, such as checking whether a binlog exists, since in most cases you would not want all the statements associated with an import going into a replication stream, particularly the DROPs.  Also, clustrix_import has a locking mechanism which prevents multiple instances of clustrix_import from running, since a single instance will already fully utilize cluster resources (additional instances, if allowed, could reduce overall throughput or be destabilizing). 

See also: More on clustrix_import

Loading Data Without clustrix_import

If clustrix_import cannot be used to import your data, you can take some proactive measures to ensure efficient data population. While the rebalancer can ultimately rectify just about any problem created during initial data load, poor slicing and distribution can result in much longer import time, and it may take quite some time for the rebalancer to obtain optimal data distribution.

Pre-slicing Tables

number of slices = number of nodes

When populating large tables (say 10GB or larger), it is advantageous to set the table slice count when the table is created, before loading data. This avoids the problem of "racing the rebalancer", where the rebalancer recognizes that the table needs more slices and begins splitting while you are still actively loading data, which lengthens the overall load time. If you can estimate the size of the data you are importing (for example, by importing some number of rows and checking the size in system.table_sizes), a good rule of thumb is a little more than 1 slice per 1GB. Generally, you want at least one slice per node for optimal load distribution; this is the default in v5.0+. Prior to v5.0, you can set the global variable hash_dist_min_slices to the number of nodes to get the same effect.
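The rule of thumb above can be expressed as a small helper; the function name and the 1GB-per-slice default are illustrative, not part of any Clustrix tooling:

```python
import math

def suggested_slices(table_size_gb, num_nodes, gb_per_slice=1.0):
    """Rule of thumb from the text: a little more than one slice per GB
    of data, and at least one slice per node for even load distribution."""
    by_size = math.ceil(table_size_gb / gb_per_slice) + 1  # "a little more than 1/GB"
    return max(by_size, num_nodes)

# e.g. a 10GB table on a 4-node cluster
print(suggested_slices(10, 4))  # 11
# a small 2GB table on an 8-node cluster still gets one slice per node
print(suggested_slices(2, 8))   # 8
```

For tables expected to grow rapidly, pass a smaller gb_per_slice (e.g. 0.5), per the "Anticipating table growth" guideline below.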

To set slice count at table creation time, simply append SLICES=N to the end of your CREATE statement.  You can also reslice the table with ALTER TABLE foo SLICES=N.  Note that in both cases, the slicing for the base representation and all indexes are set to N. 

Pre-slicing for tables > 100GB

For very large tables (larger than 100GB), you may wish to independently set the slice count for the base/primary representation (which contains all columns of each row) and the indexes (which will contain the columns included in the index, as well as the column(s) of the primary key).  Generally, indexes will require fewer slices than the base representation, since the tuples are much narrower; how many fewer slices are required depends on how wide the full table is (particularly how many varchar or blob columns), and whether an index includes such a wide column.  Instead of estimating based on column count/size for indexes, you may also load a small but representative portion of your data into ClustrixDB, and then use the system.index_sizes table to ascertain the relative sizes of the base representation and indices. 

You can set slicing for individual indexes by including SLICES=N after the index definition, before the comma which separates it from the next index, or before the closing parenthesis if the last index:
CREATE TABLE ...
  PRIMARY KEY (`id`),
  KEY `index_bids_on_user_id` (`user_id`) SLICES=12,
  KEY `index_bids_on_timestamp` (`timestamp`) SLICES=24,
  KEY `index_bids_on_auction_id` (`auction_id`) SLICES=12
) SLICES=36;

Note that the slicing given for the table (after the closing parenthesis) applies to the base representation, and any indices which do not specify their own slice count. 

Anticipating table growth

In addition to the guidelines above, consider also how much your table is expected to grow over time – you may want to aim for 0.5GB/slice when you anticipate rapid table growth.   

Setting Distribution for Indexes Dominated by a Single Value

As noted above, clustrix_import takes care of redistributing indexes dominated by a single value. The rebalancer can also take care of this for you, but you can avoid "racing the rebalancer" by pre-setting the distribution to include the primary key: set DISTRIBUTE=N+1, where N is the number of columns in the index.
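For example, in the following hypothetical table the index on `status` has only a handful of distinct values, so it would be dominated by a single value; DISTRIBUTE=2 (N+1, with N=1 indexed column) hashes on the indexed column plus the primary key:

```sql
CREATE TABLE orders (
  id BIGINT NOT NULL,
  status VARCHAR(16),
  PRIMARY KEY (`id`),
  -- distribute on (status, id) rather than status alone
  KEY `idx_status` (`status`) DISTRIBUTE=2
) SLICES=12;
```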

Parallelize Data Load

Aside from the initial schema and specifying slicing and distribution, the main factor in import speed is the amount of parallelism. A single-threaded import process fails to take proper advantage of the ClustrixDB parallel architecture, and may even run more slowly than it would on a MySQL instance. Consider how you can divide your data load process so that it can be performed in parallel.

For LOAD DATA INFILE, you can split input files into smaller chunks and run them in parallel, starting in v4.1 (prior releases locked the destination table for each LOAD DATA INFILE).
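One way to do this from the shell; the file, schema, and table names are placeholders, and <vip> stands for your cluster's virtual IP:

```shell
# Split the input into 1M-line chunks, then load four chunks at a time,
# each in its own mysql session so the work is spread across the cluster.
split -l 1000000 data.csv chunk_
ls chunk_* | xargs -n 1 -P 4 -I {} \
    mysql -h <vip> -u importer -p mydb \
    -e "LOAD DATA LOCAL INFILE '{}' INTO TABLE t"
```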

If your application loads data into the database directly, consider whether this data load can be performed in a multi-threaded fashion, with each thread connecting as a separate session (via the VIP, in order to distribute these front-end connections across the cluster nodes).
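A minimal sketch of this pattern, assuming the MySQLdb driver mentioned earlier and a VIP at 10.0.0.100 (host, credentials, schema, and table are placeholders):

```python
import MySQLdb
from concurrent.futures import ThreadPoolExecutor

def load_batch(rows):
    # Each worker opens its own session, so the VIP can spread
    # these front-end connections across the cluster nodes.
    conn = MySQLdb.connect(host="10.0.0.100", user="importer",
                           passwd="secret", db="mydb")
    cur = conn.cursor()
    cur.executemany("INSERT INTO t (id, val) VALUES (%s, %s)", rows)
    conn.commit()
    conn.close()

# batches: a list of row lists, pre-split by the application
batches = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(load_batch, batches)
```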

Use Multi-Row Inserts

Where possible, aggregate single insert statements into larger multi-row statements.  ClustrixDB handles these multi-row statements more efficiently, particularly since it reduces the per-row transactional overhead. Combining parallelism with multi-row inserts should provide optimal data load performance (note that this is exactly what clustrix_import does).
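For instance, rather than issuing many single-row statements, combine them (table and column names here are illustrative):

```sql
-- Instead of:
INSERT INTO t (id, val) VALUES (1, 'a');
INSERT INTO t (id, val) VALUES (2, 'b');
INSERT INTO t (id, val) VALUES (3, 'c');

-- Use one multi-row statement, amortizing per-row transactional overhead:
INSERT INTO t (id, val) VALUES
  (1, 'a'),
  (2, 'b'),
  (3, 'c');
```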

References

  1. More on clustrix_import
  2. Download Clustrix Import