Skip to end of metadata
Go to start of metadata

clx_import is currently offered as a beta

clx_import is a tool for quickly migrating data from various data sources into a ClustrixDB database. It has 3 modes of operation:

  1. (fastest) Import a SQL file created by mysqldump (similar to cat file | mysql, but much faster)

  2. import single delimiter-separated file, similar to mysql’s LOAD DATA INFILE functionality

  3. Import several delimiter-separated files in parallel

clx_import will examine the source file and estimate how tables should be sliced. However, this is just an estimate that does not fully account for the different ways in which data is represented. In most cases, the Rebalancer will further optimize the data balance after the import, which may impact initial performance. See Managing the Rebalancer for more information on monitoring the rebalancer's progress. 

Exporting data from another database

If you are unsure which format to export, use mysqldump with no special options other than the user, password, and tables or DBs to dump, and import with clx_import --sql-file.

If you have a character set other than utf8 set via my.cnf, invoke mysqldump with --default-character-set utf8 . This will not change the character set of the data, or future connections by any application.

Importing from a mysqldump

To import a database exported using mysqldump:

shell> clx_import -u username -H hostname_or_ip -p --sql-file filename
        

Depending on if the dump file contains USE directives , you may need to specify the target database using the option for --db.

Importing from a delimiter-separated file:

To import a delimiter-separated value into an existing table:

shell> clx_import -u username -H hostname_or_ip -p --ldi ldi_definition     

Depending on if the dump file contains USE directives , you may need to specify the target database using the option for --db.

Where ldi_definition is a string matching the following syntax:

--ldi “'file_name'
   [REPLACE | IGNORE]
   INTO TABLE tbl_name
   [CHARACTER SET charset_name]
   [{FIELDS | COLUMNS}
       [TERMINATED BY 'char']
       [[OPTIONALLY] ENCLOSED BY 'char']
       [ESCAPED BY 'char']
   ]
   [LINES
       [STARTING BY 'string']
       [TERMINATED BY 'string']
   ]
   [IGNORE number num_lines]”

Because the whole ldi definition must be a single shell arguments. Since it contains several quote values, it is recommended that it be enclosed in double quotes. Because of this, if tbl_name is enclosed in `backtics`, they must be \`escaped\` to avoid shell expansion.

If no values for FIELDS or LINES are specified, the behavior will be as though the following was specified: 

FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\' LINES TERMINATED BY '\n' STARTING BY ''

Import multiple delimiter-separated files:

To import several delimiter-separated values at once, create an empty directory containing one file called ddl.sql containing CREATE TABLE statements for the tables to be imported. Then create one file per table in said directory with the same name as the table it contains data for. Once this is arranged, invoke clx_import as:

shell> clx_import -u username -H hostname_or_ip -db database -p --csv-dir path_to_data --csv-options csv_options
        

Where path specified contains ddl.sql and the data files. Tbe csv_options is very similar to the ldi_definition defined above, but should not include a filename or the INTO TABLE clause.

Note: When importing multiple files at once, they all must use the same delimiter.


clx_import options

clx_import supports several other options that can be used in various circumstances, but the common use case shouldn’t require these:

--no-auto-slice

clx_import attempts to estimate the size of tables and set the number of slices optimally before importing. This helps prevent performance degradation from reslicing tables after import is finished. However, as it is impossible to estimate on-disk data size from text format data size, this may be undesirable.

If ddl.sql or the sqldump file contains SLICES directives on CREATE TABLE statements, they will be overridden unless this is specified.

-error-file

Failed INSERT statements will be added to this file. Defaults to a temp file that will be printed at the end of the import

--log-file

Logs from the import itself will be put here (i.e. the information shown on the screen during import), defaults to a temp file

--no-binlog

Statements run by the importer will not be binlogged

--insert-ignore

Issues INSERT IGNORE statements instead of INSERT statements. This overrides whatever kind of statements are actually in the dump (and without this, the reverse is true)

--no-globals

Normally clx_import will set various global variables to expedite import performance at the cost of general performance (and unsets them when finished). This skips that step.

--presql

A .sql file to be executed after tables are created and autoslicing has been done, but before any inserts. This file does not support DELIMITER changes (, statements must end with ‘;\n’), and should be the same charset as the file being imported.

Differences between clx_import and LOAD DATA INFILE:

Notable differences between this and standard LOAD DATA INFILE is that

  • setting individual columns (via a SET clause) is not supported

  • The field terminator must be a single character (rather than an arbitrary string)

  • The line terminating character must be some sort of newline (\r, \n, or \r\n) but multiple newlines is not supported

  • LOW PRIORITY and CONCURRENT are not supported

  • clx_import does not support importing tables with ALLNODES. Use Backup and Restore instead. 

Caveats for clx_import

  • clx_import does not support importing dump files collected using the options --complete-insert or --replace
  • clx_import requires Python 2.7
  • No labels