wxcsvprofile - analyse CSV and suggest appropriate SQL create text

SYNOPSIS

wxcsvprofile [options] file1.csv file2.csv

DESCRIPTION

wxcsvprofile reads a CSV file and produces an SQL “CREATE TABLE” command that would create a table suitable for the data file. Generally, each column will be the most restrictive type that will work for all the values. Every record in the file will be sampled, unless a number of records is given with -r.

In the simplest case, if the input file is called datafile.csv then the create text is written to a file called datafile.sql. If this file already exists, wxcsvprofile tries datafile.sql.1, then datafile.sql.2, and so on, until it finds a file name which does not already exist.

wxcsvprofile normally creates one .sql file for each input file (unless you’ve told it to put all the create statements in one file with the -o option). These .sql files will be placed in the same directories as the respective data files, unless you’ve specified another directory with -O.

When determining the type of a column, the order of precedence is: TIMESTAMP, TIME, DATE, TINYINT, SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, FLOAT, CHAR/VARCHAR. The type given to a column will be the earliest one in that list which matches enough samples. The definition of “enough” is generally “all of them” but this can be changed with the -m option.

If the first record of the file contains fields which can only be strings (meaning they aren’t valid as timestamps, times, dates, integers, decimals or floating point numbers), then this record is assumed not to be a data record but one which contains the column names. The record is therefore not considered for sampling, and these column names are used in the create text. You can override this and tell wxcsvprofile that the first record definitely is (or isn’t) a header using the -\-header (or -\-no-header) option.

OPTIONS

Arguments which are mandatory for the longhand form of options (e.g. -\-create-suffix) are mandatory for their shorthand counterparts (-c) too.

At least one input file must be given. If the input file name is “-“, wxcsvprofile reads from standard input.

-6, -\-version-6

Generate SQL compatible with version 6 Kognitio servers. This means don’t specify character set information for CHAR and VARCHAR columns. Each byte in the file will be assumed to represent one character.

-c, -\-header

Take the first record in the file as a header which contains the names, even if wxcsvprofile thinks it doesn’t look like one. These names will be used in the generated create text. This option is the opposite of -\-no-header.

-C, -\-create-suffix suffix

Specify the suffix to use for the create text file. The data file name’s extension, if it exists, is replaced with this string. If the data file name does not have an extension, this string is appended to the file name. The default is .sql.

-D, -\-date-format format

Specify date format used by the CSV file. The format of the format string is the same as used by wxloader. The default is YYYY-MM-DD.

-d, -\-delete-existing

Don’t use suffixes like .1 and .2 if the destination file already exists; just stomp on any existing file.

-\-disk-only

Append the word “DISK” to the create text, so the table gets created on disk only.

-\-ext-file pattern

Use pattern as the value of the “file” attribute in the target string of the external table create text we generate. This option is only useful in conjunction with the -x option.

-f, -\-field-separator char

Specify the field separator used by the input file. The default is a comma. The argument can be a single character, a hex code preceded with 0x (e.g. 0x7c for a | character), a decimal character code of at least two digits, or an escape sequence such as \t (tab) or \\ (backslash).

-\-force-strings

Force all columns to be CHAR or VARCHAR columns, even if the fields look more like integers, dates, times or anything else.

-h, -\-help

Show help.

-m, -\-error-margin n[%]

The margin of error allowed when deciding what type a column is. The quantity, which may be a number of records or a percentage, specifies the number of samples in a column which may fail to parse as a particular type (“outliers”) before the column is considered not to be that type.

The default is 0, which means no outliers are allowed. This means that in a file of a million rows, if a particular column has the value “1” in 999,999 of the records but “1a” in the millionth record, then the column is taken as a string, not an integer.

Note that if this is set to anything other than 0, wximport or wxloader may give errors upon loading the file into the table. This option is really only useful if the data file is known to have n bad records in it and wximport or wxloader is going to weed them out and discard them.

-N, -\-naive-delimiting

Naive delimiting: don’t treat quotes specially in the input file.

-\-no-header

Don’t take the first record of the file as a column name header, even if wxcsvprofile thinks it looks like one.

-o, -\-output-file file

Specify the single file to which to write all the CREATE TABLE statements. If this is the same file as any of the input files, wxcsvprofile does not touch it; it gives an error and bails out. If the argument is “-“, the create text is written to standard output. Implies -d.

-O, -\-output-directory directory

Specify the directory to write the individual create text files to. By default, each create text file is written to the same directory in which its data file resides.

-\-promote-ascii=yes|no

Specify whether to promote fields which contain only ASCII characters to LATIN1 columns automatically. The default is yes. This is recommended, as LATIN1 columns take up no more space on the server than ASCII7 columns, but can store characters which ASCII7 columns cannot such as pound signs and accented letters.

-\-promote-monotonic=yes|no

If an integer column is deemed to be a TINYINT or SMALLINT but it is monotonically increasing, then promote it to an INTEGER. The default is yes.

This is useful when sampling only the first few hundred rows of a file with an incrementing field (e.g. record number). If only 500 records are sampled, then this column might naively be taken as a SMALLINT because the highest value is 500. However, if there are a million records in the file then a SMALLINT isn’t going to be big enough for any record numbered greater than 32,767. So wxcsvprofile observes that it’s a steadily increasing field and anticipates it might need a larger integer.

-\-promote-real=yes|no

Promote all REAL columns to FLOAT. The default is no.

-Q, -\-quote-character quote

Specify the quote character. The default is a double quote. It can be given as a character, a character code or an escape sequence in the same way as to the -f option.

-q, -\-quiet

Be less verbose.

-r, -\-sample-rows count

Number of rows to sample. If not given, wxcsvprofile reads the whole file.

-\-ram-only

Make create text with “CREATE RAM ONLY TABLE” rather than “CREATE TABLE”.

-S format

Specify timestamp format used by the CSV file. The format of the format string is the same as used by wxloader. The default is YYYY-MM-DD HH:NN:SS.

-\-skip-records count

Skip count records at the start of each file.

-T, -\-timestamp-format format

Specify time format used by the CSV file. The format of the format string is the same as used by wxloader. The default is HH:NN:SS.

-t, -\-table-name name

In all create texts generated, name the table name. If this is not given, the table name will be guessed from the data file name.

-v, -\-verbose

Be more verbose. Repeat for further verbosity.

-\-varchar-threshold count

String columns containing strings of count or more characters are considered to be VARCHAR. String columns containing only shorter strings are CHAR. The default is 32.

-x connector, -\-ext-connector connector

Generate the create text for an external table, using connector as the external connector name. The “file” argument of the target string will be the argument to -\-ext-file.

FILES

Unless instructed otherwise, wxcsvprofile creates one .sql file for each CSV file given on the command line. So if the following command is used:

wxcsvprofile a.csv b.csv c.csv

Three files will be created, a.sql, b.sql and c.sql. Each file will contain the suggested SQL create text for that table. The typical contents of a .sql file generated by wxcsvprofile is as follows:

-- Automatically generated SQL create text
-- Generated by wxcsvprofile 7.01.02-example

-- Input file: a.csv
CREATE TABLE a (
        "C1" INTEGER,
        "C2" DATE,
        "C3" TIME(2),
        "C4" VARCHAR(161) character set LATIN1
);

If wxcsvprofile found any anomalies in the CSV file, for example, records with an unexpected number of fields, warnings are also written to the .sql file as SQL comments. The warnings are also given on the standard error stream.

By default, each .sql file is written to the same directory in which its corresponding .csv file resides. This behaviour can be changed using the -O option (see OPTIONS above). To write all create statements to the same file, specify the file with the -o option.

EXAMPLES

To read all of sales.csv and create a file called sales.sql containing appropriate table create text:

    wxcsvprofile sales.csv

To generate a .sql file for each of the .csv files in the current directory:

    wxcsvprofile *.csv

To write the .sql files to a different directory:

    wxcsvprofile -O ~/create_texts/ *.csv

To read the same .csv files as before, but put all the create statements into one file:

    wxcsvprofile -o create.sql *.csv

To specify a different delimiter, for example, a pipe:

    wxcsvprofile -f ’|’ sales.csv

To specify a tab as the field delimiter, if you don’t want to type a tab on the command line:

    wxcsvprofile -f 0x9 sales.csv

If you have a file which is known to have a few bogus records in it that you don’t mind being rejected by the load process, you can tweak the margin of error. The following command will allow a column to pass as a certain type as long as 99% of its samples match that type:

    wxcsvprofile -m 1% sales_broken.csv

To read a file with a non-standard date format, e.g. American dates:

    wxcsvprofile -D “M/D/YYYY” -S “M/D/YYYY HH:NN:SS” foo.csv

REPORTING BUGS

Please report any bugs to wx2-helpdesk@kognitio.com.

EXIT STATUS

wxcsvprofile returns 1 if it could not generate create text for at least one of the input files, or if an unexpected error occurred. It returns 0 on success.

AUTHOR

Kognitio Ltd.

SEE ALSO

wxloader