wxloader-format - the syntax of a wxloader format string

DESCRIPTION

wxloader uses the conversion arguments supplied to it and the column types of the target table to generate a format string. This format string is sent to the server at the start of the load, and the server uses it to generate code to parse a record.

Normally the user will not need to edit or create their own format string, and the one generated by wxloader will suffice. However, in the case of fixed-width formats, unusual text formats or creative interpretations of the CSV format, the user may need to create a format string of their own and pass that to wxloader.

INTRODUCTION

A wxloader format string has similar, but not identical, syntax and semantics to the format string required by the C library function scanf(3). A comma-separated file containing two integers and two strings intended to be imported into a table with the same column types might use a format string like this:

%d , %d , %[^,] , %[^,\n]\n

Breaking the string down into its individual elements, that tells the Kognitio server to expect a base-10 integer, any amount of spaces (including none), a comma, some more optional spaces, another integer, more spaces, a comma, more spaces, any number of characters up to the next unquoted comma, more spaces, a comma, any number of characters up to the next unquoted comma or newline, and a newline. The newline at the end can either be encoded in the format string as an escape sequence ‘n’ or appear as an actual newline.

Each of these elements is known as a format element. This includes conversions, which begin with the % character, and literal strings. A sequence beginning with % is known as a conversion directive. The structure of a conversion directive is discussed in the next section.

Notes

Any number of spaces in the format string matches any number of spaces, including none, in the input. Note that only the space character is matched in this case; tabs, newlines and other whitespace characters are not matched by spaces in the format string, and such characters in the format string do not count as spaces for this purpose. This differs from how the scanf(3) format string behaves, and is intentional because files with tab-separated fields are very common. If you want to skip any number of tabs, newlines, carriage returns and spaces, use the conversion specifier %*[tnr ].

Unless a conversion directive is specified as unquoted or is fixed-width, any field in the input may be quoted. The server takes care of extracting the data from the quotes and escaping any embedded quotes for you. This means a %d conversion will match 123, “123” and ” 123 “.

Aside from spaces, any characters which are not conversion directives are treated as literals, and must match the input exactly.

A literal ‘n’ character in the format string silently consumes any ‘r’ character at that point in the input before going on to attempt to match the ‘n’. This is so that the same format string can be used for files created on Windows and Unix systems, which have different line-ending conventions.

The following escape sequences may appear in the format string. They are converted to their corresponding characters before the format string is parsed.

Sequence

Meaning

\n

newline (0x0A)

\t

tab (0x09)

\r

carriage return (0x0D)

\\\

backslash (0x5C)

\xNN

two-digit hexadecimal character code

\uNNNN

four-digit hexadecimal character code

A backslash immediately preceding a carriage return or newline causes the carriage return or newline to be ignored and not considered in the format string. A backslash before any character not described above is not an escape sequence; it is a backslash followed by that character. However, some parts of the format string have their own escaping rules on top of these, particularly quoted strings in attribute lists (see Attributes).

STRUCTURE OF A CONVERSION DIRECTIVE

A conversion directive is made up of the following elements, in this order:

  • the % character;

  • optionally, an expression of the form n$, where n is the destination column number;

  • optional flag characters;

  • an optional field width, which should be one or more digits 0-9;

  • optional attributes enclosed with braces ({});

  • the format letter or conversion specifier, which describes what kind of conversion this is.

Only the conversion specifier and the initial % are mandatory.

For example, in the conversion directive:

%4$.10{foo=123, bar="abc"}d

4$ means the result is to go in the fourth column of the table, . is a flag character (see the section on Flags for more information), 10 is the field width, the conversion directive has additional attributes foo and bar which are set to 123 and abc respectively, and the d means it is a decimal integer conversion.

Column number

Normally the order of fields in a file match the order of their corresponding columns in the table. The column number of each conversion directive is therefore implicit from its position in the format string. If this is not the case, you need to specify the column number of every conversion directive.

The syntax is n$ where n is a column number.

If given, it must immediately follow the % sign and precede any flag characters, field width, and attributes. It indicates that the result of this conversion is to be placed in column n. Column numbers start from 1.

If n is less than 1, or greater than the number of columns in the table, this is an error. If the same column number appears more than once in the format string, this is an error. Additionally, as with scanf(3), if a column number is specified for one conversion directive it must be specified for all conversion directives other than %%, %_, %”” or those with the * (ignored) flag. Failure to do this is results in an error LF0008 (Inconsistent column numbers in format string).

Flags

A conversion directive may contain zero or more flag characters after the % character, or after the n$ sequence if there is one. The recognised flag characters are:

Flag

Meaning

*

Expect a field to be here, but then ignore it; don’t put anything into the table. This performs the same function as the -ec option in wximport and the -\-exclude-fields option in wxloader.

-

(hyphen) Asserts that this field will not be quoted, and that quote characters are to be treated like any other character and with no special meaning. In the case of a “naive delimiting” load, all format directives should have this flag.

?

This field is optional. If the input contains something which does not entirely match this conversion directive, skip the conversion directive at no penalty and don’t consume any input. If this format directive is associated with a column, its value will be NULL (unless the field’s notnull attribute is set; see the ATTRIBUTE REFERENCE section below).

^

The opposite of *. Don’t consume any input, but still put something into the column which corresponds to this conversion directive. This flag is mainly useful with the constant attribute, and is used to import a constant into a column where no corresponding field exists in the input file. If the constant attribute is not set, a NULL is inserted into the column (but again, see the semantics of the notnull attribute below).

.

If this flag is given, then the field width is not taken to specify the exact width of the field, but the maximum width. The directive does not describe a fixed-width field. The flag has no effect if no field width is given. Input consumption works the same as if no field width were given, but no more than the specified number of characters are consumed. If this flag is given to a quoted field, then the entire field is consumed but only the first <field width> characters are considered for conversion.

In the case of a %”” (literal) directive, only the ? flag has any effect.

Field width

A field width specifies that the field occupies a fixed number of characters. Exactly this number of characters will be consumed, and quotes are never treated specially. If it is not the case that the entire field can be successfully converted, it is a conversion error. Spaces may pad the input field to make it the correct length; trailing spaces are ignored in all fixed-width fields, and additionally leading spaces are ignored in all fields which do not correspond to a string column.

If the . flag is given, the semantics are different; it is as if no field width were given but a maximum of this number of characters are converted.

Setting the field width has no effect for a %”” (literal) conversion.

Attributes

Up until this point, the syntax of the format string has been broadly similar to that of the scanf(3) format string. However, in order to support a greater set of conversion settings, we extend the syntax to allow arbitrarily-named attributes to be attached to a conversion directive. These attributes are a series of comma-separated name=value pairs that appear in braces immediately before the conversion specifier.

The name can be any nonempty sequence of letters, numbers and underscores. The value can contain anything, but closing braces, commas and double quotes need to be escaped with backslashes. The value can also be put in double quotes, although embedded double quotes and backslashes still need to be preceded with backslashes.

Some attribute names have a special meaning to Kognitio, and affect how the field is converted. Any attributes with an unknown name are ignored. A list of all attributes recognised by Kognitio along with what they mean can be found in the ATTRIBUTE REFERENCE section below.

The attribute list may be empty ({}) in which case it has no effect. It may also include a comma after the last attribute, which is ignored.

A conversion directive with a set of attributes might look like this:

%{name1=value1, name2="value2", defaultval="Bob \"Pie\" Bobertson"}c

Conversion specifier

The conversion specifier indicates what kind of conversion the directive describes. The permissible values are as follows.

%

Matches a literal % character. The full conversion directive is %%. Other than that, this is treated no differently from any other literal.

The directive matches a literal string, which should be terminated with another double quote. The conversion directive %”foo” is equivalent to foo. Backslashes and double quotes within the literal string must be escaped with a backslash (literal backslashes in the string need to be written as \\** because **\\ is a format string escape sequence). Spaces within such a quoted literal are treated as literal spaces, unlike spaces elsewhere in the format string which match any number of spaces. This directive is only useful if you need to attach attributes to a literal, where it makes sense. Currently the only attribute supported for literals is the ? flag, which makes the literal optional. If an optional literal does not completely match the current position in the format string, the directive consumes no input and it is skipped at no penalty. Other attributes may be specified for a literal but they have no effect.

_

(underscore) Matches a number of spaces equal to the field width, or one space if the field width is 0, 1 or not present. This is useful when the format demands an exact number of spaces; a space in the format string matches any number of spaces, which may not be what you want. This conversion is equivalent to a %” (literal) conversion with the appropriate number of spaces in the string. As with literals, attributes may be attached to this conversion but the only one which has any effect is the ? flag.

a

No input is consumed. %a isn’t really a conversion directive as it doesn’t correspond to any particular field or column. Its effect is that any attributes, flags, field widths and so on given to this attribute are automatically applied to any subequent conversion directives which do not set the corresponding value for themselves. This means it sets default attribute values for all format directives which appear subsequently in the format string. This is useful for defining an attribute which applies to every conversion directive in the record. For example, if a file has three DATE fields in it, all with the same unusual date format, you might use the format string: .. code-block:: bash

%{format=”DDMMYYYY”}a%D , %D , %D

which is equivalent to, but less cumbersome than: .. code-block:: bash

%{format=”DDMMYYYY”}D , %{format=”DDMMYYYY”}D , %{format=”DDMMYYYY”}D

If more than one %a directive appears in the format string, each one completely replaces the previous occurrence for the purpose of reading subsequent directives. In other words, later %a directives do not compound upon earlier ones; they replace them.

c

A fixed-length character field. The number of characters to consume is given by the field width; if this is not present, it is taken as 1. If the field width is given as 0, the conversion directive matches all characters up to the end of the record. Conversion stops when a number of characters equal to the field width has been consumed, or if the end of the record is reached before then. Quotes are never treated specially in a fixed-length field; it is as if the - flag is present.

[]

If the conversion specifier is [, it introduces a set of characters, terminated with ]. It behaves like the corresponding scanf(3) conversion directive; the conversion matches any number of characters from the given set of allowed characters. If the set begins with a caret (^) then the match is inverted; we match until we find a character given in the set. Character ranges can be given by putting characters either side of a hyphen (-) e.g. [A-Za-z0-9]. To match a closing square bracket (]), make it the first character after the [ or the ^. To match a hyphen, make it the last character in the set. The attribute term can also be set for this conversion specifier; it specifies a termination sequence to be considered as well as the set of characters. For example, if term is set to wobble, this conversion directive will stop matching if it finds an unquoted wobble in the input, as well as if we see a character that does not match the set specified in the square brackets. This conversion deviates from scanf(3) in two important ways:

  1. If a quoted string is encountered in the input, and the - flag is not given, then everything inside the quoted string is always accepted, regardless of whether it matches the set of characters given in the square brackets. So the conversion directive %[^,] matches any number of characters up to the next unquoted comma, and would be the most likely format directive to use in plain CSV.

  2. In order that this conversion works with empty CSV fields, commonly used to represent NULL, this conversion always matches the empty string. This means it’s not a matching failure if the first character we read doesn’t match; it just means the field is empty.

]

A special case of the %[] conversion directive above, which specifies the inverted empty set of characters. This means it matches all characters but still respects the term attribute. So %{term=”:-)”}] consumes characters until we see an unquoted :-) sequence. A %] directive, that is, an unconstrained pattern match with no termination sequence, will match until the end of the record.

D

Match a date field whose format is given in the format attribute. If the format attribute is not given, the format is YYYY-MM-DD. Example (with default format string): 2010-12-10

d

Match a base-10 integer which may optionally have a leading - or + sign. Examples: 052, 42, 0x2a (all of which are 42)

e

Match a floating-point number. It matches the same as %f but with optionally an exponential part on the end. This exponential part begins with “e” or “E”, then has an optional + or - sign, then has an exponent consisting of at least one decimal digit. Example: -1.23456e-004

f

Match a fixed-point number. Specifically it matches an optional plus or minus sign, followed by any number of digits, and optionally a decimal point character followed by any number of digits. If there are more decimal places than will fit in the column, the value is rounded appropriately; if the two nearest suitable values are equally close then it is rounded so that the last digit is even. For example, if the column is a DECIMAL(3,2), then the value “1.364” would become 1.36, “1.368” would become 1.37, “1.355” would become 1.36, and “1.345” would become 1.34. If the column’s precision is so low that the magnitude of the value in the field is too high for the column, this is a conversion error. Example: -1234.5678

i

If after an optional - or + sign the field begins with “0” %i matches an octal number. If the field begins with “0x” or “0X” it matches a hexadecimal number. Otherwise it matches a decimal number as %d. Example (with default format string): 2010-12-10 12:32:40 Example (with default format string): 12:32:40

I

Match an INTERVAL whose format is given in the format attribute. The syntax of this format string is the same as for a date, time or timestamp format string except that the only format letters supported are Y, M, D, H, N and S. If the format attribute is not given, the format is the same as the ordinary SQL representation of that format type. The following table shows these default interval representations.

Interval Type

Format

Example

YEAR

Y

15

MONTH

M

180

DAY

D

144

HOUR

H

48

MINUTE

N

24

SECOND

S

12

YEAR TO MONTH

Y-M

3-04

DAY TO HOUR

D H

4 10

DAY TO MINUTE

D H:N

27 03:13

DAY TO SECOND

D H:N:S

3 04:21:17

HOUR TO MINUTE

H:N

01:23

HOUR TO SECOND

H:N:S

21:12:32

MINUTE TO SECOND

N:S

4:33

o

Match an octal integer which may optionally have a leading - or + sign.

S

Match a timestamp field whose format is given in the format attribute. If the format attribute is not given, the format is YYYY-MM-DD HH:NN:SS.

T

Match a time field whose format is given in the format attribute. If the format attribute is not given, the format is HH:NN:SS.

x

Match a hexadecimal integer which may optionally have a leading - or + sign. The letters A-F may be uppercase or lowercase in the input.

Nulls

A field is taken as NULL if it is empty; that is, if it contains nothing but spaces. An empty quoted string (“”) is taken as an empty string unless the blanknull attribute for the appropriate conversion directive is set. A field is also taken as NULL if it matches the value of its conversion directive’s nullvalue attribute.

In the event that a field is determined to be NULL for any of the above reasons, the defaultval attribute, if it was set for the conversion directive, is checked first, and the attribute’s value is converted as if it were the input field. If this also converts as NULL, or if the defaultval attribute is not set, then the value loaded into the column as NULL. This behaviour can be modified; see the description of the notnull attribute in the ATTRIBUTE REFERENCE section.

Finally, a date of 0000-00-00 or a timestamp of 0000-00-00 xx:xx:xx (where xx represents any number) is automatically taken as NULL.

ATTRIBUTE REFERENCE

The following attributes are recognised by Kognitio. Attributes which take a “character” can take a single character; any excess characters are ignored. For attributes that take a boolean value, if the value is a nonzero integer or the first character of the value is T, t, Y or y, it is taken as true, otherwise it is taken as false. For attributes that take an integer, it is an error to supply something that isn’t an integer as the value.

allowbool (boolean)

If set for a conversion directive which corresponds to an integer column, then TRUE is accepted as a synonym for 1, and FALSE is accepted as a synonym for 0. The strings are compared case-insensitively.

blanknull (boolean)

If set for a conversion directive which corresponds to a string column then any attempt to import a blank string into the column will instead import a NULL. This is intended for compatibility with programs that export a NULL as two double quote characters. Setting this attribute to true is equivalent to omitting the -z option in wximport. The default is false.

constant (string)

Overrides whatever was read from the field with another string. This is especially useful in conjunction with the ^ flag, which ensures nothing is read. The string given as the value of this attribute is read and converted in accordance with the rest of the conversion directive. If this doesn’t make sense, e.g. %{constant=”foo”}d, then every record will be rejected.

currency (character)

For fixed-point or integer fields, ignore this character if it appears before the number, in between the sign and the number, or after the number. For example, set currency to £ to cope with fields like “£1.02”, “-£1.02”, “£-1.02” and “-1.02£”. By default, no currency symbol is recognised.

defaultval (string)

If the field is absent (the record terminates early) then behave as if the field was the value of this attribute. This attribute is also used if the notnull attribute is set and the input field is a NULL.

dp (character)

Has no meaning for date, time, timestamp, string or integer fields. For fixed-point (%f) fields, it’s the character that represents the decimal point. The default is .

format (string)

Specify the format of a date, time or timestamp, depending on what type of conversion it is. The format syntax can be found in the wxloader man page in the DATE/TIME FORMAT section. The attribute names dateformat, timeformat, timestampformat and intervalformat can also be used; this is useful when setting date, time and timestamp formats in a %a directive. A space in a date, time, timestamp or interval format string does not match any number of spaces, as is the case in the main format string; it matches exactly one space. The default is YYYY-MM-DD for dates, HH:NN:SS for times and YYYY-MM-DD HH:NN:SS for timestamps. The default for intervals is the standard SQL string representation of the specific interval type.

implieddp (boolean)

If true, then for %f conversions whose column type is DECIMAL or NUMERIC, the number given in the field has an implied decimal point. This means it should be read as an integer but then divided by 10^s, where s is the column’s scale. If a decimal point appears in a %f field with the implieddp attribute set, the decimal point is not consumed and conversion stops there. The default is false.

minyear (integer)

This is only meaningful for date and timestamp conversions where the year is expected to be a two-digit number. It specifies that the year is no earlier than the value of this attribute. For example, if minyear is 1970, then a year of 00 to 69 means 2000-2069, and a year of 70 to 99 means 1970-1999. The default is 2000.

notnull (boolean)

If the field is empty, or contains only whitespace, it will be taken as NULL unless notnull is set to true. If notnull is set to true, NULL number fields become zero, NULL string fields become the empty string. For a date, time or timestamp, if no default value is given in defaultval, notnull is fixed to false. The default is false.

nullvalue (string)

A string which represents NULL. If the field matches this string, it will be taken as NULL (unless a default value is given, in which case that is used). If this attribute is set, it does not change the default behaviour that a blank field is taken as NULL.

quotechar (character)

The quote character. This is used for the purpose of determining what is a quoted string. The default is 0x22, the double quote character ().

term (string)

For %[…] conversions, stop if this string is reached in the input, and do not consume it. This is useful for delimited data files where the field separator is more than one character.

thousep (character)

Has no meaning for date, time, timestamp, exponential or string fields. For integer or fixed-point fields, this character is ignored in the input, as it is considered a thousand separator. No check is made that the thousand separators are in the right places, so if the thousand separator is , then “1,234”, “12,34” and even “12,,,34” are all valid and are taken as the number 1234. By default, no thousand separator is expected.

trim (boolean)

If true, fields whose column type is VARCHAR get their trailing spaces removed. The default is false.

truncate (boolean)

If false, a string field that converts to a string too long to fit in the column will give an error and the record will be rejected. If true, such strings will be silently truncated as necessary. The default is false.

unknownchar (character)

Specifies the character to use in place of characters that cannot be imported due to restrictions on a column’s character set. For example, if a string field in a CSV file has a pound sign (£) but the corresponding column only allows ASCII characters, the character given as the value of this attribute will be inserted in place of the pound sign. If this attribute is not set, no such substitution is performed and any record containing uninsertable characters is rejected.

TYPES

There are limitations on which conversion specifiers can be applied to which SQL types. Particularly, %D, %T and %S conversions can only correspond to DATE, TIME and TIMESTAMP columns respectively, and to CHAR and VARCHAR columns. %I conversions can only correspond to INTERVAL columns.

The full matrix of allowed conversions is as follows. A x indicates the conversion is allowed, a - indicates it is not.

diox

c

[

e

f

D

T

S

I

TINYINT

x

x

x

x[1]

x[1]

-

-

-

-

SMALLINT

x

x

x

x[1]

x[1]

-

-

-

-

INTEGER

x

x

x

x[1]

x[1]

-

-

-

-

BIGINT

x

x

x

x[1]

x[1]

-

-

-

-

CHAR

x

x

x

x

x

x

x

x

-

VARCHAR

x

x

x

x

x

x

x

x

-

DECIMAL

x

x

x

x

x

-

-

-

-

REAL

x

x

x

x

x

-

-

-

-

FLOAT

x

x

x

x

x

-

-

-

-

DATE

-

x

x

-

-

x

-

-

-

TIME

-

x

x

-

-

-

x

-

-

TIMESTAMP

-

x

x

-

-

-

-

x

-

INTERVAL

-

x

x

-

-

-

-

-

x

[1] Any non-integer part will be lost.

DOUBLE PRECISION is the same as FLOAT, and NUMERIC is the same as DECIMAL.

For DECIMAL columns, it is not an error if a field contains a value with more decimal places than the column allows. The value will be rounded appropriately, and if the two nearest suitable values are equally close it will be rounded in the direction that makes the last digit even. However, digits to the left of the decimal point will never be lost; if the number is too large for the column, this is an error and the record will be rejected.

EXAMPLES

In the following examples, the newline on the end of the format string is not shown. The file containing the format string, which is passed to wxloader, should normally end with a newline. If your text editor doesn’t automatically put a newline on to the end of a text file if one isn’t present, you need to put one in yourself or add \n to the end of the format string.

CSV

Suppose we have a table defined like this:

create table example1 (
        c1 integer,
        c2 decimal(9, 3),
        c3 float,
        c4 varchar(50),
        c5 timestamp(0)
);

Now suppose we have a file where an example row looks like this:

100,156.944,4.09159e+014,"sponge biscuits",2010-12-10 13:58:01

This is an example of a perfectly formatted CSV record. You wouldn’t need to write your own format string for this; wxloader would do it for you without you even having to specify any conversion options. But for the purpose of illustration, here is a format string which would correctly parse a record like the above:

%d,%f,%e,%[^,],%S
Tolerating spaces

If you try to import this record into the example1 table without supplying a format string, wxloader generates one for you. In this case it generates something slightly different:

%d , %{}f , %e , %{}[^,] , %{}S

If we ignore the empty braces (which have no effect) the only difference is that each element has spaces before and after it. This is so that leading and trailing spaces around fields, provided they are not inside quotes, are skipped. It makes the generated parser more robust in that it can now deal with a record such as:

100, 156.944, 4.09159e+014, "hatstand"  , 2010-12-10 13:58:01
Implied decimal places

Now let’s suppose we have a record whose format is the same as the one above except that no decimal point appears in the second field. We are to infer the presence of a decimal point immediately before the third digit from the right, because the scale of the column is 3. So the example record looks like this:

100, 156944, 4.09159e+014, "hatstand"  , 2010-12-10 13:58:01

The format string might now look like this:

%d , %{implieddp=1}f , %e , %[^,] , %S
Specifying default attributes

Suppose now we have three fixed-point numbers in the file, and the table definition is adjusted accordingly. All the fixed-point fields have implied decimal points. We don’t want to have to put “implieddp=1” in every %f directive, so instead we’ll put the attribute in an %a directive to apply it to all the directives:

%{implieddp=1}a %d , %f , %f , %f , %e , %[^,] , %S
Misordered columns in data file

Let’s go back to our original example1 table. It has an integer, a decimal, a float, a string, and a timestamp, in that order. Suppose we have a file in which the fields are in a different order, like this:

10, 13579.246, "foo", 2010-12-10 13:00:00, -8.022344e+009

We now need to specify column numbers for all our directives that do conversions. Even though only fields 3, 4 and 5 are in the wrong order, we still need to specify the column numbers for the first two fields as well. The format string would look like this:

%1$d , %2$f , %4$[^,] , %5$S , %3$e
Missing fields and optional literals

Now let’s say we have a file containing numbers, names, dates of birth and towns of birth, but some of the records terminate early, e.g.

1,Alice
2,Bob,1982-01-02
3,Charlie,1970-06-01,Reading
4,Dave,1975-12-12,Oxford
5,Edward

We want the missing information to be loaded as NULL. The table is defined like this:

create table birthtowns(
        id int,
        name varchar(20),
        dob date,
        town varchar(20)
);

The format string wxloader generates by default is this:

%d , %{}[^,] , %{}D , %{}[^,\n]

But if we try to load the file with it, it doesn’t work:

Loading into birthtowns from towns.csv

[CONVERSION ERROR] file offset 8, record 2, offset 16
    IE110B: String literal unmatched: expected ",", got "\n"

So we need to specify a default value of the empty string (which is read as NULL) for all fields, we need to make all the literals optional, and string fields need to be prohibited from consuming newlines as well as from consuming commas since every string field might be the last in the record. The correct format string, which is what is generated if you pass the -\-pad-short-records option to wxloader, is this:

%{defaultval=""}a %d %?"," %[^,\n] %?"," %D %?"," %[^,\n]

Note that all the commas have been made optional by specifying them with the %”” syntax and attaching the ? (optional) flag to them.

Constants

Suppose we have a table SALES with three columns: REGION, SALES_DATE and QUANTITY. But what if each region supplies their sales figures in a two-column CSV file containing only the date and quantity? We need to find a way of introducing the REGION name for each record. Let’s say the “Southern” region supplies a CSV file like this:

2010-12-01,140
2010-12-02,100
2010-12-03,220
2010-12-04,80

If we try to load this into the three-column table, we’ll get an error. But all we need to do is tell the server that the first column is always the string “Southern”. We can do this with the following format string:

%^{constant="Southern"}c %D , %d

The ^ flag indicates that the directive must not consume any input, effectively making it a fixed-width field with a field width of zero. The constant attribute specifies the field text. This is converted as if it were an ordinary field, and if the constant value doesn’t make sense for the conversion (e.g. %^{constant=”foo”}d) then every record will be rejected due to conversion errors. The destination column for the constant conversion, like any other conversion, is implicit from its position in the format string unless column numbers are given (see the Column number section earlier). After the load, the table looks like this:

REGION,SALES_DATE,QUANTITY
Southern,2010-12-01,140
Southern,2010-12-02,100
Southern,2010-12-03,220
Southern,2010-12-04,80

Records can now be added to the table from other regions, with the constant conversion directive changed as appropriate, for example:

**%^{constant="Central"}c**.

Fixed width records

If you don’t specify a format string with -F, wxloader assumes the file has a CSV-based format. If you have a fixed-width file, in which field boundaries are determined by their position in the record rather than by the presence of delimiter characters, wxloader can’t generate a format string; you have to write one yourself.

For a true fixed-width format, every conversion directive that consumes data must have a field width. Don’t use spaces, as they match a variable number of characters in the input. If you mix directives with field widths and directives without field widths, and some directives can be quoted, the server may reject your file because it doesn’t know how to split it into records. To require a fixed number of spaces in the record, use the %_ conversion with an appropriate field width, e.g. %10_ to match ten spaces.

Let’s say we have a file of street names, postcodes and grid references. Some typical records might look as follows.

SW1A1AA Buckingham Palace   London              529090179645
SW1A0AA Houses of ParliamentLondon              530268179545
RG121RB Cookham Road        Bracknell           485329169165

It needs to be imported into a table defined like this:

create table postcode_locs(
        postcode char(8),
        street char(20),
        town char(20),
        easting int,
        northing int
);

In the input file, the postcode always takes eight characters, the street name takes the next 20 characters, the town name the next 20, and the easting and northing parts of the grid reference take six characters each. If any field is too short it’s padded with spaces. We’ll assume each record is terminated with a single ‘n’ character. This means every record is exactly 61 characters long.

The format string to describe a record would look like this:

%8c%20c%20c%6d%6d

If this file were produced in a Windows environment and had ‘rn’ at the end of every line rather than just ‘n’, you would need to put a ‘r’ sequence at the end of the format string before the newline. Although ‘n’ matches ‘rn’ as well as ‘n’, if you didn’t also include the ‘r’ the server would think each record was 61 characters instead of 62, and would split the file up incorrectly. Alternatively you could tell the server how big a record is using the wxloader options -\-record-type=fixed and -\-record-width=62.

It’s also advisable to specify a single-byte character set such as LATIN-1 (-c LATIN1) if you don’t have any multibyte characters in your file. Depending on your locale the default might be UTF-8. This is fine since UTF-8 is backwardly compatible with ASCII, but the server can split up records much quicker if it knows a record is a constant number of bytes (as in LATIN-1) rather than a constant number of variable-width characters (as in UTF-8).

Esoterica

Consider the following output of ls -l:

-r-\\-r-\\-r-- 1 root wxadmin 17471 2010-12-10 14:47 wxbackup.1
-r-xr-xr-x 1 root wxadmin  9369 2010-12-10 14:47 wxexport.1
-r-xr-xr-x 1 root wxadmin 23897 2010-12-10 14:47 wximport.1
-r-\\-r-\\-r-- 1 root wxadmin 24292 2010-12-10 14:47 wxrestore.1
-r-xr-xr-x 1 root wxadmin  1318 2010-12-10 14:47 wxsubmit.1

Suppose you want to load these rows into the following table:

create table directory_listing (
        permissions char(10),
        hard_links int,
        user_name varchar(20),
        group_name varchar(20),
        file_size bigint,
        modified_time timestamp(0),
        file_name varchar(256)
);

A suitable format string would be:

%-a%10c %d %[^ ] %[^ ] %d %{format="YYYY-MM-DD HH:NN"}S %[^\n]

The %-a directive at the start applies the - flag to all directives which specifies that no fields are ever quoted. This is required in this case, because mixing fixed-width fields (which may contain quotes that do not delimit strings) with quoted strings (as might be matched by the last directive) is not allowed.

To demonstrate the use of the ? and . flags, let’s consider a data file which describes a route recorded by a GPS unit. We have a table defined thus:

create table gps_track (
        track_seq int,
        track_timestamp timestamp(0),
        lat_degrees int,
        lat_minutes decimal(6, 4),
        lat_sign char(1),
        long_degrees int,
        long_minutes decimal(6, 4),
        long_sign char(1)
);

Each record in the data file contains a sequence number, date, time, latitude and longitude formatted like this:

1,151210,113300,5124.9055,N,00046.4630,W

This means 15th December 2010, at 11:33:00, with a latitude of 51 degrees 24.9055 minutes north, and a longitude of zero degrees 46.4630 minutes west. The NMEA format used by GPS units uses this latitude/longitude format; the above record is a cut-down version of an NMEA line.

The date and time formats are easy enough, but how do we parse the latitude and longitude? We can’t use %d on its own to match the degrees, because it’ll also match the integer part of the minutes as well. We also can’t use %2d (for latitude) because if the field is NULL we’ll read two non-numeric characters and get a conversion error. We either need to give the field a a maximum field width, but not mandate that a fixed number of characters are consumed. %.2d will work for this purpose. The complete format string would be something like this:

%d,%{format="DDMMYY,HHNNSS"}S,%.2d%?f,%[^,],%.3d%?f,%[^,\n]

This would cause the degrees latitude field to consume a maximum of two characters. This is not a fixed-width field; it is a variable-width field with a maximum width. If there is no latitude then the first of the two characters will be a comma, so it will be considered an empty field. The same applies to the degrees longitude field except that has three characters.

Alternatively, this would work:

%-a%d,%{format="DDMMYY,HHNNSS"}S,%?2d%?f,%[^,],%?3d%?f,%[^,\n]

This means the degrees part of the latitude only consumes two characters, and the degrees part of the longitude only consumes three. If the latitude and longitude are not there and the field is empty, It will fail to match and no input will be consumed. Then it’ll attempt to match %?f, fail on that as well, and so set it to NULL. The %-a (no field is quoted) is required, because as explained above you can’t mix quoted and fixed-width fields in a format string without telling the server how to split it into records with -\-record-type.

LIMITATIONS

The wxloader format string is intended to parse flat files where every record refers to the same sort of entity (e.g. a customer or an item), and import the data into one table. It cannot parse structured data formats such as XML. The parser is largely stateless, so any format that requires memory of what has been encountered previously in the record or in the file won’t work.

wxloader aims eventually to replace wximport for all purposes. However, there are some little-used field formats which the IPE_FIELD table can express and which wximport can load, but which wxloader cannot.

SEE ALSO

wxloader, wxloader