Using National Character Sets

You can use Kognitio to store national characters based on the syntax extensions to SQL:1999, which use Unicode and ISO standards. Using these extensions, you can store Latin and national characters, including Kanji.

Overview

The best known and most widely used character encoding standard is ASCII, which is based on 7-bit byte character strings and has enough characters to encode English text, but no other major written languages.

ISO has standardized several 8-bit extensions of ASCII for various groups of Latin-based writing systems. Latin-1 supports Western European languages and is widely used. Unicode is the problem-free way to handle written languages that are not in the Latin-1 list.

The Unicode Standard

The Unicode Standard is an effort to encode all the world’s characters in one standard. Unicode encodes over 100,000 characters The Unicode Standard specifies a numeric value and a name for each of its characters. In this respect, it is similar to other character encoding standards such as ASCII.

The range of integers used to code the characters is called the code space. A particular integer in this range is called a code point. When a character is mapped or assigned to a particular code point in the code space, it is referred to as a coded character.

The Unicode Standard defines three encoding forms that allow the same data to be stored and transmitted in a byte, double-byte or quad-byte oriented format (that is, in 8-, 16-, or 32-bits per code unit). All three encoding forms encode the same common characters and can be efficiently transformed into one another without data loss.

The three encoding forms are:

Name

Variable Width

Description

UTF-8

Yes

Stores each code point as a single 8-bit unit (the ASCII characters), or as two, three or four 8-bit sequences.

UTF-16

Yes

Stores each code point using either a single 16-bit unit or as a two 16-bit units.

UTF-32

No

Stores each code point as a 32-bit unit.

All three encoding forms need at most 4 bytes of data for each character.

Note: UCS-2 (2-byte Universal Character Set) is a similar yet older character encoding that was superseded by UTF-16 in Unicode version 2.0, though it still remains in use. The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and as a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value.

Kognitio Character Set Specification

The full syntax for specifying a CHAR or VARCHAR field is:

CHAR|VARCHAR (length) [CHARACTER SET character-set]

The character-set specified here is used for storage and conversion. For storage, different character sets take different amounts of space. For example:

Name

Bytes per character

Full Unicode

Variable Width

LATIN1

1

No

No

UCS2

2

No

No

UTF8

1-4

Yes

Yes

UTF16

2 or 4

Yes

Yes

UTF32

4

Yes

No

The character sets recognised by Kognitio are defined in the system tables IPE_CHARACTER_SET and IPE_CHARSET_TRANSLATION; custom 1 byte character sets can be added.

If the character set for a field is not specified, then the system default character set is used. When creating tables, the schema default character set overrides the system default. The system default character set is initially set to LATIN1.

Defaults are set using the following SQL syntax:

CREATE SCHEMA schema DEFAULT CHARACTER SET character-set
ALTER SCHEMA schema SET DEFAULT CHARACTER SET TO character-set
ALTER SYSTEM SET DEFAULT CHARACTER SET TO character-set

CHAR fields can only use fixed width character sets, whereas VARCHAR fields can use either fixed or variable width character sets. For this reason you cannot use a variable width character set as the default character set for a schema.

Normal character operations all work as expected, with automatic conversion of character sets being performed by Kognitio when required.

Notes: the automatic conversion of character sets can result in a performance penalty if, for example, strings have to be converted from a single byte representation to a four byte representation.

At present the upper and lower case functions consider Unicode, however the length of strings cannot change (this is the simplest form of conversion), so, for example, LOWER(‘Δ’) will result in ‘δ’, but UPPER(‘ß’) will not result in ‘SS’.

The character set encoding for a session is determined by the client application and ODBC which will covert if necessary for the server. Kognitio does not support multiple character sets in the same SQL statement.

String Comparison

String comparison and sorting is performed using the UCS_BASIC collation order. Where a character offset of length is used, this will normally be in characters rather than bytes.

String Length

The CHAR_LENGTH function defaults to returning the string length as the number of characters, but, if required, it can return the number of bytes; there is also a function, OCTET_LENGTH which returns the string length as the number of bytes

Entering Unicode

Identifiers can be entered in Unicode and do not need quoting if they are in the regular identifier form specified by SQL (i.e. a letter followed by a sequence of letters, numbers, non-spacing marks, spacing-combining marks, connector punctuation or formatting codes).

For characters not in the session’s character set, there is a Unicode delimited form which allows any Unicode character to be specified using its code point escaped by backslash or another escape character.

The following three string literals are equivalent:

  • ‘Pólya’

  • U&’P\00F3lya’

  • U&’P/00F3lya’ UESCAPE ‘/’

The following four identifiers are equivalent; the first two are equivalent because identifiers are case insensitive.

  • Erdős

  • ERDŐS

  • U&”ERD\0150S”

  • U&”ERD/0150S” UESCAPE ‘/’

Altering a Column’s Character Set Specification

It is possible to change the specification of a string column using:

ALTER TABLE table
ALTER COLUMN column
ALTER TYPE TO CHAR | VARCHAR(length)
CHARACTER SET character-set

This will also cause the underlying data in the columns to be changed to the new character set.

Alternatively, you can inform Kognitio that you want to change the specification of a column but not alter the underlying data, by using:

ALTER TABLE table
ALTER COLUMN column
ALTER CHARACTER SET TREAT AS character-set

Using the ALTER CHARACTER SET TREAT AS variant, you cannot change the number of bytes per character for a CHAR field; so altering from LATIN1 to LATIN9 would be valid, but LATIN1 to UCS2 would not.

In general, you cannot make a CHAR field use a variable width character set.

CHAR vs. VARCHAR for short strings

With multi-byte character sets, the previous wisdom of using fixed-length char fields for small strings might need to be reviewed – if multi-byte characters are rare in for example a UTF-8 encoding, this might be more compact than having to use UTF-32, even for short strings; this is illustrated below:

../_images/data-unicode01.png