Image: external table to connector relationship

External Tables

External tables are used to access data held outside the Kognitio filesystem, typically in Hive directories or as flat files on HDFS, S3 and similar storage. They use external table connectors to connect to the data source, and the details of the data source are shared between the connector and the table definition.

Typically, the connector defines the general location of the data (local HDFS, an Amazon S3 bucket etc.) and the table definition gives the specific location (directory or object) and format (CSV, JSON etc.) of the data, although the format may instead be defined by the connector (ORC or Parquet, for example).

The image above shows how the external table uses one of the external connectors, and how it is commonly used as the base for a view that selects the required columns and rows, which are then imaged for high speed querying.

Depending on how Kognitio is deployed, a number of connectors are pre-installed when the system is created; these are described below.

Pre-installed Connectors

The available pre-installed connectors depend on your Kognitio Deployment Option:

Kognitio on AWS

  • GLUE - This is a connector for querying and creating tables defined in the AWS Glue Data Catalog. Data can be stored in Avro, ORC, Parquet and delimited text formats (CSV, TSV etc).
  • S3 - This is a block connector which can read flat files (CSV, TSV etc), JSON and Avro format files stored in AWS S3.
  • S3_ORC - This is a connector for reading ORC files stored in AWS S3.
  • S3_PARQUET - This is a connector for reading Parquet files stored in AWS S3.

Note: By default the S3 connectors utilise the credentials of the AWS IAM role specified when creating the cluster with Kognitio Launcher. If this role does not have access to your S3 data then you must specify either the IAM role or the AWS access key and secret key in any S3 external table you create. See the S3 connector attributes for more details.

Kognitio on Hadoop

  • HDFS - This is a block connector which can read flat files (CSV, TSV etc), JSON and Avro format files stored on the local filesystem.
  • HIVE - This is a connector for reading data from the local Hive tables stored in ORC, Parquet and text formats.
  • HDFS_ORC - This is a connector for reading ORC files stored on the local filesystem.
  • HDFS_PARQUET - This is a connector for reading Parquet files stored on the local filesystem.

Note: where connectors are prefixed HDFS they read from the local filesystem defined by the Hadoop fs.defaultFS configuration variable (usually set in core-site.xml). On Azure, this can result in the scheme actually being wasb: or adl: instead of hdfs:.

Kognitio on MapR

  • MAPRFS - This is a block connector which can read flat files (CSV, TSV etc), JSON and Avro format files stored on the local MapR-FS filesystem.
  • HIVE - This is a connector for reading data from the local Hive tables stored in ORC, Parquet and text formats.
  • MAPRFS_ORC - This is a connector for reading ORC files stored on the local MapR-FS filesystem.
  • MAPRFS_PARQUET - This is a connector for reading Parquet files stored on the local MapR-FS filesystem.

Kognitio standalone

There are no pre-installed connectors created on standalone - you or an administrator can use the information in the external table connector documentation to create suitable connectors.

Creating external tables

The examples below show how to create external tables using various connectors, see External Connector Syntax for full details of the connector syntax.

To run the examples below you need:

  1. permission to create external tables
  2. a test schema on Kognitio (or replace test with a schema name of your choice) - see the sketch below.
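
A suitable schema can be created with something along these lines (this assumes you have permission to create schemas; the name test is only used to match the examples below):

create schema test;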

The standard connectors that external tables can be created for fall into several classes:

Block Connector

A block connector is used to read from Text (CSV, TSV etc), JSON and Avro files - examples of block connectors are S3, HDFS, MapR-FS.

For details of how to specify the format of the data in the files see Target String Format Attributes and for details of the options available for load parallelism etc see Target String Attributes.

Example: reading from an HDFS connector

The create external table command for use with the HDFS connector accepts the following specific options - see Create External Table for details of syntax and standard options.

Target String Options
  • file (string, default: none) - The HDFS file name to load. This can also be a wildcard, e.g. /customers/2013/*.csv.
  • list (string, no default) - List files in this directory or which match this pattern. If given, a directory listing is returned rather than the file contents, and any file attribute is ignored.
  • recursive (boolean, default: false) - When listing, recursively descend into directories.
  • dironly (boolean, default: false) - When listing, return information about the directory itself, not its contents.
  • allow_empty_dir (boolean, default: true) - If false, then if a path matches only empty directories, it’s an error. If true, it gives 0 rows.
  • allow_non_match (boolean, default: false) - If true, specifying a path which matches nothing isn’t an error, it gives you 0 rows.

Create the table referencing the location of the files in HDFS - note: this can vary depending on Hadoop distribution:

create external table test.t_hdfs_test (
  col1 int,
  col2 varchar character set utf8,
  pcol int
) from HDFS target 'file "/apps/hive/warehouse/kognitio.db/test/", fmt_field_separator ","';

You can then test this with:

select * from test.t_hdfs_test;

Note: we replaced the type “string” on col2 with “varchar” because Kognitio doesn’t support the string type and specified UTF8 as the character set because this is the default for Hadoop.


Example: reading from a MapR-FS connector

Table Attributes

The create external table command for use with the MapR-FS connector accepts the following specific options - see Create External Table for details of syntax and standard options.

Target String Options
  • file (string, default: none) - The file name to load. This can be a wildcard expression, or a series of file names or expressions separated by spaces.
  • wholefiles (boolean, default: false) - Do not assign different blocks of a file to separate threads; make one thread read a whole file.
  • list (string, no default) - List files in this directory or which match this pattern. If given, a directory listing is returned rather than the file contents, and any file attribute is ignored.
  • recursive (boolean, default: false) - When listing files, recursively descend into directories.
  • dironly (boolean, default: false) - When listing a directory, return information about the directory itself, not its contents.

Create the table referencing the location of the files in MapR-FS:

create external table test.t_maprfs_test (
  col1 int,
  col2 varchar character set utf8,
  pcol int
) from MAPRFS target 'file "/user/hive/warehouse/kognitio.db/test/", fmt_field_separator ","';

You can then test this with:

select * from test.t_maprfs_test;

Note: we replaced the type “string” on col2 with “varchar” because Kognitio doesn’t support the string type and specified UTF8 as the character set because this is the default for Hadoop.


Example: reading Avro data from a block connector

All the block connectors support reading Avro files.

At its simplest, all you have to do is set the attribute “fmt_avro” to 1 and set the “file” attribute to point to your Avro files. The connector will interrogate the Avro files and create column definitions to match the Avro schema in the files, but options for including partition columns etc. are also available.
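
As a minimal sketch, assuming the column list can be omitted and left for the connector to derive from the Avro schema as described above (the path here is illustrative):

create external table test.t_avro_simple
from HDFS target 'file "/apps/hive/warehouse/kognitio.db/avro_simple/", fmt_avro 1';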

Attributes

Target String Options
  • fmt_avro (boolean, default: false) - If set, it tells the connector that the files are in Avro format.
  • fmt_avro_project (string, default: none) - Comma-separated list of Avro fullnames indicating which fields to project.
  • fmt_avro_json_format (boolean, default: none) - If set, each Avro object will be treated like the equivalent JSON object, and the value of the fmt_avro_project attribute is expected to be a JSON format string, which will be used to project the columns. See the Loading JSON reference sheet for more details.

Create the table referencing the location of the files in (in this case) HDFS - note: this location can vary depending on Hadoop distribution:

create external table test.t_avro_test
(
  col1 int,
  col2 varchar character set utf8,
  pcol int
)
from HDFS target 'file "/apps/hive/warehouse/kognitio.db/avro_test/", fmt_avro 1, fmt_avro_project "kognitio.col1, kognitio.col2, inputfilename(\"pcol\")"';

Hive stores its partition column information in the filename, and the “inputfilename” entry in the “fmt_avro_project” part of the target string tells the connector that the third column should be extracted from the filename and that its name is “pcol”. For the columns extracted from the Avro file, we need to supply the Avro fullname, which is <avro namespace>.<avro field name> - in this case it will be <hive database>.<column name>.

You can see how the partition information is encoded in the filename with:

hadoop fs -ls -R /apps/hive/warehouse/kognitio.db/avro_test/
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=1
-rwxrwxrwx   2 hive  hdfs        475 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=1/000000_1
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=2
-rwxrwxrwx   2 hive  hdfs        475 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=2/000000_1
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=3
-rwxrwxrwx   2 hive  hdfs        475 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=3/000000_1
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=__HIVE_DEFAULT_PARTITION__
-rwxrwxrwx   2 hive  hdfs        465 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=__HIVE_DEFAULT_PARTITION__/000000_1

To see what the namespace and field names are in an Avro file, you can dump it with:

$ hadoop fs -get /apps/hive/warehouse/kognitio.db/avro_test/pcol=1/000000_1
$ java -jar avro-tools-1.8.2.jar getmeta 000000_1
avro.schema     {"type":"record","name":"avro_test","namespace":"kognitio","fields":[{"name":"col1","type":["null","int"],"default":null},{"name":"col2","type":["null","string"],"default":null}]}

You can then test this with:

select * from test.t_avro_test;

If you do not have the avro-tools-n.n.n.jar file available, it can be downloaded by following the links from the Avro Download Page.

Notes

Each file matched by the value of the file attribute must be an Avro Object Container File.

All Avro files matched by the value of the file attribute must contain an identical Avro schema; if this is not the case, attempting to create the external table will give an error. If, after the external table is created, new files are introduced into the filesystem which match any wildcard pattern given in the file attribute and whose schema does not match the other files, the behaviour is undefined.

If some of the fields appear in the path name, like this:

/user/joe/data/item_manufacturer_id=10/items.avro
/user/joe/data/item_manufacturer_id=20/items.avro

Then in the Avro project string (fmt_avro_project), use inputfilename("<name>") to get at the value for that name:

CREATE EXTERNAL TABLE ext_items (
  id INT, manufacturer_id INT, name VARCHAR(100), price DECIMAL(7,2)
) FROM connector_name
TARGET 'fmt_avro 1, file "/user/joe/data/*/items.avro",
        fmt_avro_project "itemid, inputfilename(\"item_manufacturer_id\"), itemname, temprice"';

General Block Connector Notes

Large files are divided into blocks and each thread parses a block. Newline characters are taken as record terminators. If the input files contain any newline characters inside fields (quoted CSV files may have this), you need to set the wholefiles attribute.
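
For example, a hedged sketch of reading quoted CSV files that may contain embedded newlines (the connector, path and columns are illustrative); setting the wholefiles attribute mentioned above to 1 makes one thread read each whole file:

create external table test.t_quoted_csv (
  col1 int,
  col2 varchar character set utf8
) from HDFS target 'file "/data/quoted_csv/*.csv", fmt_field_separator ",", wholefiles 1';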

It is common for text files to be compressed with gzip. Files whose names end with .gz are assumed to be gzip files, and are transparently decompressed by the connector. The connector behaves as if the attribute wholefiles is true for these files.

Glue Connector

See the Glue Connector documentation for syntax and examples.

Hive Connector

See the Hive Connector documentation for syntax and examples.

ORC Connector

For the ORC connector you must include all the columns from the underlying table, and they must be in the correct order. This is because many versions of Hive do not store the column names in the ORC file (see the Hive JIRA), which makes it difficult to validate the external table definition, so you will need to take particular care when defining the table structure. Using the Hive Connector avoids this problem if you have access to the Hive metastore.

Although we can’t check the column names in the ORC file, we do check the names of the partition columns against the column names in the Kognitio external table definition (to ensure that any directory names that look like partitions but aren’t are excluded), so you need to use the exact column names from the Hive definition for those columns in the Kognitio external table definition.

The ORC connector supports column projection so that only the columns required to satisfy a query directly on the table, or on a view on the table, will be read from disk. Specifying all the columns in the table definition does not circumvent this since the projection list is calculated at compile time and may not actually contain any columns at all (for a count(*), for example) - in this case the ORC reader returns the number of records using the ORC metadata, saving a lot of disk IO.

Similarly, queries that can be satisfied entirely from the partition columns (a count(*) grouped on a partition column for example) will use the metadata to improve performance and can be used to get a feel for the partitions of a Hive table without having to create a memory image of the whole table.

The create external table command for use with the ORC connector accepts the following specific options - see Create External Table for details of syntax and standard options.

Target String Options
  • uri_path (string, no default) - The location of the data - this is appended to the uri_location from the connector. See below for more details.

The ORC connector uses the JAVA plugin so if the plugin is not already loaded and active run:

create module java mode active;

Then create the connector:

create connector orc_test
  source java
  target 'class com.kognitio.javad.jet.OrcConnector, uri_location "hdfs://cluster1:8020"';

This creates a connector that will read from the HDFS filesystem on cluster1. If you just want to read from the local filesystem, you can specify “default” instead and the connector will use the default filesystem from the “fs.defaultFS” Hadoop configuration parameter.
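
For example, a connector reading from the default filesystem might look like this (the connector name orc_local is illustrative):

create connector orc_local
  source java
  target 'class com.kognitio.javad.jet.OrcConnector, uri_location "default"';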

Create the table referencing the location of the files in HDFS - note: this can vary depending on Hadoop distribution:

create external table test.t_orc_test (
  col1 int,
  col2 varchar character set utf8,
  pcol int
) from orc_test target 'uri_path "/apps/hive/warehouse/kognitio.db/orc_test/"';

You can then test this with:

select * from test.t_orc_test;

Note: we replaced the type “string” on col2 with “varchar” because Kognitio doesn’t support the string type and specified UTF8 as the character set because this is the default for Hadoop.
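
As discussed above, queries that can be satisfied from the ORC metadata or from the partition columns alone avoid reading the column data, for example:

select count(*) from test.t_orc_test;

select pcol, count(*) from test.t_orc_test group by pcol;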

URI

The location of the data is defined by the URI, which is a combination of the uri_location set in the connector definition and the uri_path set in the table definition. The URI is a Hadoop filesystem URI such as you might use in a command like “hdfs dfs -ls hdfs://cluster1:8020/user/fred”. See Hadoop Filesystem URIs for more details.

Because the location and path are appended, you can create a connector that specifies the initial part of the path so that users of that connector are restricted to a certain set of files.

Example: Creating a restricted connector and external table

Define the restricted connector:

create connector orc_test
  source java
  target 'class com.kognitio.javad.jet.OrcConnector, uri_location "hdfs://cluster1:8020/apps/hive/warehouse/kognitio.db"';

This creates a connector that can only access files from tables in the kognitio schema, so creating a table that accesses files in the Hive kognitio.orc_test table looks like this:

create external table test.orc_test (
  col1 int,
  col2 varchar character set utf8,
  pcol int
) from orc_test target 'uri_path "/orc_test/"';

Parquet Connector

The create external table command for use with the Parquet connector accepts the following specific options - see Create External Table for details of syntax and standard options.

The connector accesses data by column name (case insensitive). For each column in the external table definition, it looks in the Parquet files and in the Partition information in the directory name (using the Hive convention of key=value for the directory names). If it can’t find the column in either, it will put null into the column. This allows columns that may or may not be in the underlying Parquet files (for example data that has only been available from a certain date) to be specified in the external table definition.

This means that you must keep the column names the same but you can re-order them (though we would recommend that you make the Kognitio external table the same as the Hive table and select columns via views to minimise the differences between the systems for the users).

If you are using a subset of the columns in the table, it is not necessary to define external tables for the subsets to enable column projection. Instead, a single external table defining all the columns should be used; the query compiler will work out which columns are needed and supply this information to the connector, so that only the columns required for the query (or create image) are read from the file - a sketch of this is shown after the example below.

Target String Options
  • uri_path (string, no default) - The location of the data - this is appended to the uri_location from the connector. See below for more details.

The Parquet connector uses the JAVA plugin so if the plugin is not already loaded and active run:

create module java mode active;

Then create the connector:

create connector parquet_test
  source java
  target 'class com.kognitio.javad.jet.ParquetConnector, uri_location "hdfs://cluster1:8020"';

This creates a connector that will read from the HDFS filesystem on cluster1. If you just want to read from the local filesystem, you can specify “default” instead and the connector will use the default filesystem from the “fs.defaultFS” Hadoop configuration parameter.

Create the table referencing the location of the files in HDFS - note: this can vary depending on Hadoop distribution:

create external table test.t_parquet_test (
  col1 int,
  col2 varchar character set utf8,
  pcol int
) from parquet_test target 'uri_path "/apps/hive/warehouse/kognitio.db/parquet_test/"';

You can then test this with:

select * from test.t_parquet_test;

Note: we replaced the type “string” on col2 with “varchar” because Kognitio doesn’t support the string type and specified UTF8 as the character set because this is the default for Hadoop.
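
As described above, the Parquet connector matches columns by name, returns null for columns missing from the underlying files, and supports column projection, so column subsets are best exposed through views over a single external table. A hedged sketch, assuming a newer column new_col is only present in some of the Parquet files (the column and view names are illustrative):

create external table test.t_parquet_test2 (
  col1 int,
  col2 varchar character set utf8,
  new_col int,
  pcol int
) from parquet_test target 'uri_path "/apps/hive/warehouse/kognitio.db/parquet_test/"';

create view test.parquet_subset as
  select col1, col2, pcol from test.t_parquet_test2;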

URI

The location of the data is defined by the URI, which is a combination of the uri_location set in the connector definition and the uri_path set in the table definition. The URI is a Hadoop filesystem URI such as you might use in a command like “hdfs dfs -ls hdfs://cluster1:8020/user/fred”. See Hadoop Filesystem URIs for more details.

Because the location and path are appended, you can create a connector that specifies the initial part of the path so that users of that connector are restricted to a certain set of files.

Example: Creating a restricted connector and external table

Define the restricted connector:

create connector parquet_test
  source java
  target 'class com.kognitio.javad.jet.ParquetConnector, uri_location "hdfs://cluster1:8020/apps/hive/warehouse/kognitio.db"';

This creates a connector that can only access files from tables in the kognitio schema, so creating a table that accesses files in the Hive kognitio.parquet_test table looks like this:

create external table test.parquet_test (
  col1 int,
  col2 varchar character set utf8,
  pcol int
) from parquet_test target 'uri_path "/parquet_test/"';

Other Connectors

There are a wide variety of connectors available as documented in external table connectors.

Schema Organisation

Since external tables are normally used to bring external data tables into Kognitio, and because they need to be imaged via a view, the convention is to prefix the table name with t_ and make the view name the same as the table name in the source system. This allows the building of schema and table layouts that mirror the source system and therefore allow SQL to run unchanged.

A common use case where this is particularly useful is mirroring an operational database to provide both analytical query acceleration and reduced load on the source database.
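
A minimal sketch of this convention, assuming an ORC connector like the orc_test connector from the ORC example above and an illustrative source table mydb.orders: the external table takes the t_ prefix, the view keeps the source table name, and the view is then imaged for high speed querying.

create external table mydb.t_orders (
  order_id int,
  order_date date,
  pcol int
) from orc_test target 'uri_path "/apps/hive/warehouse/mydb.db/orders/"';

create view mydb.orders as
  select order_id, order_date, pcol from mydb.t_orders;

create view image mydb.orders;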