Avro Connector

This reference sheet describes how to create external tables that read Avro files on S3, HDFS, or MapR-FS filesystems.

See external table connectors for details of which connector to use for your deployment option.

In the simplest case, all you have to do is set the fmt_avro attribute to 1 and set the file attribute to point to your Avro files, but with a little extra configuration the connector can also deal with Hive-style partition structures.

Prerequisites

Kognitio server version 8.2.1 or later.

You must be using one of the block-loading connectors: S3, HDFS, or MapR-FS.

All Avro files referenced by an external table must have an identical schema.

Attributes

Target String Options
fmt_avro
    boolean (default: false). If set, it tells the connector that the files are in Avro format.
fmt_avro_project
    string (no default). Comma-separated list of Avro fullnames indicating which fields to project.
fmt_avro_json_format
    boolean (no default). If set, each Avro object will be treated like the equivalent JSON object, and the value of the fmt_avro_project attribute is expected to be a JSON format string, which will be used to project the columns. See the Loading JSON reference sheet for more details.

Examples

To run the examples below you need permission to create connectors and external tables, must have built the simple Hive tables, and must have created a "test" schema on Kognitio.

The Avro connector uses the Java plugin, so if the plugin is not already loaded and active, run:

CREATE MODULE java MODE active;

Then create a connector for the default file system:

CREATE CONNECTOR hdfs_test SOURCE hdfs;

Create the table referencing the location of the files in HDFS (note that this location can vary depending on the Hadoop distribution):

CREATE EXTERNAL TABLE test.t_avro_test
(
  col1 int,
  col2 varchar character set utf8
)
FROM hdfs_test TARGET 'file "/apps/hive/warehouse/kognitio.db/avro_test/", fmt_avro 1';

select * from test.t_avro_test;

COL1    COL2
1       Number 1 Partition 1
2       Number 2 Partition 1
3       Number 3 Partition 1
4       Number 4 Partition 1
5       Number 5 Partition 1
6       Number 6 Partition 1
7       Number 7 Partition 1
8       Number 8 Partition 1
9       Number 9 Partition 1
10      Number 10 Partition 1
...

Extracting data from the Avro files is easy, but we are reading a table built by Hive, and the partition column is not stored in the Avro files; it is encoded in the filenames.

To extract the partition column we can use the inputfilename function in the fmt_avro_project part of the target string, telling the connector that the third column should be extracted from the filename and that its name is pcol.

When using this method we need to supply the Avro fullname for each column extracted from the Avro file, which is <avro namespace>.<avro field name>; in this case it will be <hive database>.<column name>.

You can see how the partition information is encoded in the filename with:

hadoop fs -ls -R /apps/hive/warehouse/kognitio.db/avro_test/
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=1
-rwxrwxrwx   2 hive  hdfs        475 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=1/000000_1
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=2
-rwxrwxrwx   2 hive  hdfs        475 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=2/000000_1
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=3
-rwxrwxrwx   2 hive  hdfs        475 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=3/000000_1
drwxrwxrwx   - hive  hdfs          0 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=__HIVE_DEFAULT_PARTITION__
-rwxrwxrwx   2 hive  hdfs        465 2018-09-07 13:58 /apps/hive/warehouse/kognitio.db/avro_test/pcol=__HIVE_DEFAULT_PARTITION__/000000_1
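As the listing shows, Hive encodes each partition value as a <column>=<value> path component. The extraction the connector's inputfilename function performs can be sketched in a few lines (this illustrates the path convention only, not the connector's own code):

```python
import re

def partition_value(path, column):
    """Extract a Hive-style partition value (<column>=<value>) from a file path."""
    m = re.search(r'/%s=([^/]+)/' % re.escape(column), path)
    return m.group(1) if m else None

# The directory name carries the value of pcol for every file beneath it.
print(partition_value(
    "/apps/hive/warehouse/kognitio.db/avro_test/pcol=1/000000_1", "pcol"))  # 1
```

Note that Hive uses the sentinel directory pcol=__HIVE_DEFAULT_PARTITION__ (visible in the listing above) for rows whose partition value was NULL.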

To see what the namespace and field names are in an Avro file, you can dump it with:

$ hadoop fs -get /apps/hive/warehouse/kognitio.db/avro_test/pcol=1/000000_1
$ java -jar avro-tools-1.8.2.jar getmeta 000000_1
avro.schema       {"type":"record","name":"avro_test","namespace":"kognitio","fields":[{"name":"col1","type":["null","int"],"default":null},{"name":"col2","type":["null","string"],"default":null}]}

N.B. if you do not have the avro-tools-n.n.n.jar file available, it can be downloaded by following the links from the Avro download page.
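The names to pass to fmt_avro_project can be derived mechanically from the dumped avro.schema metadata: as described above, the fullname of each column is the record's namespace joined to the field name. A small sketch using the schema dumped above:

```python
import json

# The avro.schema metadata printed by avro-tools getmeta above.
schema = json.loads(
    '{"type":"record","name":"avro_test","namespace":"kognitio",'
    '"fields":[{"name":"col1","type":["null","int"],"default":null},'
    '{"name":"col2","type":["null","string"],"default":null}]}'
)

# Build <namespace>.<field name> for each field, which is the form
# fmt_avro_project expects for columns read from the Avro file.
fullnames = ["%s.%s" % (schema["namespace"], f["name"]) for f in schema["fields"]]
print(", ".join(fullnames))  # kognitio.col1, kognitio.col2
```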

The external table definition for extracting the partition columns is then:

CREATE EXTERNAL TABLE test.t_avro_testp
(
  col1 int,
  col2 varchar character set utf8,
  pcol int
)
FROM hdfs_test TARGET 'file "/apps/hive/warehouse/kognitio.db/avro_test/",
                       fmt_avro 1,
                       fmt_avro_project "kognitio.col1, kognitio.col2, inputfilename(\"pcol\")"';


select * from test.t_avro_testp;

COL1    COL2                     PCOL
1       Number 1 Partition 1     1
2       Number 2 Partition 1     1
3       Number 3 Partition 1     1
4       Number 4 Partition 1     1
5       Number 5 Partition 1     1
6       Number 6 Partition 1     1
7       Number 7 Partition 1     1
8       Number 8 Partition 1     1
9       Number 9 Partition 1     1
10      Number 10 Partition 1    1

Please note, if you are reading an Avro table created by Hive, it is generally better to use the Hive Connector.

Notes

Each file matched by the value of the file attribute must be an Avro Object Container File.
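An Avro Object Container File can be recognised locally by its 4-byte magic, the bytes "Obj" followed by 0x01 (per the Avro specification). A quick sanity check, independent of the connector, might look like:

```python
import io

AVRO_MAGIC = b"Obj\x01"  # magic bytes defined by the Avro Object Container File spec

def is_avro_container(stream):
    """Return True if the binary stream starts with the Avro OCF magic."""
    return stream.read(4) == AVRO_MAGIC

# In practice you would pass open("000000_1", "rb"); in-memory bytes shown here.
print(is_avro_container(io.BytesIO(b"Obj\x01rest-of-header")))  # True
print(is_avro_container(io.BytesIO(b"PAR1not-avro")))           # False
```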

All Avro files matched by the value of the file attribute must contain an identical Avro schema; if they do not, attempting to create the external table will give an error. If, after the external table is created, new files matching any wildcard pattern in the file attribute are introduced into the filesystem and their schema does not match the other files, behaviour is undefined.
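When checking a set of files against this requirement by hand, note that textual differences in the dumped avro.schema metadata (whitespace, spacing inside the JSON) do not necessarily mean different schemas; comparing the parsed JSON documents is a more forgiving first check (a local sanity check, not the connector's own comparison):

```python
import json

# Two dumps of the same schema, differing only in whitespace.
schema_a = ('{"type":"record","name":"avro_test","namespace":"kognitio",'
            '"fields":[{"name":"col1","type":["null","int"],"default":null}]}')
schema_b = ('{"type": "record", "name": "avro_test", "namespace": "kognitio", '
            '"fields": [{"name": "col1", "type": ["null", "int"], "default": null}]}')

# Parsed-JSON equality ignores insignificant formatting differences.
print(json.loads(schema_a) == json.loads(schema_b))  # True
```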