Reading AVRO data into Kognitio

AVRO support was introduced in Kognitio 8.2, and in this post I’ll briefly go through what it is and how to load it from HDFS. This is for Kognitio on Hadoop; if you need to install it, see the install guide here.

So what is AVRO?

AVRO is an open source serialization framework for data held in Hadoop. It translates data into a compact binary form and embeds the data definition, or schema, defined in JSON, in the same file. This makes the data not only more compact but also more efficient to work with, since each file carries both the data and the information needed to read it.

A key feature of AVRO is schema evolution. Data is likely to change over time: fields get added, edited or are sometimes missing altogether. Schema evolution ensures that data remains compatible regardless of its age, making changes much more manageable.

More information can be found here.

Basically there’s a schema:

{
  "type" : "record",
  "name" : "fruit_schema",
  "namespace" : "kogfruit",
  "fields" : [ 
    {"name" : "fruitname", "type" : "string"}, 
    {"name" : "color", "type" : "string"}, 
    {"name" : "amount", "type" : "int"} 
  ]
}

and some data (in this instance we have it in JSON format):

{"fruitname":"apple","color":"red","amount":25}
{"fruitname":"banana","color":"yellow","amount":84}
{"fruitname":"grape","color":"green","amount":39}

Loading AVRO data in Kognitio

I’ve attached two premade AVRO files here which you can put into HDFS to try the examples below. The archive also includes the JSON data and schema definitions I used to generate the AVRO files. If you’re interested in making your own AVRO file, take a look at this guide I followed to create it.

The AVRO quick reference sheet will also be useful.

We’ll use the fruit data from the previous section, plus another file whose contents are:

{"fruitname":"apple","color":"red","amount":10}
{"fruitname":"banana","color":"yellow","amount":112}
{"fruitname":"grape","color":"green","amount":79}

You’ll also need the HDFS connector, which ships by default on Hadoop-based systems.

Place the files into HDFS and make a note of the directory.
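
For example, assuming you have the avro-tools jar available (the version below is a placeholder, and fruit.avsc/fruit.json are simply what I’ve called the schema and data files), you can build an AVRO file from the JSON and upload it; the hdfs dfs -put step is the same if you’re using the attached files:

# Build the AVRO file from the JSON records and schema (avro-tools ships with Apache Avro)
java -jar avro-tools-1.11.1.jar fromjson --schema-file fruit.avsc fruit.json > fruit.avro

# Create the HDFS directory and upload the file
hdfs dfs -mkdir -p /user/chakl/depot=1
hdfs dfs -put fruit.avro /user/chakl/depot=1/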

Basic access

Then use the following SQL query to access it, inserting your directory in the file argument:

external table
from HDFS
target '
    fmt_avro 1
    ,file  /user/chakl/depot=1/fruit.avro
';

Syntax explanation:
external table – declares an external table, i.e. one that reads data from outside Kognitio rather than from a table stored in it
from HDFS – the connector to use (HDFS, S3 and so on); HDFS ships with Kognitio on Hadoop by default
target – the details of what you want to access and where it is

It picks up the column names from the schema automatically, so you don’t need to define them separately as you would for external tables based on other tables in Kognitio. If you want to rename a column or redefine its type, though, you will need to supply a column list, as the next example does.

This query returns:

[Screenshot: results of the basic AVRO read]
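
If you’ll be querying this data repeatedly, it may be worth persisting the definition rather than repeating the target string each time. A minimal sketch, assuming Kognitio’s create external table form and using fruit_avro as a name of my own choosing:

create external table fruit_avro
from HDFS
target '
    fmt_avro 1
    ,file  /user/chakl/depot=1/fruit.avro
';

select * from fruit_avro;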

Specific columns

What if we don’t want all the columns? You can specify just the ones you want with fmt_avro_project; say we don’t care about the color of the fruit:

external table(
    fruitname varchar(20)
    ,amount int
)
from HDFS
target '
    fmt_avro 1
    ,file  /user/chakl/depot=1/fruit.avro
    ,fmt_avro_project "kogfruit.fruitname, kogfruit.amount"
';

[Screenshot: results with only the selected columns]

In this case we chose to rename the columns, because if your AVRO file has a namespace it needs to be included in fmt_avro_project; without renaming, the namespace would end up as part of the column name, which is usually unwanted.

Use JSON formatting options

You may have read JSON in Kognitio before and liked the formatting options. Since AVRO uses JSON for its schema, can we use them here too? Yes, with fmt_avro_json_format:

external table(
    filename varchar(20)
    ,item_no int
    ,fruitname varchar(20)
    ,color varchar(20)
    ,amount int
)
from HDFS
target '
    fmt_avro 1
    ,file  /user/chakl/depot=1/fruit.avro
    ,fmt_avro_json_format 1
    ,fmt_avro_project "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,fruitname
    ,color
    ,amount
    "
';

[Screenshot: results with the JSON formatting options applied]

So we can wrap the AVRO fields in firstvalid to handle missing values, use inputbasename for the filename, and objectnumber for a sequence number showing where each record appears in the file. If you’re interested in more JSON formatting options, I wrote a blog about reading JSON from S3 here.

Fields from path names

New data may arrive in a different folder, so if I had fruit from two different depots they may sit in HDFS like this:

/user/chakl/depot=1/fruit.avro
/user/chakl/depot=2/fruit.avro

This may seem a bit strange to newcomers, but it’s how Hive stores partitioned data on Hadoop systems.

There isn’t a column for the depot inside the files, so we can use inputfilename in fmt_avro_project to pick up those path levels and identify which file each row of data came from, combining it with a wildcard in the file argument:

external table (
    depot int
    ,fruitname varchar(10)
    ,color varchar(10)
    ,amount int
)
from HDFS
target '
    fmt_avro 1
    ,file  /user/chakl/*/fruit.avro
    ,fmt_avro_project "inputfilename(\"depot\"),kogfruit.fruitname,kogfruit.color,kogfruit.amount"
'
;

[Screenshot: results with the depot taken from the path name]
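
Again assuming the create external table form sketched earlier works on your system, one nice consequence is that the wildcard definition can be persisted once and depot then queried like any ordinary column (fruit_by_depot is a made-up name):

create external table fruit_by_depot (
    depot int
    ,fruitname varchar(10)
    ,color varchar(10)
    ,amount int
)
from HDFS
target '
    fmt_avro 1
    ,file  /user/chakl/*/fruit.avro
    ,fmt_avro_project "inputfilename(\"depot\"),kogfruit.fruitname,kogfruit.color,kogfruit.amount"
';

select depot, fruitname, sum(amount) as total
from fruit_by_depot
group by depot, fruitname;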

Other file formats

Using a file format other than AVRO? Kognitio also supports other Hadoop formats such as Parquet and ORC via external tables, as well as traditional CSV and JSON. If you’re interested in a format we haven’t listed or have other questions, get in touch via our contact page or the community forum.
