Query Hive tables directly from Kognitio

In Kognitio version 8.2.2, you can use our new Hive connector to make your Hive tables appear as Kognitio tables. Kognitio fetches the data directly from its source rather than pulling it through Hive, cutting out the middleman.

A simple example

We create a table in Hive, in this case using Hive’s client tool beeline:

0: jdbc:hive2://localhost:10000> create table hello_world(foo varchar(20), bar varchar(20));
No rows affected (1.479 seconds)
0: jdbc:hive2://localhost:10000> insert into hello_world values ('Hello', 'World');
No rows affected (12.455 seconds)

Then we create an external table in Kognitio. All we need to do is tell Kognitio’s Hive connector what the table’s name is in Hive.

> create external table ext_hello_world from hive target 'table hello_world';
Query 1             1.805s          0 rows affected

> select * from ext_hello_world;
 FOO   | BAR
-------+-------
 Hello | World

Query 2             1.996s          1 row           

How did this work? Kognitio’s Hive connector used the Hive API to find the metadata for the hello_world table. This included the column types (two strings), the data format (delimited text fields), and where Hive stores the data for this table (HDFS, in /user/hive/warehouse/hello_world). This was all the information the Hive connector needed to create an external table using the Kognitio HDFS connector, which fetched the data from HDFS.
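
You can inspect the same metadata yourself from beeline using Hive’s describe formatted command. The output below is heavily abridged, and the exact location varies between distributions:

0: jdbc:hive2://localhost:10000> describe formatted hello_world;
... <snip> ...
Location:       hdfs://.../user/hive/warehouse/hello_world
InputFormat:    org.apache.hadoop.mapred.TextInputFormat
... <snip> ...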

Audience

In this article we won’t talk about how to install Kognitio on Hadoop or create a Kognitio cluster – see the Getting Started guide for that. Kognitio’s Hive connector feature is intended for users who have their data in Hive and want to perform analytics on that data using Kognitio’s SQL engine.

We will make these assumptions:

  • You already have a Hadoop cluster with Hive.
  • You’ve installed Kognitio on the Hadoop cluster.
  • You want to be able to see your Hive tables from Kognitio.

Further information about table creation in Hive can be found in the Hive DDL reference.

Limitations

Currently, Kognitio’s Hive connector and the data connectors it delegates to can read Hive tables whose data is in one of the following formats:

  • Delimited text (Hive DDL: ROW FORMAT DELIMITED FIELDS TERMINATED BY … STORED AS TEXTFILE)
  • ORC (Hive DDL: STORED AS ORC)
  • Parquet (Hive DDL: STORED AS PARQUET)

The Hive connector can read both Hive managed tables and Hive external tables. It can’t read Hive views, because a view has no data files for Kognitio to read.

A larger example: IMDB data

In this section, we’ll copy some IMDB data into HDFS, create external tables in Hive that use this data, and then create some (non-external) Hive ORC and Parquet tables containing the same data. Then we’ll create external tables in Kognitio to read all these tables.

The data we’ll be using comes from IMDB’s publicly available datasets.

Download the name.basics.tsv.gz and title.basics.tsv.gz files from IMDB’s dataset download page. Then unzip them and use the hadoop fs -put command to copy them into HDFS. In this example we have an EMR system on Amazon, and we’re copying the data files into the /user/kodoop/ directory on HDFS. kodoop is the name of the user who installed Kognitio.

$ gunzip name.basics.tsv.gz 
$ gunzip title.basics.tsv.gz 
$ hadoop fs -mkdir /user/kodoop/imdb /user/kodoop/imdb/name /user/kodoop/imdb/title
$ hadoop fs -put name.basics.tsv /user/kodoop/imdb/name/
$ hadoop fs -put title.basics.tsv /user/kodoop/imdb/title/

Now, in Hive, we’ll create two external tables which refer to these data files.

-- Hive SQL
create external table ext_imdb_name (
        id varchar(10),
        name varchar(150),
        birth_year int,
        death_year int,
        primary_profession varchar(100),
        known_for_titles varchar(120)
)
row format delimited
fields terminated by '\t'
stored as textfile
location 'hdfs:///user/kodoop/imdb/name/';

create external table ext_imdb_title (
        id varchar(10),
        title_type varchar(40),
        primary_title varchar(500),
        original_title varchar(500),
        is_adult int,
        start_year int,
        end_year int,
        runtime_minutes int,
        genres varchar(100)
)
row format delimited
fields terminated by '\t'
stored as textfile
location 'hdfs:///user/kodoop/imdb/title/';

Creating tables in Kognitio using the Hive connector

We’re going to create external tables in Kognitio which model themselves on the Hive tables we just created. We won’t need to specify the column types, or the physical location of the table’s data – Kognitio gets all of that information from Hive.

The Kognitio SQL syntax to create an external table from a Hive table is as follows:

create external table mytablename
from hive
target 'table hivetablename';

If the HIVE connector doesn’t already exist on your system, you’ll need to create it as the SYS user, then grant the connect privilege on it to the desired user:
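
-- Run as SYS
create connector hive source hive;
grant connect on connector hive to <user>;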

Privileges

To create tables with the Hive connector, a user only needs the “connect” privilege on the Hive connector itself, not on whichever data connector the work ends up being delegated to.

An extra step for getting the Hive connector working on EMR

On EMR, the Hive client libraries only exist on the edge node (or “master” node) rather than on all the worker nodes. This isn’t particularly convenient, because Kognitio’s software runs on the worker nodes and needs to look things up in Hive. So if you’re on EMR, before you go any further, copy the contents of /etc/hive and /usr/lib/hive from the master node to the same locations on all the other nodes, then restart the Kognitio server (navigate to where you installed the Kognitio software and run ./bin/kodoop server <clustername> start).
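
Here’s a minimal sketch of that copy step, assuming the master node can reach the workers over SSH (the hostnames are placeholders, and writing into /etc and /usr/lib will generally need root privileges):

$ for host in ip-10-2-2-101 ip-10-2-2-102; do
>     rsync -a /etc/hive/ ${host}:/etc/hive/
>     rsync -a /usr/lib/hive/ ${host}:/usr/lib/hive/
> done
$ cd <kognitio-install-dir>
$ ./bin/kodoop server <clustername> start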

SQL

In Kognitio, run the following SQL:

> create external table ext_imdb_name from hive target 'table ext_imdb_name';
Query 9             1.113s          0 rows affected

> create external table ext_imdb_title from hive target 'table ext_imdb_title';
Query 10            0.680s          0 rows affected

Now we can query the table in Kognitio as if it were the Hive table…

> select * from ext_imdb_title
  where primary_title = 'Inception'
  and title_type = 'movie';
 ID        | TITLE_TYPE | PRIMARY_TITLE | ORIGINAL_TITLE | IS_ADULT | START_YEAR | END_YEAR | RUNTIME_MINUTES | GENRES
-----------+------------+---------------+----------------+----------+------------+----------+-----------------+-------------------------
 tt1375666 | movie      | Inception     | Inception      |        0 |       2010 |   <null> |             148 | Action,Adventure,Sci-Fi

Query 11            7.502s          1 row

As in the Hello World example, the Hive connector – a metadata connector – fetched the table’s metadata from Hive. This included the column types, location, data format, delimiter character and so on. Then the Hive connector handed the work of actually fetching the data to Kognitio’s HDFS connector. When we fetch the data, Kognitio connects directly to HDFS rather than going through Hive.

ORC and Parquet tables

By default, Hive tables store their data in HDFS as delimited text files. We can also create Hive tables stored as ORC and Parquet, and read them from Kognitio.

ORC

In Hive, let’s create ORC tables containing the same data as our IMDB “name” and “title” tables…

-- Hive SQL

create table imdb_name_orc (
        id varchar(10),
        name varchar(150),
        birth_year int,
        death_year int,
        primary_profession varchar(100),
        known_for_titles varchar(120)
)
stored as orc;

insert into imdb_name_orc select * from ext_imdb_name;

create table imdb_title_orc (
        id varchar(10),
        title_type varchar(40),
        primary_title varchar(500),
        original_title varchar(500),
        is_adult int, 
        start_year int,
        end_year int,
        runtime_minutes int,
        genres varchar(100)
)
stored as orc;

insert into imdb_title_orc select * from ext_imdb_title;

Now let’s create external tables in Kognitio which fetch data from these Hive tables. Again, we’ll create them from the Hive connector, which will delegate to the Kognitio HDFS_ORC connector. On Kognitio on Hadoop systems, the HDFS_ORC connector should have been created automatically when the system first started up. You don’t need the connect privilege on the HDFS_ORC connector, only on the HIVE connector.

> create external table ext_imdb_name_orc
  from hive
  target 'table imdb_name_orc';
Query 14            0.654s          0 rows affected

> create external table ext_imdb_title_orc
  from hive
  target 'table imdb_title_orc';
Query 15            0.289s          0 rows affected

ORC is a columnar format, so the ORC connector only fetches the columns it needs for the query. If we’re fetching only one column, the query runs faster.

To illustrate this, compare the timings of the same query on the external table we created earlier, which uses delimited text files, and the external table we’ve just created, which uses ORC:

> select birth_year from ext_imdb_name
  where name = 'Benedict Cumberbatch';
 BIRTH_YEAR
------------
       1976

Query 16            7.747s          1 row           

> select birth_year from ext_imdb_name_orc
  where name = 'Benedict Cumberbatch';
 BIRTH_YEAR
------------
       1976

Query 17            3.791s          1 row

In the first case we had to parse the whole of each delimited text record. In the second case the connector only needed to read the birth_year and name columns of each ORC file. That’s why the second query took less time.

Parquet

Kognitio can also read Hive tables backed by Parquet data. Let’s create some tables stored as Parquet in Hive, and insert data into them:

-- Hive SQL

create table imdb_name_parquet (
        id varchar(10),
        name varchar(150),
        birth_year int,
        death_year int,
        primary_profession varchar(100),
        known_for_titles varchar(120)
)
stored as parquet;

insert into imdb_name_parquet select * from ext_imdb_name;

create table imdb_title_parquet (
        id varchar(10),
        title_type varchar(40),
        primary_title varchar(500),
        original_title varchar(500),
        is_adult int,
        start_year int,
        end_year int,
        runtime_minutes int,
        genres varchar(100)
)
stored as parquet;

insert into imdb_title_parquet select * from ext_imdb_title;

And once again, let’s create external tables on Kognitio which use these Hive tables:

> create external table ext_imdb_name_parquet
  from hive
  target 'table imdb_name_parquet';
Query 2             1.018s          0 rows affected

> create external table ext_imdb_title_parquet
  from hive
  target 'table imdb_title_parquet';
Query 3             1.004s          0 rows affected

Now let’s run some example queries on Kognitio:

> select name, birth_year, known_for_titles
  from ext_imdb_name_parquet
  where id = 'nm0001772';
 NAME            | BIRTH_YEAR | KNOWN_FOR_TITLES
-----------------+------------+-----------------------------------------
 Patrick Stewart |       1940 | tt3315342,tt0120844,tt0117731,tt0120903

Query 4            9.383s          1 row           

> select primary_title, start_year
  from ext_imdb_title_parquet
  where id in ('tt3315342','tt0120844','tt0117731','tt0120903');
 PRIMARY_TITLE            | START_YEAR
--------------------------+------------
 Logan                    |       2017
 Star Trek: First Contact |       1996
 Star Trek: Insurrection  |       1998
 X-Men                    |       2000
 
Query 5             4.316s          4 rows

Partitioned Hive tables

In Hive, tables can be “partitioned”. This means the data files are in multiple directories, with each directory corresponding to one value of the partition column. The column’s value appears in the directory name, in the form name=value. This makes it much quicker for a reader to find all the records corresponding to that value. A table can have multiple partition columns – this leads to a hierarchy of directories.
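
For example, a hypothetical table with two partition columns produces a two-level hierarchy, one directory level per partition column:

-- Hive SQL: hypothetical table with two partition columns. Its data
-- files land in directories like .../events/country=GB/start_year=2018/
create table events (
        id varchar(10)
)
partitioned by (country varchar(2), start_year int);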

Let’s continue with our IMDB example, but this time we’ll create a new “title” table in Hive which is partitioned on the START_YEAR column. This means all the records corresponding to a particular START_YEAR value will be grouped into the same subdirectory in HDFS.

Partition columns in a Hive table always appear after all the other columns in the column list, which is why START_YEAR moves to the end of the query results later on.

Creating a partitioned table in Hive

-- Hive SQL

create table imdb_title_partitioned (
        id varchar(10),
        title_type varchar(40),
        primary_title varchar(500),
        original_title varchar(500),
        is_adult int,
        end_year int,
        runtime_minutes int,
        genres varchar(100)
) 
partitioned by (start_year int);

Filling our new partitioned Hive table with data

Now set a special option to tell Hive we want it to assign records to partitions automatically, and insert-select into this table:

-- Hive SQL
set hive.exec.dynamic.partition.mode=nonstrict;
insert into imdb_title_partitioned
partition(start_year)
select id, title_type, primary_title, original_title,
       is_adult, end_year, runtime_minutes, genres, start_year
from ext_imdb_title;

What a partitioned table looks like in HDFS

Note that in HDFS, the /user/hive/warehouse/imdb_title_partitioned/ directory now contains one subdirectory for each distinct START_YEAR value. Depending on your Hadoop distribution, the location in HDFS might be somewhere else, but the directory structure below it will look something like this…

[kodoop@ip-10-2-2-135 ~]$ hadoop fs -ls /user/hive/warehouse/imdb_title_partitioned/
Found 146 items
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=1874
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=1878
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=1881
... <snip> ...
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=2024
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=2025
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=2115
drwxrwxrwt   - kodoop hadoop          0 2018-07-23 12:22 /user/hive/warehouse/imdb_title_partitioned/start_year=__HIVE_DEFAULT_PARTITION__

The special partition value __HIVE_DEFAULT_PARTITION__ indicates that the value is NULL.

Querying the table from Kognitio

We can create an external table in Kognitio to query this new table in the usual way. Kognitio’s Hive connector knows that the table’s data is partitioned, and on which columns. Any query against the Kognitio external table that asks for specific values of a partition column can then be significantly optimised.

To demonstrate this, let’s say we want to find the record for the film Blade Runner, which we know was released in 1982.

If we query a Kognitio external table which uses a non-partitioned Hive table, it takes a few seconds on this system:

> select * from ext_imdb_title
  where start_year = 1982
  and primary_title = 'Blade Runner';
 ID        | TITLE_TYPE | PRIMARY_TITLE | ORIGINAL_TITLE | IS_ADULT | START_YEAR | END_YEAR | RUNTIME_MINUTES | GENRES
-----------+------------+---------------+----------------+----------+------------+----------+-----------------+-----------------
 tt0083658 | movie      | Blade Runner  | Blade Runner   |        0 |       1982 | <null>   |             117 | Sci-Fi,Thriller

Query 3             5.856s          1 row

But if we run the same query on a Kognitio external table which uses a partitioned Hive table, it takes less than a second:

> create external table ext_imdb_title_partitioned
  from hive
  target 'table imdb_title_partitioned';
Query 5             0.683s          0 rows affected

> select * from ext_imdb_title_partitioned
  where start_year = 1982
  and primary_title = 'Blade Runner';
 ID        | TITLE_TYPE | PRIMARY_TITLE | ORIGINAL_TITLE | IS_ADULT | END_YEAR | RUNTIME_MINUTES | GENRES          | START_YEAR
-----------+------------+---------------+----------------+----------+----------+-----------------+-----------------+------------
 tt0083658 | movie      | Blade Runner  | Blade Runner   |        0 | <null>   |             117 | Sci-Fi,Thriller |       1982

Query 6             0.415s          1 row

This is because in the first case we had to read every file, but in the second case we only needed to read the files in …/imdb_title_partitioned/start_year=1982/.
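
Partition pruning isn’t limited to literal values, either. A query asking for rows where start_year is null should only need to read the files under the __HIVE_DEFAULT_PARTITION__ directory we saw earlier:

> select count(*) from ext_imdb_title_partitioned
  where start_year is null;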

Other possibilities

You can use the Hive connector to access any Hive table stored in a manner for which Kognitio has a connector defined. If you have an external table in Hive which gets its data from an S3 bucket, for example, then provided the data is in one of the supported formats, the Kognitio Hive connector can create an external table on Kognitio to access the same data.
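
For example, a Hive external table over an S3 bucket might look like this (the bucket name and path are placeholders):

-- Hive SQL: hypothetical S3-backed external table
create external table s3_hello_world (foo varchar(20), bar varchar(20))
row format delimited
fields terminated by '\t'
stored as textfile
location 's3a://my-bucket/hello_world/';

The Kognitio side is exactly the same as in the HDFS examples:

> create external table ext_s3_hello_world from hive target 'table s3_hello_world';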
