Using Kognitio on Amazon Elastic Map Reduce

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.

You can also follow this instructional video:

Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes – the ‘master node’, which runs Hadoop specific cluster master programs like the HDFS namenode and Yarn resource manager, the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it, it gets transfered to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node so that’s how this example will work. There are other possible choices here – you can just use one of the cluster nodes, you can spin up a specific task group node to run it or you can just have an arbitrary EC2 node with the right security settings and client software installed. However the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Setup authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop_extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:

#!/bin/bash

#change the s3 bucket for your site
S3BUCKET=s3://kognitio-development

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

hn=`hostname`
grep -v '<\/configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>\$hn:2181</value>
  </property>
</configuration>
XXX
cp  kodoop/slider/conf/slider-client.xml  kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml  kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression
XXX
EOF

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (You’ll want to change that to be your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This setting enables message compression, which causes messages to get compressed (with the LZ4 compression algorithm) before being sent over a network. This setting is not normally used but we recommend it for Amazon because the network:cpu speed ratio is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have this run automatically as part of the cluster startup. You can do this by transfering the script to S3 and putting it together in a directory with the tarballs (and editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor, select the release you want (we’ve tested it with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper to be selected from the list of packages, although adding others which you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:

[{"classification":"core-site","properties":{"yarn.nodemanager.delete.debug-delay-sec":"3600"}}]

This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can do s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to s3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will now run automatically on the masternode after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node so this can be any instance type you like, the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use the largest of whatever node type instead of a larger number of smaller instances (so 3x r3.8xlarge instead of 6x r3.4xlarge for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want to use more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is to have the total RAM of the nodes which will be used for the Kognitio cluster be between 1.5x and 2x the size of the raw data you want to load as memory images.

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your node. This is required because the AMI used by EMR needs to have a small amount of configuration done and some extra Linux packages installed in order for it to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define this as a ‘custom action’ boostrap action. The following script does everything you need:

#!/bin/bash

sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide us with a simple way to do this. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses this. Amazon runs the bootstrap action before the relevant repositories have been configured on the node so the RPM can’t be installed via yum. By the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.

Once this is working you can create a Kognitio cluster. You will create a number of Yarn containers with a size you specify. You will need to choose a container size, container vcore count and a number of containers that you want to use for the cluster. Normally you’ll want to use a single container per node which uses nearly all of the memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
ip-172-40-0-91.eu-west-1.compute.internal:8041          RUNNING ip-172-40-0-91.eu-west-1.compute.internal:8042                             1
ip-172-40-0-126.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-126.eu-west-1.compute.internal:8042                            2
ip-172-40-0-216.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-216.eu-west-1.compute.internal:8042                            1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status ip-172-40-0-91.eu-west-1.compute.internal:8041
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Node Report : 
        Node-Id : ip-172-40-0-91.eu-west-1.compute.internal:8041
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address : ip-172-40-0-91.eu-west-1.compute.internal:8042
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.

You will also need to choose a name for the cluster which must be 12 characters or less and can only contain lower case letters, numbers and an underscore. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
=================================================================
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
=================================================================
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from www.kognitio.com, install them somewhere, run Kognitio console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue sql locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).

There are now lots of different ways you can set up your server and get data into it but the most common thing to do is to build memory images (typically view images) to run SQL against. This is typically a two step process involving the creation of external tables which pull external data directly into the cluster followed by the creation of view images on top of these to pull data directly from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognito is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example we can connect to a 6 column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If instead it is a directory full of csv files we can use ‘/path/to/file/test/*.csv’ instead to use them all as a single table in Kognitio.

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
'
accesskey YOUR_ACCESS_KEY
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name
';

grant connect on connector aws to grp_aws;
;

This sql loads the Kognitio Amazon plugin, creates a security group to allow access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials then create multiple connectors and grant permission on different ones to appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it. In this case you can just add them to the group grp_aws which has this.

max_connectors_per_node is needed here because the amazon connector gives out of memory errors if you try to run too many instances of it in parallel on each node.

Now an external table can be created in exactly the same way as in the HDFS example. If my amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used images are view images. With a view image the user defines a view in the normal SQL way and then they image it, which makes an in-memory snapshot of the query. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every ram processing task
  • Random — Just put the rows anywhere. This what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100M in size) work best replicated
  • For everything else hash on the primary key except
  • For the biggest images which join to non-replicated tables hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.