Running Kognitio on Amazon EMR

Video Transcript

Hi there. In this video we’re going to show you how to install Kognitio’s analytical platform for Hadoop on Amazon’s Elastic MapReduce product. Amazon’s Elastic MapReduce product is a turnkey Hadoop cluster in the cloud solution so you can click a few buttons and get a working Hadoop cluster very easily. Kognitio’s product is an analytical SQL platform that sits on top of Hadoop and which can be started up and run very easily as a YARN application so it fits in very well with Elastic MapReduce.

We have a blog post that we’re going to be following in this video and the URL is kognitio.com/kognitio-on-emr so you can read the rationale behind the steps we’ll be taking there.

We’re going to go straight on to installing. First of all we need to prepare an S3 bucket so that we can install into Amazon’s environment. So I’ve made an Amazon S3 bucket which I’ve called Kodoop demo and I’ve put four files in it. The first two here have come from our website; the blog post tells you where to get them. And the second two are scripts which I have made from the information in the blog post.

So first one, Master Node Setup is this script and I’ve changed the bucket name here to say Kodoop EMR demo. And then the second script called Node Setup is this one here.
You can just paste these into an editor and upload into an s3 bucket. What I would say is if you do it from Windows these need to be in Unix format. So if you use Notepad and save the files you need to convert them from DOS to Unix before putting them into an s3 bucket otherwise they won’t work.
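
As a rough illustration of the DOS-to-Unix conversion mentioned above, here is a minimal Python sketch. It is not part of the video or the blog post; the function names and the sample script content are hypothetical, and any tool that converts CRLF line endings to LF does the same job.

```python
from pathlib import Path


def dos_to_unix(data: bytes) -> bytes:
    """Convert DOS (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite a script in place with Unix line endings."""
    script = Path(path)
    script.write_bytes(dos_to_unix(script.read_bytes()))


if __name__ == "__main__":
    # Hypothetical sample, standing in for a script saved from Notepad.
    sample = b"#!/bin/bash\r\necho hello\r\n"
    print(dos_to_unix(sample))
```

You would call convert_file on each of the two scripts before uploading them to the bucket.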

Now we’re ready to create an EMR cluster. So if we go to Services – Analytics – EMR we can create a cluster. We are going to create a new cluster with the create cluster button. We have to use the advanced options because some of the options we want aren’t in the basic ones.

We’ll use Amazon; this is the EMR version, we’ll just use the latest. We don’t want any of these pieces of software so we’ll turn them off. We do need Zookeeper.

In this box here I usually put an extra line which is from the blog post which just helps with debugging if we need to.

We’re going to add a new step in order to configure the Kognitio on Hadoop software when the EMR cluster comes up. This will mean the cluster will come up ready to create a Kognitio instance.

We are going to create a step of create Custom Jar and we will name it Master Node Setup. The Jar location is the location which is provided by Amazon. This is a special Jar which just runs the script which we’ve already uploaded to the s3 bucket.

We’ll paste in the location of the jar and then the name of the script is the argument so we will put in master setup which is master setup here. And then we’ll say continue on failure – it’s not going to fail so that’s fine. And we won’t auto terminate the cluster after the last step’s completed, obviously, because we want to set this up in an interactive way.
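
For anyone scripting this rather than clicking through the console, the step above might be described roughly as follows. This is only a sketch: the field names follow the EMR API's step configuration as I understand it, not anything shown in the video, and the jar location, bucket and script name are placeholders you would fill in from the blog post and your own bucket.

```python
import json

# Placeholders: the real jar location is the one provided by Amazon, and the
# script location is wherever you uploaded the master node setup script.
JAR_LOCATION = "<jar location provided by Amazon>"
SCRIPT_LOCATION = "s3://<your-bucket>/<master-node-setup-script>"

master_node_setup_step = {
    "Name": "Master Node Setup",
    # "Continue on failure" in the console
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": JAR_LOCATION,
        # The script to run is passed as the jar's only argument
        "Args": [SCRIPT_LOCATION],
    },
}

print(json.dumps(master_node_setup_step, indent=2))
```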

Now the next step here is basically choosing hardware that we’re going to use in our EMR cluster. EMR basically works with a master node which drives everything. It has all the name node and YARN resource management processes. Then we have a core group which has all the HDFS storage and the YARN applications can run here. We don’t need task groups, we can get rid of that.

Now in our case I’m going to request spot instances because it’s cheaper. I’m going to choose r3.8xlarge for our node types. For this example we’re going to make a cluster which has approximately 4TB of RAM for use with user data. Each of these r3.8xlarge instances is about a ¼ of a TB so we’re going to use 4 x 4 so I’m going to put 16 instances.
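
The instance count is just a division, which can be written out as a quick sketch. The figures are the approximate ones quoted in the video; the function name is made up for illustration.

```python
import math

TARGET_RAM_TB = 4.0          # RAM wanted for user data, as stated in the video
RAM_PER_INSTANCE_TB = 0.25   # "about a quarter of a TB" per r3.8xlarge


def instances_needed(target_tb: float, per_instance_tb: float) -> int:
    """Round up so the cluster never falls short of the target."""
    return math.ceil(target_tb / per_instance_tb)


core_instances = instances_needed(TARGET_RAM_TB, RAM_PER_INSTANCE_TB)
print(core_instances)  # 16
```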

I’m going to put a spot price. We can see a spot price is at 0.5 so I’m just going to put 0.7 to be on the safe side.
This does work in VPC or EC2 classic. Obviously if you choose VPC you get extra instance types that we can’t see for EC2 classic, but for the moment we’re just going to keep it simple. If you’ve used Amazon before and you know what Amazon VPC is then you’ll know what to put there anyway.
So now we’ll go Next.

We can name the cluster. And then we’ll just leave all these settings alone. I’ll put tags. Now we’re going to need to name the instances so that the instances that are spun up will show up with their names in the EC2 instance management screens. I’m just going to put Purpose – Demo Video so we can see later what that is.

Bootstrap actions is the next thing we need to do. The bootstrap action configures each node to run Kognitio software. All this requires is the installation of some extra RPMs. So we’re just going to run a custom action, we’ll configure it, we’ll name it Node Setup and we will put the name of the other script which we uploaded to S3 as a script location. Now this script – you can read what it does in the blog, it mostly just installs some RPMs.
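
If you create clusters from a script, the bootstrap action might look something like this sketch. The field names follow the EMR API's bootstrap action configuration as I understand it; they are not shown in the video, and the script location is a placeholder for your own bucket and file name.

```python
import json

# Placeholder: wherever you uploaded the node setup script.
NODE_SETUP_SCRIPT = "s3://<your-bucket>/<node-setup-script>"

node_setup_bootstrap_action = {
    "Name": "Node Setup",
    "ScriptBootstrapAction": {
        # Runs on every node as it is provisioned
        "Path": NODE_SETUP_SCRIPT,
        "Args": [],
    },
}

print(json.dumps(node_setup_bootstrap_action, indent=2))
```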

Now we can click Next.

And we get to the last page. The EC2 key pair is basically a pair of keys which allows us to access the nodes when they come up. I’ve created one called EMR demo and the private key has been downloaded onto this machine. If you don’t already have a key pair, you can go into the EC2 management console and go into this Key Pairs option here and create one. These roles generally can be left alone. They get auto-created when you create the EMR cluster so if you’ve already built one these will be there, otherwise it’ll say something like Create Default for these.

Encryption options we don’t need to set. Although you can if you want.

And the same here with the access groups so the EC2 security groups basically define which ports will be open on the Elastic MapReduce nodes that come up.

OK so by default the EMR nodes can talk to each other but they don’t have any external access and we need to open some ports up so we can SSH in to configure the Kognitio instance and so we can actually access that Kognitio instance.

So I suggest leaving these groups alone because these are autocreated and we don’t want to mess with them, and just add in an extra security group. I’ve precreated one here. I’m going to show you it.

This security group, I’ve essentially just added this IP address which is the computer I’m using. I’ve said Port 22 which is SSH and Port 6550 which is Kognitio’s ODBC port, will be open on the nodes to this computer so I can access my cluster. I’m just going to choose that and assign it to the master node. I can also assign it to the core nodes. You don’t have to but it just means I can SSH into the core nodes if I want to. If you don’t do that you just SSH into the master node and hop from there.
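
To make the rules concrete, here is a small sketch that builds the two inbound rules for a single client address. The rule layout mirrors the EC2 API's IP permission structure as I understand it, the function name is hypothetical, and 203.0.113.10 is a documentation address standing in for your own computer's IP.

```python
import ipaddress
from typing import Dict, List

SSH_PORT = 22
KOGNITIO_ODBC_PORT = 6550


def ingress_rules(client_ip: str, ports: List[int]) -> List[Dict]:
    """Build one TCP rule per port, open only to a single IPv4 address."""
    # A bare IPv4 address becomes a /32 network, i.e. exactly one host.
    cidr = str(ipaddress.IPv4Network(client_ip))
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": cidr}],
        }
        for port in ports
    ]


# Hypothetical client address; substitute the IP of the machine you connect from.
rules = ingress_rules("203.0.113.10", [SSH_PORT, KOGNITIO_ODBC_PORT])
for rule in rules:
    print(rule)
```

Restricting the source to one address keeps both ports closed to everyone else, which is the point of adding a separate group rather than editing the auto-created ones.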

Now I’m ready. I’m going to hit create cluster and it’s going to build me an EMR cluster.

Now you can see your cluster is starting up. Anytime you click on this cluster list you’ll be able to choose the cluster and get back into this window. We can see it’s provisioning a bunch of Amazon instances. This will take about 10 minutes.

Ok so now we’re done. It’s saying the cluster is in a state of waiting and ready after the last step completed.

You can see the steps we ran here. It added that one for us. Master Node Setup is the step we defined when we were creating the cluster. Everything’s worked, so we can go into the hub and have a look. There’s the master node group with one core node group. We can see all the different instances have been spun up. So now we’re ready to create a Kognitio instance on our cluster.

To do this we are going to SSH into the master node. Essentially the EMR master node is also going to be our Kognitio master node. We call it the Edge node which we’re going to use to define and start up our instance.

Up here the master node’s name has been defined. We’re going to SSH into it. Amazon does give you some handy instructions for how to SSH into it. Basically for a Mac it just says open the terminal and run SSH and that’s what we’re going to do.

So we will copy the name. And we’re going to open up a terminal and we’re going to SSH using the private key file that goes with the key pair we used when defining the cluster.

That is in Desktop/EMRDemo.pem and we’re going to SSH to a user called kodoop which our master node setup script that we defined has created. And then we’re going to put the node name in, and away we go.
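
Put together, the SSH invocation has the shape sketched below. The key path and the kodoop user come from the video; the host name is a hypothetical stand-in for the master node name you copy from the EMR console.

```python
import shlex
from typing import List


def ssh_command(key_path: str, user: str, host: str) -> List[str]:
    """Build an ssh command that authenticates with a private key file."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]


# "master-node.example.com" is a placeholder, not a real EMR host name.
cmd = ssh_command("Desktop/EMRDemo.pem", "kodoop", "master-node.example.com")
print(shlex.join(cmd))
```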

And now we’re in.

We’re on our EMR cluster, so we can list the nodes in the cluster with yarn node -list. And that’s going to give us a list of the names of all the nodes we’ve got.

We’re going to take one of the nodes and we’re going to look at it and what we can do is yarn node -status – these commands are in the blog if you want more detail. And we’re going to paste that in.

And this is going to tell us how much resource each node has available. That’s the amount of memory it has available. This is the number of VCOREs that the node has available.

We’re going to create a Kognitio instance that uses the whole EMR cluster. You don’t have to do this but we’re going to. First thing we’re going to do is we’re just going to run kodoop once, without the arguments. And we’re going to accept the EULA and say Yes. And that gives us a bunch of useful links that you can follow later. In particular if you follow this first link, it’s a forum thread that contains links to allow you to download the client tools which you can use to access the instance from other nodes. We’ve already installed the client tools on this node so I’m not going to show you that.

So now I need to create a cluster. I’m going to use this memory capacity – I need to leave 2GB of memory and one VCORE for container management purposes. I’m just going to use CONTAINER_MEMSIZE=239000. Then I’m going to say CONTAINER_VCORES=63. I’ve got 16 nodes so I’m going to do CONTAINER_COUNT=16. [NOTE: Correction to video visual – CONTAINER_COUNT=16, not 63 as shown on video.]

And I’m going to create a cluster. Now it is possible to automate this stuff I’m doing manually. In the same way that we automated the master node setup, you could write a script which would basically look at all these nodes and find out this information and automatically set this up.
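
A script along those lines might do the sizing arithmetic like this. It is only a sketch under assumptions: the per-node capacity below is a hypothetical figure of the kind yarn node -status reports, the memory setting is assumed to be in MB, and the function name is invented. Only the 2GB and one-VCORE reservations and the three setting names come from the video, which uses a rounded-down memory value rather than the exact result.

```python
RESERVED_MEMORY_MB = 2 * 1024  # 2GB left for container management
RESERVED_VCORES = 1            # one VCORE left for container management


def container_settings(node_memory_mb: int, node_vcores: int, node_count: int) -> dict:
    """Size one container per node, using nearly all of each node."""
    return {
        "CONTAINER_MEMSIZE": node_memory_mb - RESERVED_MEMORY_MB,
        "CONTAINER_VCORES": node_vcores - RESERVED_VCORES,
        "CONTAINER_COUNT": node_count,
    }


# Hypothetical per-node capacity (assumed MB and VCOREs), 16 core nodes.
settings = container_settings(node_memory_mb=241664, node_vcores=64, node_count=16)
for name, value in settings.items():
    print(f"{name}={value}")

# Total memory across all containers, in the same assumed MB units.
total_mb = settings["CONTAINER_MEMSIZE"] * settings["CONTAINER_COUNT"]
print(total_mb)
```

Rounding the memory figure down a little, as the video does, leaves some extra headroom.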

I’m going to do it this way. kodoop create_cluster and I’m going to call it emr.

And there we go. It’s going to look and prepare this. And it’s going to tell me what it’s going to do. It’s going to tell me it’s going to make 16 containers, each one is about ¼ of a TB, using nearly all the VCOREs. The external gateway port is the one we’ve allowed through so that’s all good.

So now I’m going to hit enter and it’s going to go and build a cluster.

And my Kognitio instance will be available when that finishes.

And now it’s finished. So our Kognitio instance is now running and we can start an SQL session and interact with it.
The quickest way to start an SQL session is to do kodoop sql emr and the password (this is logging in as the SYS user) is emr which is the same as the connection name.

We’ll just run a quick query to see if all the nodes are there so we’ll just do

select os_node_name, mem_total from ipe_nodeinfo;

Here we go. And there we are. So we’ve got all the nodes in the Amazon EMR core group and the memory size for each node. We’ve got a lot of memory, so we can sum that.

select sum(mem_total) from ipe_nodeinfo;

and that should be about 4TB. There we go.

Now really we are going to want to connect from other tools and run SQL.

Very quickly I’ve got Kognitio console (part of the Kognitio client tools bundle) – I’m going to connect. I haven’t bothered to define a data source. I am going to use the IP address – I’m going to use the node’s name which came from here.
Password is emr.

And we can connect.

And there we are. That’s connected. So if we click on the system and double click, we can see a brief summary. That’s telling us the same information. It’s changed the RAM GB because this is showing us usable memory for actual data whereas before we were looking at the actual memory the instance has. Some of the memory is used for compiling and various other things.

So now we’re just going to run a query on this just to show that it works.

New SQL query. I’ll put the same query I ran earlier.

select * from ipe_nodeinfo and then Execute.

Here we go. Now we can see the various instance names, we’ve even got the Hadoop container names that are running. And now our instance is ready to do some actual real work. The next step would be to create some memory images with data in them and run some SQL on it.

The next video in the series will show you some things you can do with a Kognitio cluster.