Hi and welcome to this quick demo of Kognitio on Hadoop. Kognitio is a mature, ultra-fast SQL engine. It was designed specifically for interactive BI on big data sets.
In this demo we will run a variety of ad-hoc queries against retail data held in Kognitio.
Here, we’ve already set up Kognitio on one of the nodes on an Amazon EMR cluster.
What you can see on screen now is Kognitio’s console for SQL submission and system admin, but you could use any tool that submits SQL to Kognitio, such as Tableau or Apache Zeppelin.
First let’s connect to Kognitio on Hadoop. We’re running Kognitio as a YARN application on an eight-node Amazon EMR cluster, giving us approximately 1.8 TB of memory and 528 cores.
Now let’s look at the schemas and see what data set we’re working with. Here our data set can be pulled from HDFS and imaged into memory.
We can see a range of tables in our data set, such as revenue data, product data, and product groups or departments.
In this demo we’re going to start by looking at the main retail transaction data.
This data contains approximately 24 billion rows and uses 858 GB, which is about 45% of the available RAM on this system.
The main retail data shows date, time, basket identifier, product number, price, store number, till number and sale week.
Let’s ask console to provide some headline information about the data.
Here you can see a prepared SQL query in a console script.
We can step through the SQL using the action buttons.
This query is now running. We’ll switch to the history tab so we can see the timing of the query.
It’ll show the number of rows, the min and max dates in the data and a count distinct to get the number of stores and tills in the data.
As well as our 24 billion rows, we can see we have three years of data from 420 stores, which have 10,098 tills between them. The query took 10 seconds.
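To make the shape of that headline query concrete, here is a minimal sketch. It runs against Python's built-in sqlite3 module rather than Kognitio, and the table name, column names and three sample rows are assumptions made for illustration, not the demo's actual schema or data.

```python
import sqlite3

# Hypothetical miniature of the retail transaction table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_date TEXT, basket_id INTEGER, product_no INTEGER, "
    "price REAL, store_no INTEGER, till_no INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2014-01-05", 1, 100, 1.50, 1, 1),
        ("2015-06-10", 2, 101, 2.25, 1, 2),
        ("2016-12-24", 3, 100, 1.50, 2, 1),
    ],
)

# Row count, date range, and distinct counts of stores and tills.
# A till is assumed to be identified by its store number plus its till number.
headline = conn.execute(
    """
    SELECT COUNT(*),
           MIN(sale_date),
           MAX(sale_date),
           COUNT(DISTINCT store_no),
           COUNT(DISTINCT store_no || '-' || till_no)
    FROM sales
    """
).fetchone()
print(headline)
```

On the toy rows this reports three rows, the earliest and latest dates, two stores and three tills.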
Now let’s look at week-to-week performance.
This query is typical of a reporting-type query that might be generated automatically by BI products such as Tableau.
This one shows a sales summary of items and value. An SQL windowing function is also used to produce a running total alongside.
The query took just two seconds to execute over the full data set.
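A weekly summary with a running total might look something like the sketch below. Again this uses Python's sqlite3 module as a stand-in (window functions need SQLite 3.25 or later), and the `sales` table with `sale_week` and `price` columns is an assumed, simplified schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_week INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 2.0), (1, 3.0), (2, 4.0), (3, 1.0), (3, 5.0)],
)

# Inner query: items and value per week.
# Outer query: a window function adds a running total of value across weeks.
weekly = conn.execute(
    """
    SELECT sale_week,
           items,
           value,
           SUM(value) OVER (ORDER BY sale_week) AS running_value
    FROM (
        SELECT sale_week, COUNT(*) AS items, SUM(price) AS value
        FROM sales
        GROUP BY sale_week
    )
    ORDER BY sale_week
    """
).fetchall()

for row in weekly:
    print(row)
```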
Obviously data is stored in multiple tables. So here we can see another reporting query using additional details about the products to show the ten best-selling product groups by value.
The query took just over 3 seconds and we can see that the frozen meals group is the most popular.
This query demonstrates Kognitio’s excellent multi-join performance.
Kognitio’s high performance is ideally suited to train-of-thought analysis, whether through direct SQL queries, as in our examples here, or through prompted or drag-and-drop reports developed in any BI tool.
To show this, let’s investigate a group of products, defined by a wildcard query on product name: salted peanuts.
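For reference, a top-ten-by-value report that joins transactions to product and product group tables could be sketched as follows. The three tables, their columns and the sample values are hypothetical, and SQLite (via Python's sqlite3 module) stands in for Kognitio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (product_no INTEGER, price REAL);
    CREATE TABLE product (product_no INTEGER, group_id INTEGER);
    CREATE TABLE product_group (group_id INTEGER, group_name TEXT);

    INSERT INTO product_group VALUES (10, 'Frozen meals'), (20, 'Snacks');
    INSERT INTO product VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO sales VALUES (1, 4.0), (2, 3.0), (3, 2.5), (3, 1.5);
    """
)

# Join transactions to products and product groups, then rank groups by value.
top_groups = conn.execute(
    """
    SELECT g.group_name, SUM(s.price) AS value
    FROM sales AS s
    JOIN product AS p ON p.product_no = s.product_no
    JOIN product_group AS g ON g.group_id = p.group_id
    GROUP BY g.group_name
    ORDER BY value DESC
    LIMIT 10
    """
).fetchall()
print(top_groups)
```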
You could write any query searching a specific subset of your data such as products in a particular campaign or promotion.
The results show there are 15 products whose name matches the search.
We can now use this query as a predicate along with the transaction data to analyse further.
First, how many baskets contained these salted peanut products in the year to date?
The query took 2 seconds.
You can see that most baskets contain just one pack of peanuts. Farmers Boy appears in the highest number of baskets.
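Using the wildcard search as a predicate on the transaction data can be sketched like this. The schema, the product names other than the search term, and the year-start date are all assumptions for illustration; the sketch runs on Python's sqlite3 module, where `LIKE` is case-insensitive for ASCII text by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (product_no INTEGER, name TEXT);
    CREATE TABLE sales (sale_date TEXT, basket_id INTEGER, product_no INTEGER);

    INSERT INTO product VALUES
        (1, 'Farmers Boy Salted Peanuts'),
        (2, 'Own Brand Salted Peanuts'),
        (3, 'Whole Milk');
    INSERT INTO sales VALUES
        ('2016-02-01', 1, 1), ('2016-02-01', 1, 3),
        ('2016-03-01', 2, 1), ('2016-03-02', 3, 2),
        ('2015-12-30', 4, 1);
    """
)

year_start = "2016-01-01"  # hypothetical start of the "year to date" window

# The wildcard product search becomes an IN (...) predicate on the sales table.
peanut_baskets = conn.execute(
    """
    SELECT p.name, COUNT(DISTINCT s.basket_id) AS baskets
    FROM sales AS s
    JOIN product AS p ON p.product_no = s.product_no
    WHERE s.sale_date >= ?
      AND s.product_no IN (
          SELECT product_no FROM product WHERE name LIKE '%salted peanuts%'
      )
    GROUP BY p.name
    ORDER BY baskets DESC
    """,
    (year_start,),
).fetchall()
print(peanut_baskets)
```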
We can now turn that around and see what other products appear within these peanut-buying baskets.
This is a common cross-purchasing analysis query that requires joining the 24 billion rows of data to itself, so it takes a bit longer to run.
Kognitio is specifically designed to handle this kind of heavy processing.
That query took Kognitio just 13 seconds.
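The self-join behind this kind of cross-purchasing analysis can be sketched as below: one copy of the transaction table picks out the peanut baskets, and a second copy lists everything else in those baskets. The table and column names and the sample rows are assumptions, and Python's sqlite3 module stands in for Kognitio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (product_no INTEGER, name TEXT);
    CREATE TABLE sales (basket_id INTEGER, product_no INTEGER);

    INSERT INTO product VALUES
        (1, 'Salted Peanuts'), (2, 'Whole Milk'), (3, 'Baked Beans');
    INSERT INTO sales VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2),
        (3, 2), (3, 3);
    """
)

# "peanut" rows identify the baskets of interest; "other" rows are the
# remaining products bought in those same baskets.
cross_purchases = conn.execute(
    """
    SELECT p.name, COUNT(DISTINCT other.basket_id) AS baskets
    FROM sales AS peanut
    JOIN sales AS other ON other.basket_id = peanut.basket_id
    JOIN product AS p ON p.product_no = other.product_no
    WHERE peanut.product_no IN (
              SELECT product_no FROM product WHERE name LIKE '%salted peanuts%')
      AND other.product_no NOT IN (
              SELECT product_no FROM product WHERE name LIKE '%salted peanuts%')
    GROUP BY p.name
    ORDER BY baskets DESC, p.name
    """
).fetchall()
print(cross_purchases)
```

Basket 3 contains no peanuts, so its milk and beans are not counted.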
Milk is the most common product within the peanut baskets. This is unsurprising as it’s a ubiquitously bought product.
This demo is based on UK retail data, so it’s no surprise to see baked beans up there too!
That’s it for now. This demo has given you a quick taster of the performance of Kognitio for processing SQL on Hadoop.
Kognitio is free to use on Hadoop, so download it and see its performance on your own data.
If you want to experiment, you could test Kognitio on an Amazon EMR cluster. Follow the blog link.
If you have any questions when trying to get started, you can post them in our forum.
For more information, visit kognitio.com.