A Review of Strata Data Conference, London 2017


Posted By : Mark Chopping 0 Comment
Tags :  

The Strata Data Conference was held at the ExCeL in London this week. Sessions that were of interest to me included:

What Kaggle has learned from almost a million data scientists (Anthony Goldbloom, Kaggle)

This was part of the keynote on the first day. Kaggle is a platform for data science competitions, and have had almost 1 million users participate. Over 4 million models have been submitted to competitions, and this presentation covered off the traits Kaggle have seen for successful entrants.

In particular, for structured data the trend is for competitors to initially explore the data via histograms etc. to get a better understanding of it, then create and select features for use in their approach, which typically involves a classifier. The work on features is more important than the choice of classifier, and successful competitors are usually very creating in choosing features (e.g. car colour type for predicting car resale value), and persistent as most intuitions around feature selection/creation prove to have little correlation with the end goal. Finally, the best competitors tend to use version control for their models, to make it easier to track the success/failure of each approach.

Creating real-time, data-centric application with Impala and Kudu (Marcel Kornacker, Cloudera)

The room was packed for this session as Marcel gave some background on Kudu (a relational store that can be used as an alternative to HDFS) and Impala. He explained that Kudu avoided the limitations on delete, update and streaming inserts  seen with HDFS, and the poor full table scan performance of HBase. As Kudu does not use HDFS for persistence, it handles its own replication, although this means it can’t e.g. benefit from the reduced storage overhead planned for HDFS in Hadoop 3. One workaround would be to only keep e.g. the latest 12 months worth of data in Kudu, and push older data into HDFS to benefit from its reduced storage overhead when that is available.

Kudu has been tested up to 275 nodes in a 3PB cluster, and internally uses columnar format when writing to disk, having collected records in RAM prior to this transposition. It allows range and hash partitioning to be combined. For example, you could use range partitioning by date, but then hash within a date to keep some level of parallelism when dealing with data for one particular date. Currently it only supports single-row transactions but the roadmap includes support for multi-row. From the examples given it appears there are some local predicates it cannot handle (e.g. LIKE with a regular expression), and batch inserts are reportedly slow. Multi-versioning is used as with Kognitio and many other products.

Impala can use all three of those storage options (Kudu, HDFS, HBase), has over 1 million downloads, and is reportedly in use at 60% of Cloudera’s customers.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop (Marcel Kornacker, Cloudera) 
Marcel started by going through some performance benchmarks involving Impala, and highlighted the importance of moving beyond single user benchmarks.
He then moved onto some techniques for improving Impala performance including:
Partitioning: partitions can be eliminated by join lookup to generate run-time filters (what Kognitio used to call spider joins) – so if joining fact and date tables on a date key and having a local predicate on the data table, then that predicate can be used to generate a list of relevant date keys, and that filter can be applied to the fact table before the join. This appeared to be the biggest factor in Impala’s TPC-DS performance improvements in early 2016. Marcel advised regularly compacting tables to keep file and partition sizes as small as possible, and gave general advice to stick with less than 20,000 partitions (too few and you don’t eliminate enough data with filters, too many and you lose parallelism and put extra load on name node etc.). As in the example above, partition on join keys to get benefit from run-time filters.
Sorting: this will be added in the next release. It is particularly useful as Parquet can store stats on e.g. min and max values within a page, so sorting can help eliminate some of those pages when there are too many column values for partitioning.
Use appropriate data types: some operations are a lot more expensive with different data types (e.g. strings), so try to avoid using these expensive data types. Kognitio used to offer similar advice to customers before modifying their product to make operations like string aggregation as efficient as integer aggregation.
Complex schemas: parent-child relationships with nested collections offer physical colocation of data, giving a natural optimization. Need to use columnar storage for this approach as resulting tables are wide.
Statistics: it takes a long time to collect these, so customers often ask if they can get the same effect with using e.g. optimiser hints to determine the order of joins. That is not the case, as statistics are used for far more than determining join ordering – e.g. scan predicates are order by selectivity and cost, the selectivity of scans is computed, join sides need to be determined for efficiency, join type needs to be decided, a decision on whether to apply run-time filters needs to be made (as presumably the cost of generating and applying these can be significant). The ability to collect statistics on samples is being added, which would speed up the stats collections.
A deep dive into Spark SQL’s Catalyst optimizer (Herman van Hovell tot Westerflier, Databricks)
In an early slide entitled “Why structure” Herman showed the performance benefit of using SQL for a simple aggregation tasks rather than working directly on RDDs with code. He then outlined the approach for query optimization used by Catalyst, from ensuring the query was valid syntactically and semantically, generating a logical plan, optimizing that plan, then generating physical plans which have their cost compared until a final physical plan is selected.
He discussed the use of partial functions to specify transformations of plans (e.g. constant folding), and then showed how it was possible to write your own planner rules to be incorporated into optimization.
It wasn’t clear to me how the optimizer deals with having a vast number of potential rules to apply, with some being inapplicable at one stage in optimization, but then being valid later on in the process after other rules have been applied.
 Journey to AWS: Straddling two worlds (Calum Murray, Intuit)
A very interesting talk on Intuit’s motivation for and subsequent execution of a migration from on-premise software to the cloud (Amazon, in their case).
The  key takeaways were:
  • use a tip-toe approach rather than big-bang. Be in a position where you can flip back and forth from cloud for an application, then swap to cloud initially for a small number of minutes, gradually increasing to 24 hours or more.
  • swim-lane applications first, if possible, to allow for this approach (so you are only exposing a small subset of users to the change initially).
  • consider security implications – in their case, with over 10 million customers, they had to put extra encryption in place, use different encryption keys for on-premise and cloud, etc.

You can find speaker slides and videos from the conference at https://conferences.oreilly.com/strata/strata-eu/public/schedule/proceedings

Exhibition Hall

Conversations at the Kognitio stand in the exhibition hall reflected an increasing interest in how to address SQL on Hadoop problems with company’s installed Hadoop systems. This was a marked step forward from last year’s conference where a lot of the conversations were with companies that were just starting out with Hadoop, and hadn’t done enough with Hive/Impala/Spark SQL to have encountered significant problems that they needed to solve.

Hadoop… Let’s not throw the baby out with the bath water again!


Posted By : Roger Gaskell 0 Comment
Categories :#AnalyticsNews, Blog
Tags :  ,,

Here we go again! Suddenly the industry seems to have turned on Hadoop. Headlines saying “it’s hit the wall” and “it’s failed” have recently appeared and some are suggesting that organisations look at alternative solutions. Granted, Hadoop has its limitations and has not lived up to the massive hype that surrounded it a year or two ago, but then nothing ever does.

I admit I was not a fan of Hadoop when it first appeared; it seemed like a step backwards. It was very complicated to install, unreliable, and difficult to use, but still it caught the industry’s imagination. Engineers liked it because it was “proper engineering” not a shrink wrapped productionised product, and the business was seduced by the idea of free software. Pretty quickly it became an unstoppable runaway train and the answer “to life the universe and everything” was no longer 42 but Hadoop.

Great expectations generally lead to disappointment and this is Hadoop’s problem. We hyped it up to such an extent that it was always going to be impossible for it to live up to the expectations, no matter how much it improved, and it has, immeasurably! Hadoop is following the Gartner Hype Cycle (one of the cleverest and most accurate representations of how the perception of technology evolves) perfectly. It’s just for Hadoop the curve is enormous!

So what do I mean by let’s not throw out the baby with the bathwater again? In Hadoop’s early days the hot topic was NoSQL. The message was SQL was dead. The problem with SQL was that it was difficult to write the complicated mathematical algorithms required for Advanced Analytics and, as the name suggests, it relies on the data having structure. Advanced analytical algorithms are easier to implement, and unstructured data easier to handle, in languages such as “R” and Python. All perfectly true, but advanced analytics is just the tip of the data analytics triangle and the rest of the space is very well served by traditional SQL. Traditional BI reporting and self-service data visualization tools are still massively in demand and generally use SQL to access data. Even unstructured data is usually processed to give it structure before it is analysed. So when the NoSQL bandwagon claimed SQL was dead, they were effectively throwing out the most widespread and convenient method of business users getting access to Hadoop based data, in favor of something that only developers and data scientists could use.

Of course sense eventually prevailed, NoSQL morphed into Not-Only SQL, and now everyone and his brother is now implementing SQL on Hadoop solutions. The delay has been costly though and the perceived lack of fully functional, high performance SQL support is one of the key reasons why Hadoop is currently under pressure. I say perceived because there are already very good SQL on Hadoop solutions out there if people are willing to look outside the Apache box, but this is not a marketing piece so I will say no more on that subject. My point is that the IT industry has a history of using small weaknesses to suddenly turn on otherwise very useful technologies. There will always be those whose interests are best served by telling the industry that something is broken and we need to throw it away and start again. The IT industry’s problem is that it is often too easily led astray by these minority groups.

Hadoop has come a long way in a short time and although it has problems there is a large community of people working to fix them. Some point to the lack of new Apache Hadoop projects as a sign of Hadoop’s demise; I would argue that this is a positive thing with the community now focused on making and finding stuff that works properly rather constantly focusing on the shiniest, new cool project! I think that Hadoop is finally maturing.

This post first appeared on LinkedIn on April 18, 2017.

Hadoop’s biggest problem, and how to fix it


Posted By : Mark Chopping Comments are off
Tags :  ,


Hadoop was seen as a silver bullet for many companies, but recently there has been an increase in critical headlines like:

  1. Hadoop Has Failed Us, Tech Experts Say
  2. You’re doing Hadoop and Spark wrong, and they will probably fail
  3. Has Hadoop Failed? That’s the Wrong Question

The problem

Dig behind the headlines, and a major issue is the inability for users to query data in Hadoop in the manner they are used to with commercial database products.

From the Datanami article:

  • Hadoop’s strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it’s ill-suited for running interactive, user-facing applications
  • It’s better than a data warehouse in that have all the raw data there, but it’s a lot worse in that it’s so slow
  • “At the Hive layer, it’s kind of OK. But people think they’re going to use Hadoop for data warehouse…are pretty surprised that this hot new technology is 10x slower that what they’re using before,” Johnson says. “[Kudo, Impala, and Presto] are much better than Hive. But they are still pretty far behind where people would like them to be.”

The Register article based on a Gartner research talk recognises Hadoop’s strength for ETL processing, but highlights the issues with SQL-handling on Hadoop.

The Podium Data article states “Hadoop is terrible as a relational database”, and “Hadoop failed only in the sense that inflated expectations could never be met compared to mature commercial offerings.”

“The Growing Need for SQL for Hadoop” talks about the need for SQL for Hadoop. The ideal is to be “on Hadoop”, and thus processing data within the Hadoop cluster, rather than “off Hadoop” where data has to be extracted from Hadoop for processing.

Similarly, Rick van der Lans talks about “What Do You Mean, SQL Can’t Do Big Data?”, emphasising the need for SQL solutions when working with big data platforms.

RCA of the problem

There can be many reasons for current SQL-on-hadoop products not being performant.

Possibilities include:

  • overhead of starting and stopping processes for interactive workloads – to run relatively simple queries quickly, you need to reduce latency. If you have a lot of overhead for starting and stopping containers to run tasks, that is a big impediment to interactive usage, even if the actual processing is very efficient
  • product immaturity – a lot of commercial databases have built on the shoulders of giants. For example, this wiki lists a set of products that derive from PostgreSQL, including Greenplum, Netezza, ParAccel, Redshift, Vertica. This gives these products a great start in avoiding a lot of mistakes made in the past, particularly in areas such as SQL optimisation. In contrast, most of the SQL-on-hadoop products are built from scratch, and so developers have to learn and solve problems that were long-since addressed in commercial database products. That is why we see great projects like Presto only starting to add a cost-based optimiser now, and Impala not being able to handle a significant number of TPC-DS queries (which is why Impala TPC-DS benchmarks tend to show less than 80 queries, rather than the full 99 from the query set).
  • evolution from batch processing – if a product like Hive starts off based on Map-Reduce, its developers won’t start working on incremental improvements to latency, as they won’t have any effect. Similarly, if Hive is then adopted for a lot of batch processing, there is less incentive to work on reducing latency. Hive 2 with LLAP project aims to improve matters in this area, but in benchmarks such as this AtScale one reported by Datanami it still lags behind Impala and SparkSQL.


Whilst benchmarks show that SQL on Hadoop solutions like Hive, Impala and SparkSQL are all continually improving, they still cannot provide the performance that business users need.

Kognitio have an SQL engine originally developed for standalone clusters of commodity servers, and used by a host of enterprise companies. Due to this heritage, the software has a proven history of working effectively with tools like Tableau and MicroStrategy, and delivering leading SQL performance with concurrent query workloads – just the sort of problems that people are currently trying to address with data in Hadoop. The Kognitio SQL engine has been migrated to Hadoop, and could be the solution a lot of users of Hive, Impala and SparkSQL need today.

It has the following attributes:

  • free to use with no limits on scalability, functionality, or duration of use
  • mature in terms of query optimisation and functionality
  • performant, particularly with concurrent SQL query workloads
  • can be used both on-premise and in the cloud

For further information about Kognitio On Hadoop, try:


This post first appeared on LinkedIn on March 23, 2017.

Strata + Hadoop World – San Jose


Posted By : Sharon Kirkham Comments are off
Categories :#AnalyticsNews, Blog

The Kognitio team had a great trip to Strata + Hadoop World in San Jose last week and we would like to say a big thank you to everyone who stopped by for a chat about getting enterprise level performance for their SQL on Hadoop. We look forwarding to hearing from you when you try out Kognitio on Hadoop.

At the start of the conference we released our benchmarking whitepaper in which Kognitio outperformed Impala and Spark in a TPC-DS benchmarking exercise. This proved to be of great interest and kept us all really busy on the stand. Conversations ranged from people who have been using Hadoop a while and are having problems serving data to their end-user applications such as Tableau and Qliksense right through to those that are just starting out on their Hadoop journey and wanted to understand what Kognitio can bring to their solution stack.

The subject matter of the conference sessions indicates that there is a period of consolidation going on within the Apache® Hadoop® solution stack. Most topics were discussing how to get the most from more established projects and the challenges of enterprise adoption. There was very little new research presented which was a bit disappointing.


Marcel Kornacker and Mostafa Mokhtar from Cloudera presented a talk on optimising Impala performance that was really interesting. They had also been using the TPC-DS query set for benchmarking but obviously had to use a cut down version of the query set (75 out of 99 queries). The optimisation details will be useful for us to follow for Impala when we do the next round of benchmarking after Kognitio 8.2 is released in April. Their benchmarks were at the 1 TB and 10TB scale. Increasing scale to 10TB and concurrency above 10 streams is something that we would definitely like to do during the next set of benchmarks.

From a maths perspective it was great to see Bayesian inference in the data science mix. Michael Lee Williams from Fast Forward Labs presented a great overview. I will certainly be checking out some of algorithms and tools with a view to parallelising them within Kognitio’s external scripting framework.

Data streaming also continues to be at the forefront of the conference . It was clear from the number of sessions in the conference that more companies (such as Capital One) have experiences they want to share as well as plenty of contributions from established technology leaders such as Confluent. It is certainly something that we are thinking about here.

If you didn’t make it to our booth at San Jose we hope to see you at one of these upcoming events:

DWS17, Munich, Sponsor, Big Data

We’ll be on Booth #1003.

See us at the next Strata Data Conference in London

23-25 May 2017

Booth #511.


Using Kognitio on Amazon Elastic Map/Reduce


Posted By : Andy MacLean Comments are off
Kognitio on Amazon EMR

Using Kognitio on Amazon Elastic Map Reduce

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.

You can also follow this instructional video:

Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes – the ‘master node’, which runs Hadoop specific cluster master programs like the HDFS namenode and Yarn resource manager, the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it, it gets transfered to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node so that’s how this example will work. There are other possible choices here – you can just use one of the cluster nodes, you can spin up a specific task group node to run it or you can just have an arbitrary EC2 node with the right security settings and client software installed. However the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Setup authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop_extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:


#change the s3 bucket for your site

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

grep -v '<\/configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
cp  kodoop/slider/conf/slider-client.xml  kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml  kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (You’ll want to change that to be your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This setting enables message compression, which causes messages to get compressed (with the LZ4 compression algorithm) before being sent over a network. This setting is not normally used but we recommend it for Amazon because the network:cpu speed ratio is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have this run automatically as part of the cluster startup. You can do this by transfering the script to S3 and putting it together in a directory with the tarballs (and editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor, select the release you want (we’ve tested it with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper to be selected from the list of packages, although adding others which you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:


This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can do s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to s3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will now run automatically on the masternode after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node so this can be any instance type you like, the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use the largest of whatever node type instead of a larger number of smaller instances (so 3x r3.8xlarge instead of 6x r3.4xlarge for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want to use more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is to have the total RAM of the nodes which will be used for the Kognitio cluster be between 1.5x and 2x the size of the raw data you want to load as memory images.

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your node. This is required because the AMI used by EMR needs to have a small amount of configuration done and some extra Linux packages installed in order for it to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define this as a ‘custom action’ boostrap action. The following script does everything you need:


sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide us with a simple way to do this. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses this. Amazon runs the bootstrap action before the relevant repositories have been configured on the node so the RPM can’t be installed via yum. By the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.

Once this is working you can create a Kognitio cluster. You will create a number of Yarn containers with a size you specify. You will need to choose a container size, container vcore count and a number of containers that you want to use for the cluster. Normally you’ll want to use a single container per node which uses nearly all of the memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
ip-172-40-0-91.eu-west-1.compute.internal:8041          RUNNING ip-172-40-0-91.eu-west-1.compute.internal:8042                             1
ip-172-40-0-126.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-126.eu-west-1.compute.internal:8042                            2
ip-172-40-0-216.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-216.eu-west-1.compute.internal:8042                            1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status ip-172-40-0-91.eu-west-1.compute.internal:8041
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/
Node Report : 
        Node-Id : ip-172-40-0-91.eu-west-1.compute.internal:8041
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address : ip-172-40-0-91.eu-west-1.compute.internal:8042
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.

You will also need to choose a name for the cluster which must be 12 characters or less and can only contain lower case letters, numbers and an underscore. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from www.kognitio.com, install them somewhere, run Kognitio console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue sql locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).

There are now lots of different ways you can set up your server and get data into it but the most common thing to do is to build memory images (typically view images) to run SQL against. This is typically a two step process involving the creation of external tables which pull external data directly into the cluster followed by the creation of view images on top of these to pull data directly from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognito is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example we can connect to a 6 column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If instead it is a directory full of csv files we can use ‘/path/to/file/test/*.csv’ instead to use them all as a single table in Kognitio.

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name

grant connect on connector aws to grp_aws;

This sql loads the Kognitio Amazon plugin, creates a security group to allow access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials then create multiple connectors and grant permission on different ones to appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it. In this case you can just add them to the group grp_aws which has this.

max_connectors_per_node is needed here because the amazon connector gives out of memory errors if you try to run too many instances of it in parallel on each node.

Now an external table can be created in exactly the same way as in the HDFS example. If my amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used images are view images. With a view image the user defines a view in the normal SQL way and then they image it, which makes an in-memory snapshot of the query. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every ram processing task
  • Random — Just put the rows anywhere. This what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100M in size) work best replicated
  • For everything else hash on the primary key except
  • For the biggest images which join to non-replicated tables hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.

How HelloFresh embraced Hadoop


Posted By : admin Comments are off
how HelloFresh embraced Hadoop
Categories :#AnalyticsNews

As businesses grow, it becomes more critical for them to have a solution that will effectively handle the increasing amounts of data they generate. However, one problem that many organisations find when they are expanding is that tools that were adequate when they were developed are not able to scale along with the company.

This was the problem facing Berlin-based home meal delivery firm HelloFresh. The five-year-old firm has expanded rapidly and now delivers more than 7.5 million meals a month to 800,000 subscribers in multiple countries. Therefore, it found itself quickly outgrowing the custom-made business intelligence system it had long relied on, and needed a new solution.

In a recent interview with InformationWeek, chief technology officer at the company Nuno Simaria explained how the company had been using a home-built business intelligence system based around PHP, using a mix of a relational database and key value storage for pre-calculated data. However, as the business grew, the limitations of this became clear.

One problem was it did not offer the flexibility or detail analysts needed. While it could track essential KPIs to provide details of what was happening within the business, it was unable to offer insight into the reasons behind any changes.

"It was definitely not a good idea, but at the time it was the technology we were most comfortable with," Mr Simaria said.

The system was also approaching the limits of its capacity, so it became obvious a change was required. The company looked at several options that would offer improved big data analytics performance, including MemSQL and SAP HANA, but ultimately, it was Apache Hadoop that won out.

Part of the reason for this was its low cost compared with competitors. Because the tools can offer high performance even on inexpensive commodity hardware, there was no need for HelloFresh to upgrade these areas. This made Hadoop a highly attractive option, even though the company's team did not have much familiarity with the technology.

This led to its own challenges. Mr Simaria explained that finding skilled engineers in the market was very difficult. Therefore, the firm's approach was to give two of its existing staff the time and resources they needed to learn about the tools.

"We'll give you the budget, and we'll give you the time," he said. "This is something we've done with other technologies as well. If it is not easy for us to access talent in the market in the short term, we will empower our developers and our engineers who are interested in problem solving, and we will let them discover the complexities of that technology."

At the end of this process, the engineers had to answer three questions: is Hadoop the right technology; how can the firm migrate existing resources to it; and what distribution should be used moving forward?

The result of the Hadoop deployment is that HelloFresh now has much faster insight into goings-on within the businesses, and is also able to delve much deeper into its data in order to uncover insight.

Mr Simaria said: "This technology has allowed us to spread data-driven decision-making to anyone in the organisation, from local teams to global finance to whoever needs to use data insights to make decisions."

How Tesco is diving into the data lake


Posted By : admin Comments are off
tesco data lake, big data, forecasting
Categories :#AnalyticsNews

An effective big data analytics solution is now an essential requirement for any large business that wishes to be successful in today's competitive environment, regardless of what sector they are in.

However, one part of the economy that particularly stands to benefit from this technology is retail. These firms have a longstanding tradition of gathering and utilising customer data, so the ability to gain greater insight from the information they already have will play a key role in their decision-making.

One company that has always been at the forefront of this is UK supermarket Tesco. It was noted by Forbes that the company was one of the first brands to track customer activity through the use of its loyalty cards, which allows it to perform activities such as delivering personalised offers.

Now, however, it is turning to technologies such as real-time analytics and the Internet of Things in order to keep up with newer competitors such as Amazon, which is moving into the grocery business.

Vidya Laxman, head of global warehouse and analytics at the supermarket, told the publication: "We are focused on data now and realise that to get where we want to be in five years' time, we have to find out what we will need now and create the right infrastructure."

She added that Tesco is focusing on technologies such as Hadoop, which is central to the 'data lake' model that the company is working towards. This will be a centralised, cloud based repository for all of the company's data, designed to be accessible and useable by any part of the organisation whenever it is needed. 

Ms Laxman explained one challenge for the company has been ensuring that the right data gets to where it needs to go, as different departments often need different information. For example, finance teams need details on sales and forecasts, while the customer side of the business needs data that can be used to inform marketing campaigns.

"We have data scientists in all of our organisations who need access to the data," she said. "That's where Hadoop comes into the picture. We've just started on this journey – we've had data warehousing for some time so there are some legacy systems present and we want to leverage what’s good and see where we can convert to using new strategies."

A key priority for Tesco's activities will be to increase the speed of data processing in order to better support activities such as real-time modelling and forecasting.

Under a traditional way of working, it may take nine or ten months just to ingest the relevant data. Therefore, improving these processes will be essential to the success of big data initiatives.

Another factor helping Tesco is an increasing reliance on open source solutions. Mike Moss, head of forecasting and analytics at Tesco, told Forbes that when he began developing his first forecasting system for the company eight years ago, any use of open source required a lengthy approval process to get it signed off.

"There wasn't the trust there in the software," he said. "It now feels like we're in a very different place than previously … Now we have freedom and all the engineers can use what they need to use, as long as it's reasonable and it makes sense."

IoT and cloud ‘the future of Hadoop’


Posted By : admin Comments are off
Iot, cloud storage, hadoop, big data
Categories :#AnalyticsNews

The creator of Hadoop, Doug Cutting, has said that cloud computing and Internet of Things (IoT) applications will be the basis for the next phase of growth for the platform.

So far, most deployments of the big data analytics tool have been in large organisations in sectors such as finance, telecommunications and internet sectors, but this is changing as more use cases emerge for the technology.

Much of this is down to the growing use of digitally-connected sensors in almost all industries, which are generating huge amounts of data that businesses will need to quickly interpret if they are to make the most of the information available to them.

Mr Cutting highlighted several major companies that have already adopted HAdoop to help them handle this huge influx of sensor data.

"Caterpillar collects data from all of its machines," he said. "Tesla is able to gather more information than anyone else in the self-driving business, they're collecting information on actual road conditions, because they have cars sending all the data back. And Airbus is loading all their sensor data from planes into Hadoop, to understand and optimise their processes."

One sector that is on the verge of a revolution in how it manages information is the automotive industry, as a growing number of cars are being equipped with IoT sensors and networking capabilities.

Mr Cutting noted that almost every new car now sold has a cellular modem installed, while almost half of new cellular devices are not phones, but other connected items.

Until now, Hadoop has often been deployed as a key component of a 'data lake', where businesses pool all their incoming data into a single, centralised resource they can dip into in order to perform analytics. However, use cases for IoT typically have a need for data to be exchanged rapidly between end-devices and the central repository.

Therefore, there has been a focus recently on the development of new tools to facilitate this faster exchange of information, such as Flume and Kafka.

Mr Cutting particularly highlighted Apache Kudu as having a key role to play in this. He said: "What Kudu lets you do is update things in real-time. It's possible to do these things using HDFS but it's much more convenient to use Kudu if you're trying to model the current state of the world."

He also noted that while the majority of Hadoop applications are currently on-premises, cloud deployments are growing twice as fast, so it will be vital that providers can deliver ways to embrace this technology in their offerings.

"We are spending a lot of time on making our offerings work well in the cloud," Mr Cutting continued. "We're trying to provide really powerful high-level tools to make the lives of those delivering this tech a lot easier."

Salaries on the rise for big data professionals


Posted By : admin Comments are off
Big data skills are in demand  Image: iStockphoto/cifotart
Categories :#AnalyticsNews

IT professionals specialising in big data are benefiting from growth in pay as employers show more demand for their skills, research has revealed.

In its latest Tech Cities Job Watch report, IT resourcing firm Experis revealed that average salaries for people with big data expertise have risen by almost eight per cent in a year.

That's nearly three per cent higher than the Bank of England's projected three per cent pay increase for the whole of Britain.

Experis' research is based on over 60,500 jobs advertised across five key tech disciplines: big data, cloud, IT security, mobile and web development.

The latest figures showed 5,148 big data jobs available in the first quarter of 2016, 87 per cent of which were based in London.

One of the key factors in the recent growth in this sector is the rising importance of personal data for businesses that want to improve their customer understanding and predict forthcoming trends.

Many companies are also bringing big data and compliance skills in-house to ensure they stay in line with new EU data protection regulations.

Geoff Smith, managing director at Experis, said big data will continue to be a "major driver" of UK economic growth as the digital revolution gathers pace.

"Yet, many companies have been slow to react and there's a limited talent pool to choose from," he added.

"Employers are willing to pay highly competitive salaries to attract these experts, so they can help with compliance, uncover valuable customer insights that can transform their business and innovate for the future."

Big data and the Internet of Things are set to add £322 billion to the UK economy within the next four years, according to a recent report from the Centre for Economics and Business Research and software provider SAS.

Telcos turning to Hadoop to help counter fraud


Posted By : admin Comments are off
Telcos, Hadoop, counter fraud, big data
Categories :#AnalyticsNews

Many businesses in the telecommunications sector are turning to Hadoop-based big data analytics solutions in order to tackle fraud, a new survey has found.

Research by Cloudera and Argyle Data noted that fraudulent activities are one of the biggest challenges for the industry, with telcos in the US alone losing around $38 billion a year in revenue to this.

Therefore, any technologies that these enterprises can put in place to help identify suspicious activity and put a stop to it before it becomes a major issue will be hugely valuable, and increasingly, Hadoop is seen as the answer.

Nine out of ten telcos (90 per cent) attending a recent webinar organised by Cloudera and Argyle Data stated they intend to use Hadoop to assist in their fraud prevention strategies.

However, just a third (34 per cent) said they currently have a platform in place for this, which indicates there is still a long way to go for the sector as a whole as they try to identify the best use cases for the technology.

Vijay Raja, solutions marketing manager at Cloudera, noted: "Fraud prevention is a textbook use case for Hadoop-based analytics because the ROI is immediately visible.  Real-time machine learning relies on large amounts of data to detect sophisticated revenue threats."

Platforms that are able to combine real-time analytics, machine learning and graphical visibility tools are essential in countering telecoms fraud. These solutions enable analysts to spot fraud attempts as they happen, which traditional monitoring systems can struggle to achieve.

As today's sophisticated, high volume attacks can cost communication service providers millions of dollars in revenue in a matter of minutes, being able to detect fraud quickly will be essential.

Arshak Navruzyan, vice-president of product management at Argyle Data, said: "Unsupervised machine learning delivers everything telco fraud analysts need to be efficient at and deliver immediate ROI."