Strata + Hadoop World – San Jose


Posted By : Sharon Kirkham
Categories :#AnalyticsNews, Blog

The Kognitio team had a great trip to Strata + Hadoop World in San Jose last week, and we would like to say a big thank you to everyone who stopped by for a chat about getting enterprise-level performance for their SQL on Hadoop. We look forward to hearing from you when you try out Kognitio on Hadoop.

At the start of the conference we released our benchmarking whitepaper, in which Kognitio outperformed Impala and Spark in a TPC-DS benchmarking exercise. This proved to be of great interest and kept us all really busy on the stand. Conversations ranged from people who have been using Hadoop for a while and are having problems serving data to end-user applications such as Tableau and Qlik Sense, right through to those who are just starting out on their Hadoop journey and wanted to understand what Kognitio can bring to their solution stack.

The subject matter of the conference sessions indicates that a period of consolidation is under way within the Apache® Hadoop® solution stack. Most sessions discussed how to get the most from the more established projects and the challenges of enterprise adoption. There was very little new research presented, which was a little disappointing.


Marcel Kornacker and Mostafa Mokhtar from Cloudera presented a really interesting talk on optimising Impala performance. They had also been using the TPC-DS query set for benchmarking, but obviously had to use a cut-down version of it (75 of the 99 queries). The optimisation details will be useful for us to follow for Impala when we do the next round of benchmarking after Kognitio 8.2 is released in April. Their benchmarks were at the 1TB and 10TB scale. Increasing scale to 10TB and concurrency above 10 streams is something that we would definitely like to do during the next set of benchmarks.

From a maths perspective it was great to see Bayesian inference in the data science mix. Michael Lee Williams from Fast Forward Labs presented a great overview. I will certainly be checking out some of the algorithms and tools with a view to parallelising them within Kognitio’s external scripting framework.

Data streaming also continues to be at the forefront of the conference. It was clear from the number of sessions that more companies (such as Capital One) have experiences they want to share, as well as plenty of contributions from established technology leaders such as Confluent. It is certainly something that we are thinking about here.

If you didn’t make it to our booth at San Jose we hope to see you at one of these upcoming events:

DWS17, Munich

We’ll be on Booth #1003.

See us at the next Strata Data Conference in London

23-25 May 2017

Booth #511.




Posted By : admin
Categories :Benchmarks, Blog
Following on from my previous blog on 12th Oct 2016, we have recently upgraded CDH from 5.7.1 to 5.8.2 and have been concentrating on getting the TPC-DS benchmarks up and running for Spark.

Spark 1.6 comes with the CDH distribution, but we also installed the Spark 2.0 beta available from Cloudera.

In the previous blog we saw there had been a marked improvement in SQL maturity with Spark 2.0 (compared to 1.6) now supporting the SQL syntax of all 99 TPC-DS queries when functionally tested on a small 1GB data set. Therefore we are going to concentrate on Spark 2.0 performance here.


Challenges of running Spark 2.0 over 1TB

Getting the full TPC-DS benchmark to run on our 12 node cluster at the 1TB scale has proved challenging. Initially we ran Spark 2.0 with the default "out of the box" configuration, which resulted in many queries failing to complete due to out-of-memory errors. We made a single configuration change, increasing the spark.yarn.executor.memoryOverhead setting from the default 384MB to 1GB. This allowed most of the queries to execute, although we still had 11 that would not complete. The results (seen on the right) clearly show Spark 2.0 is significantly slower than both Kognitio on Hadoop and Impala version 2.6 for the TPC-DS queries.
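For reference, that single change is just one line of Spark configuration; a minimal sketch of the equivalent `spark-defaults.conf` entry (this particular setting takes its value in MB) would be:

```
# Raise the YARN executor memory overhead from the 384MB default to 1GB
spark.yarn.executor.memoryOverhead  1024
```

The same value can also be passed per job via `--conf` on the spark-submit command line rather than set cluster-wide.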

We must acknowledge that we need to configure Spark 2.0 further so that we can ensure more of the queries complete over 1TB within our cluster, but this is an in-depth tuning exercise that needs to be carried out for our specific hardware configuration. This is made clear in similar TPC-DS benchmarks carried out by other researchers, such as IBM fellow Berni Schiefer.



TPC-DS benchmarks for Kognitio on Hadoop, Impala and Spark

Spark is certainly some way off the “deploy and go” experience you get with Kognitio, which needs no configuration other than resource allocation via YARN and a one-off creation of memory images of the data. Even assuming we get similar improvements when Spark 2.0 is deeply tuned, the results are still going to be significantly slower than Kognitio and Impala.

While at the Spark Summit Europe this month I attended a talk given by Berni, who has recently run the same TPC-DS benchmark over larger data (100TB) on a high-performance system. It was interesting that some of the problematic queries in his benchmark are the ones that do not complete for Spark 2.0 on our system. This suggests that we are seeing comparable performance over our smaller system, with similar processing issues. We plan to work through some of the deep tuning he outlined and see the impact on our benchmark.

We are also currently carrying out the more realistic enterprise scenario of running multiple query stream performance benchmarks. This will be the subject of our next benchmark blog so watch this space.



Posted By : admin
At the recent Strata conference in New York we received a lot of interest in the informal benchmarking we have been carrying out that compares Kognitio on Hadoop to some other SQL on Hadoop technologies. We have decided to formalise the benchmarking process by producing a paper detailing our testing and results. In the meantime, we will be releasing intermediate results in this blog. Preliminary results show Kognitio comes out top on SQL support and single query performance is significantly faster than Impala. Read on for more details.

It is clear from recent conversations that many organisations have issues using the tools in the standard Hadoop distributions to support enterprise level SQL on data in Hadoop. This is caused by a number of issues including:

  • SQL maturity – some products cannot handle all the SQL generated by developers and/or third party tools. They either do not support the SQL, or produce very poor query plans
  • Query performance – queries that are supported perform poorly even under single user workload
  • Concurrency – products cannot handle concurrent mixed workload well in terms of performance and give errors when under load

Bearing in mind the types of workload we have been discussing (primarily BI and complex analytics) we decided to initially concentrate on the TPC-DS benchmark. This is a well-respected, widely used query set that is representative of the type of query that seems to be most problematic. The TPC framework is also designed for benchmarking concurrent workloads.

Currently we are testing against Hive, Impala and SparkSQL as delivered in Cloudera 5.7.1 using a 12 node cluster. We will shortly be upgrading our test cluster to the most recent release of Cloudera before running the main benchmarks for the paper. We have also done some initial testing of SparkSQL 2.0 on a small HortonWorks cluster and plan to be including the Cloudera beta of SparkSQL 2.0 in the performance tests.

SQL Maturity

A common theme we’ve heard is that one of the major pain points in Hadoop adoption is the need to migrate existing SQL workloads to work on data in Hadoop. With this in mind we initially looked at the breadth of SQL that each product will execute before moving on to performance. We have categorised each of the 99 TPC-DS queries as follows:

  • Runs “out of the box” (no changes needed)
  • Minor syntax changes – such as removing reserved words or “grammatical” changes
  • Long running – SQL compiles but query doesn’t come back within 1 hour
  • Syntax not currently supported

If a query requires major changes to run, it is considered not supported (see the TPC-DS documentation).

Technology         | Out of the Box | Minor Changes | Long Running | Not Supported
Kognitio on Hadoop | 76             | 23            | 0            | 0
Hive 1             | 30             | 8             | 6            | 55
Impala             | 55             | 18            | 2            | 24
Spark 1.6          | 39             | 12            | 3            | 43
Spark 2.0          | 72             | 25            | 1            | 1

The above table shows that many products have a long way to go and the step change in SQL supported in Spark 2.0 (from 1.6) shows the developers have recognised this. Kognitio and other technologies that are making the move from the analytical DWH space are at a distinct advantage here as they already possess the mature SQL capability required for enterprise level support.

Query Performance

The results shown right are for a single stream executing over 1TB of data but our goal is to look at concurrent mixed workloads typically found in enterprise applications.

As well as supporting all 99 queries (23 with small syntax changes), initial results for a single query stream show Kognitio is very performant compared to Impala. Kognitio runs 89 of the 99 queries in under a minute, whereas only 58 queries run in under a minute on Impala. However, we recognise the real test comes in increasing the number of streams, so watch this space as we increase concurrency and add Hive and Spark timings too.
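To put those single-stream counts in proportion, a quick back-of-the-envelope calculation using the figures above:

```python
# Share of the 99 TPC-DS queries each engine completed in under a minute,
# using the single-stream counts reported in this post.
total = 99
under_minute = {"Kognitio on Hadoop": 89, "Impala": 58}

shares = {tech: round(100 * n / total) for tech, n in under_minute.items()}
for tech, pct in shares.items():
    print(f"{tech}: {pct}% of queries under 60s")
```

That works out at roughly 90 per cent of the query set for Kognitio against roughly 59 per cent for Impala in this single-stream test.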


A bit about how we run the tests

We’ve developed a benchmarking toolkit based around the TPC framework which can be used to easily test concurrent query sets across technologies on Hadoop platforms. We designed this modular toolkit to allow testers to develop their own benchmark tests, and are planning to make it available on GitHub in the coming weeks once we have finished some “How to Use” documentation.
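The toolkit itself isn’t public yet, so purely as an illustration of the idea, a minimal concurrent query-stream runner might look like the sketch below. The `execute` callback, which stands in for whatever submits SQL to the engine under test (e.g. over JDBC/ODBC), is an assumption of ours, not part of any released toolkit.

```python
import concurrent.futures
import time

def run_stream(stream_id, queries, execute):
    # Run one query stream serially, timing each query with a monotonic clock.
    timings = []
    for query in queries:
        start = time.monotonic()
        execute(query)
        timings.append((stream_id, query, time.monotonic() - start))
    return timings

def run_concurrent(streams, execute):
    # One worker thread per stream, mimicking several BI users hitting the
    # same engine at once; results from all streams are flattened together.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [pool.submit(run_stream, i, qs, execute)
                   for i, qs in enumerate(streams)]
        return [t for f in futures for t in f.result()]
```

Per-query timings tagged with a stream id are enough to derive both the single-stream numbers above and the multi-stream concurrency results we plan to publish next.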

In progress and to come

As I write this, there are a few caveats to the interim results presented here:

1. We still need to complete the syntax changes for Hive, so those figures may change in the final paper.

2. The single query not supported by Spark 2.0 did execute, but a Cartesian join was used, leading to incorrect results.

We are planning to move on to full concurrent workloads in the next week and will publish these and the toolkit soon.

Hadoop and Spark ‘key big data platforms’ in the UK


Posted By : admin
Categories :#AnalyticsNews

Hadoop continues to lead the way as the preferred big data analytics platform for organisations in the UK, but Spark is starting to make inroads into its dominance.

This is according to recent research by Computing magazine, which found almost six out of ten respondents (59 per cent) believed their company will be using Hadoop as its primary analytical tool in 18 months' time.

This compares to 17 per cent who named Spark as the way forward for their business, while Kinesis (seven per cent), Storm (four per cent) and Flink (two per cent) received lower levels of interest. One in four IT professionals stated their business would be using another solution for big data processing.

However, the research found that more advanced organisations – described as those businesses that are leading the way when it comes to adopting and using technology to drive change – were more likely to favour Spark over Hadoop, suggesting that it is catching up.

Computing did offer a note of caution, observing that many businesses use Spark and Hadoop in conjunction with one another, so it may well be the case that even as interest in Spark grows, Hadoop is unlikely to be replaced any time soon. However, for the purposes of the survey, respondents were asked to choose only one processing platform, in order to see which is having the most impact on professionals' thinking.

Interviews conducted by Computing also saw Spark come up frequently, with the speed of the solution a commonly cited benefit. One chief technology officer noted that although it is much easier to find people with experience and understanding of Hadoop, tools such as Spark and Storm are "much more attractive and faster".

As the capabilities of Spark have grown, it has also become more attractive to companies with needs for both batch and real-time processing. One data scientist Computing spoke to noted that if users are looking to deploy new solutions, they will increasingly turn straight to Spark, rather than use tools such as MapReduce.

How can you give your big data the Spark it needs?


Posted By : Paul Groom
Categories :Guides


For many firms, one of the biggest challenges when they are implementing big data analytics initiatives is dealing with the vast amount of information they collect in a timely manner.

Getting quick results is essential to the success of such a project. The most advanced users of the technology can gain real-time insights into the goings-on within their business and in the wider market, and enterprises that lack these capabilities will struggle to compete. While the most alert companies can spot potential opportunities even before they fully emerge, those opportunities may already have passed by the time a slower business’ analytics has even noticed them.

So what can companies do to ensure they are not falling behind with their big data? In many cases, the speed of their analytics is limited by the infrastructure they have in place. But there are a growing number of solutions now available that can address these issues.

Spark and more

One of the most-hyped of these technologies is Apache Spark. This is open-source software that many are touting as a potential replacement for Hadoop. Its key feature is much faster data processing speeds – claimed to be up to ten times faster on disk than Hadoop MapReduce, or 100 times faster for in-memory operations.

In today’s demanding environment, this speed difference could be vital. With optional features for SQL, real-time stream processing and machine learning that promise far more than what generic Hadoop is capable of, these integrated components could be the key to quickly unlocking the potential of a firm’s data.

However, it shouldn’t be assumed that Spark is the only option available for companies looking to boost their data operations. There is a growing range of in-memory platforms (Kognitio being one!) and open-source platforms available to help with tasks like analytics and real-time processing, such as Apache Flink. And Hadoop itself should not be neglected: technologies like Spark should not be seen as a direct replacement until their feature sets mature, as they do not perform exactly the same tasks and can – and often should – be deployed together as part of a comprehensive big data solution.

Is your big data fast enough?

It’s also important to remember that no two businesses are alike, so not every firm will benefit from the tech in the same way. When deciding if Spark or analytical platforms like it are for you, there are several factors that need to be considered.

For starters, businesses need to determine how important speedy results are to them. If they have a need for timely or real-time results – for instance as part of a dynamic pricing strategy, or to monitor financial transactions for fraud – then the speed provided by Spark and its like will be essential.

As technology such as the Internet of Things becomes more commonplace in businesses across many industries, the speed provided by Spark and others will be beneficial. If companies are having to deal with a constant stream of incoming data from sensors, they will need an ability to deal with this quickly and continuously.

Giving your big data a boost

Turning to new technologies such as Spark or Flink can really help improve the speed, flexibility and efficiency of a Hadoop deployment. One of the key reasons for this is the fact that they take full advantage of in-memory technology.

In traditional analytics tools, information is stored on, read from and written to physical storage such as hard disk drives throughout the process – MapReduce will do this many times for a given job. This is typically one of the biggest bottlenecks in the processing operation and therefore a key cause of slow, poorly performing analytics.

However, technologies such as Spark conduct the majority of their tasks in-memory – copying the information into much faster RAM and keeping it there as much as possible, where it can be accessed almost instantaneously. As the cost of memory continues to fall, these powerful capabilities are now within much easier reach of many businesses, and at a scale not previously thought possible.
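A toy illustration of that bottleneck (plain Python rather than Spark itself): making repeated passes over a dataset that is re-read from disk every time, versus the same passes over a copy already held in RAM. The file path and sizes here are made up for the sketch.

```python
import os
import tempfile

# Write a small dataset to disk, one value per line.
rows = [str(i) for i in range(100_000)]
path = os.path.join(tempfile.mkdtemp(), "rows.txt")
with open(path, "w") as f:
    f.write("\n".join(rows))

def passes_from_disk(n_passes):
    # Every pass pays the cost of re-reading the file from storage.
    total = 0
    for _ in range(n_passes):
        with open(path) as f:
            total += sum(1 for _ in f)
    return total

def passes_from_memory(n_passes):
    # The data is read once and kept in RAM; later passes are near-instant.
    cached = rows
    return sum(len(cached) for _ in range(n_passes))
```

Both functions count the same 100,000 rows per pass; the in-memory version simply avoids paying the I/O cost on every iteration, which is the essence of what Spark’s in-memory processing (and in-memory platforms generally) exploit.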

What skills will your organisation need to make big data work?


Posted By : Paul Groom
Categories :Guides


Successfully deploying a big data analytics programme can bring many benefits to a business, from better productivity and a clearer insight into customers, to higher revenue and reduced time to market for new products. But putting in place effective hardware and software solutions to achieve this will only be half the task.

The real benefits of big data can only be realised if companies have the appropriately skilled staff on board to manage such operations. These personnel will not only be essential in initiating and building an advanced big data solution, but ensuring that the right questions are being asked of it and the answers are being fed back to the people best placed to act on them.

A fight for talent

As the importance of skilled personnel is being increasingly widely recognised, competition for the best talent is fierce, particularly as individuals with the right combination of expertise and experience are proving hard to come by.

A 2014 report by Nesta and the Royal Statistical Society found 80 per cent of UK businesses said it is difficult to find the skills they need to meet growing demand. Typically, those with the right range of talents will be expensive to hire, while upskilling existing staff can be a challenging and time-consuming process.

This has come about as a result of the new generation of data analytics solutions, which are leading to organisations moving beyond traditional business intelligence activities. Today’s environment calls for solutions that are able to accurately predict future demand and customer behaviour, rather than just reacting to previous events. This will require professionals who are able to think outside the box and understand how to set up and engage with big data solutions to gain valuable and relevant answers.

The skills needed for success

The pinnacle of this will be the appointment of data scientists – experts with the right mix of understanding data, coding skills, statistical knowledge and a thorough understanding of the wider business environment. However, getting to this high value point can be a lengthy process and one that requires significant investment of resources.

Whether big data talent is being brought in or nurtured from within a company, there are new skills that will be essential to success. According to a 2014 survey from recruitment company Dice, experience with NoSQL (Not-only-SQL) is the area in highest demand, with interest rising by more than 50 per cent on the previous year.

An understanding of key Hadoop components such as MapReduce, HDFS, YARN and Spark will be necessary for businesses to succeed in their big data implementation. Hadoop can be tricky to use due to a steep learning curve, but as it will have a central role to play in any big data applications, a strong grounding in the technology will be hugely advantageous.

As well as experience in programming for general-purposes languages such as Java, Python or C++, personnel with expertise in statistical and quantitative analysis tools like R, SAS and Stata will have a key role to play in successful initiatives.

Beyond the technology

Even if companies are able to find staff with all the coding and technical skills needed to operate a big data solution, it will not be much use if these professionals are isolated away in an IT department, where they have little idea or exposure of how their work affects other parts of the organisation.

That’s why it’s important that businesses do not overlook so-called ‘soft’ skills when considering what will be required, such as a strong ability to ask questions, communicate and understand the business value and implications of any decision.

One of the best things organisations can do to support this and ensure they have the skills they need on hand is to create a dedicated big data team. The development of a big data competency centre – which may be on a departmental or enterprise-wide level – ensures that talent is placed in a centralised location, making it an integral part of a company’s decision-making, rather than an afterthought. Bringing skills together fosters innovation and peer review for improved creativity and faster delivery.

This also helps strengthen corporate attitudes to big data. By keeping the technology and capability visible across all parts of an organisation, it becomes an intrinsic part of a company’s culture. In the long term, this is likely to prove very beneficial, as integrating data analytics into every aspect of operations will help give all staff a better understanding of what the business is capable of and what they need to do to empower beneficial change.

Why is Spark on the rise? Does it meet business needs?


Posted By : Paul Groom
Categories :Blog

With Strata + Hadoop World soon upon us, it is clear that Spark is the hot topic – see our #AnalyticsNews article.

But Why? What does it represent?

I would posit that this is driven by business wanting a lot more from Hadoop… essentially a lot more throughput. Whether it be better concurrency, reduced latency or better job efficiency, business is seeing value and thus demanding more use. Hooray, say the engineers! More APIs to play with and code to develop. But this drive will be tempered by business impatience – “Sorry? How long to code that analytical model efficiently so it does not dominate the cluster resources?”

Business managers mentally extrapolate new tech to commodity in a thought. They just want to switch on and use the analytics capability on their swelling lake of data. They just want to plug in standard tools and run studies… “What? I have to wait for you to develop this?”

Spark is young and developing and will become a formidable force… in time. I liken it to a box of Lego(TM): the developer sees the box of loose bricks and all its possibilities; the business user wants the finished, ready-built model.
The business manager wants pre-packaged; ready to use out of the box; flexible; scalable; highly efficient; low latency; high throughput; engineered for analytics. It’s so much easier for the business if they can just switch on and use. Sorry Hadoop developers, there’s an easier way. Get to know Kognitio.

Paul Groom, COO

Data Science and Big Data Challenges


Posted By : Paul Groom
Categories :#AnalyticsNews, Blog

71 percent of data scientists believe their jobs have grown more difficult

Interesting study and observations on data scientists suffering from increasing difficulty with data, systems and parallelism for analytics. But this is partly brought on by trying to self-build from a diverse set of components that do not always fit together particularly well, or at least need more than a good nudge to click together. Analytics has to move to parallelism to cope with increasing volume and variety, and to keep pace with the velocity of business needs – but the parallelism in Hadoop, Spark and the like is not yet refined or complete enough to easily support complex analytics.

Paradigm4 also found that 35 percent of data scientists who tried Hadoop or Spark have stopped using it.

Kognitio has experience in helping businesses get over these hurdles, with world-leading MPP analytical platform technology that integrates directly with Hadoop and other big data stores, and with Analytical Services that help business teams solve requirements without a lot of technical fuss – analytics for all, at scale, today.