Strata + Hadoop World – San Jose

22

Mar
2017
Posted By : Sharon Kirkham Comments are off
Categories :#AnalyticsNews, Blog

The Kognitio team had a great trip to Strata + Hadoop World in San Jose last week and we would like to say a big thank you to everyone who stopped by for a chat about getting enterprise level performance for their SQL on Hadoop. We look forwarding to hearing from you when you try out Kognitio on Hadoop.

At the start of the conference we released our benchmarking whitepaper in which Kognitio outperformed Impala and Spark in a TPC-DS benchmarking exercise. This proved to be of great interest and kept us all really busy on the stand. Conversations ranged from people who have been using Hadoop a while and are having problems serving data to their end-user applications such as Tableau and Qliksense right through to those that are just starting out on their Hadoop journey and wanted to understand what Kognitio can bring to their solution stack.

The subject matter of the conference sessions indicates that there is a period of consolidation going on within the Apache® Hadoop® solution stack. Most topics were discussing how to get the most from more established projects and the challenges of enterprise adoption. There was very little new research presented which was a bit disappointing.

 

Marcel Kornacker and Mostafa Mokhtar from Cloudera presented a talk on optimising Impala performance that was really interesting. They had also been using the TPC-DS query set for benchmarking but obviously had to use a cut down version of the query set (75 out of 99 queries). The optimisation details will be useful for us to follow for Impala when we do the next round of benchmarking after Kognitio 8.2 is released in April. Their benchmarks were at the 1 TB and 10TB scale. Increasing scale to 10TB and concurrency above 10 streams is something that we would definitely like to do during the next set of benchmarks.

From a maths perspective it was great to see Bayesian inference in the data science mix. Michael Lee Williams from Fast Forward Labs presented a great overview. I will certainly be checking out some of algorithms and tools with a view to parallelising them within Kognitio’s external scripting framework.

Data streaming also continues to be at the forefront of the conference . It was clear from the number of sessions in the conference that more companies (such as Capital One) have experiences they want to share as well as plenty of contributions from established technology leaders such as Confluent. It is certainly something that we are thinking about here.

If you didn’t make it to our booth at San Jose we hope to see you at one of these upcoming events:

DWS17, Munich, Sponsor, Big Data

We’ll be on Booth #1003.

See us at the next Strata Data Conference in London

23-25 May 2017

Booth #511.

 

SQL ON HADOOP BENCHMARKING – Spark 2.0

04

Nov
2016
Posted By : admin Comments are off
sql on hadoop benchmark tests
Categories :Benchmarks, Blog
Following on from my previous blog on 12th Oct 2016. We have recently upgraded our CDH from 5.7.1 to 5.8.2 and have been concentrating on getting the TPC-DS benchmarks up and running for Spark.

Spark 1.6 comes with the CDH distribution but we also installed the Spark 2.0 beta available from the Cloudera.

In the previous blog we saw there had been a marked improvement in SQL maturity with Spark 2.0 (compared to 1.6) now supporting the SQL syntax of all 99 TPC-DS queries when functionally tested on a small 1GB data set. Therefore we are going to concentrate on Spark 2.0 performance here.

 

Challenges of running Spark 2.0 over 1TB

Getting the full TPC-DS benchmark to run on our 12 node cluster at the 1TB scale has proved challenging. Initially we went with running Spark 2.0 with the default “out of the box” configuration. This resulted in many queries failing to complete due to out of memory errors. We made a single configuration change increasing the spark.yarn.executor.memoryOverhead setting from the default 384MB to 1GB. This resulted in most of the queries executing although we still had 11 that would not complete. The results (seen on the right) clearly show Spark 2.0 is significantly slower than both Kognitio on Hadoop and Impala version 2.6 for the TPC-DS queries.

We must acknowledge that we need to configure Spark 2.0 further so that we can ensure more of the queries complete over 1TB within our cluster but this is an in-depth tuning exercise that needs to be carried out for our specific hardware configuration. This is made clear in similar TPC-DS benchmarks carried out by other researchers such as IBM fellow Berni Schiefer (click here).

 

 

TPC-DS benchmarks for Kognitio on Hadoop, Impala and Spark

Spark is certainly some way off the “deploy and go” experience you see with Kognitio which needs no configuration other than resource allocation via YARN and a one-off creation of memory images of the data. Assuming that we will get similar improvements when Spark 2.0 is deeply tuned the results are still going to be significantly slower than Kognitio and Impala.

While at the Spark Summit Europe this month I attended a talk given by Berni who has recently run the same TPC-DS benchmark over larger data (100TB) on a high performance system. It was interesting that some of the problematic queries in his benchmark are those that do not complete for Spark 2.0 on our system. This suggests that we are seeing comparable performance over our smaller system with similar processing issues. We plan to work through some of deep tuning he outlined and see the impact on our benchmark.

We are also currently carrying out the more realistic enterprise scenario of running multiple query stream performance benchmarks. This will be the subject of our next benchmark blog so watch this space.

SQL ON HADOOP BENCHMARKING

12

Oct
2016
Posted By : admin 1 Comment
kognitio benchmark tests
At the recent Strata conference in New York we received a lot of interest in the informal benchmarking we have been carrying out that compares Kognitio on Hadoop to some other SQL on Hadoop technologies. We have decided to formalise the benchmarking process by producing a paper detailing our testing and results. In the meantime, we will be releasing intermediate results in this blog. Preliminary results show Kognitio comes out top on SQL support and single query performance is significantly faster than Impala. Read on for more details.

It is clear from recent conversations that many organisations have issues using the tools in the standard Hadoop distributions to support enterprise level SQL on data in Hadoop. This is caused by a number of issues including:

  • SQL maturity – some products cannot handle all the SQL generated by developers and/or third party tools. They either do not support the SQL, or produce very poor query plans
  • Query performance – queries that are supported perform poorly even under single user workload
  • Concurrency – products cannot handle concurrent mixed workload well in terms of performance and give errors when under load

Bearing in mind the types of workload we have been discussing (primarily BI and complex analytics) we decided to initially concentrate on the TPC-DS benchmark. This is a well-respected, widely used query set that is representative of the type of query that seems to be most problematic. The TPC framework is also designed for benchmarking concurrent workloads.

Currently we are testing against Hive, Impala and SparkSQL as delivered in Cloudera 5.7.1 using a 12 node cluster. We will shortly be upgrading our test cluster to the most recent release of Cloudera before running the main benchmarks for the paper. We have also done some initial testing of SparkSQL 2.0 on a small HortonWorks cluster and plan to be including the Cloudera beta of SparkSQL 2.0 in the performance tests.

SQL Maturity

A common theme we’ve heard is that one of the major pain points in Hadoop adoption is the need to migrate existing SQL workloads to work on data in Hadoop. With this in mind we initially looked at the breadth of SQL that each product will execute before moving onto performance. We have categorised each of the 99 TPC-DS queries as follows

  • Runs “out of the box” (no changes needed)
  • Minor syntax changes – such as removing reserved words or “grammatical” changes
  • Long running – SQL compiles but query doesn’t come back within 1 hour
  • Syntax not currently supported

If a query requires major changes to run, it is considered not supported (see the TPC-DS documentation).

Technology Out of the Box Minor Changes Long Running Not Supported
Kognitio on Hadoop 76 23
Hive 1 30 8 6 55
Impala 55 18 2 24
Spark 1.6 39 12 3 43
Spark 2.0 2 72 25 1 1

The above table shows that many products have a long way to go and the step change in SQL supported in Spark 2.0 (from 1.6) shows the developers have recognised this. Kognitio and other technologies that are making the move from the analytical DWH space are at a distinct advantage here as they already possess the mature SQL capability required for enterprise level support.

Query Performance

The results shown right are for a single stream executing over 1TB of data but our goal is to look at concurrent mixed workloads typically found in enterprise applications.

As well as supporting all 99 queries (23 with small syntax changes) initial results for a single query stream show Kognitio is very performant compared to Impala. Kognitio runs 89 out of the 99 queries in under a minute whereas only 58 queries run in under a minute on Impala. However we recognise the real test comes in increasing the number of streams so watch this space as we increase concurrency and add Hive and Spark timings too.

sql on hadoop benchmark tests

A bit about how we run the tests

We’ve developed a benchmarking toolkit based around the TPC framework which can be used to easily test concurrent query sets across technologies on Hadoop platforms. We designed this modular toolkit to allow testers to develop their own benchmark test and are planning to make the toolkit available on github in the coming weeks once we have finished some “How to Use” documentation.

In progress and to come

As I write this we are still looking at a few interim results presented here:

1. Need to complete syntax changes for Hive so these figures may change in the final paper

2. The single query that is not supported by Spark 2.0 did execute but a Cartesian join was used leading to incorrect results.

We are planning to move on to full concurrent workloads in the next week and will publish these and the toolkit soon.

Facebook

Twitter

LinkedId