sql on hadoop benchmark tests
Posted by: admin | Categories: Benchmarks, Blog
Following on from my previous blog of 12th October 2016: we have recently upgraded our CDH cluster from 5.7.1 to 5.8.2 and have been concentrating on getting the TPC-DS benchmarks up and running for Spark.

Spark 1.6 comes with the CDH distribution, but we also installed the Spark 2.0 beta available from Cloudera.

In the previous blog we saw a marked improvement in SQL maturity with Spark 2.0, which (unlike 1.6) supports the SQL syntax of all 99 TPC-DS queries when functionally tested on a small 1GB data set. We are therefore going to concentrate on Spark 2.0 performance here.


Challenges of running Spark 2.0 over 1TB

Getting the full TPC-DS benchmark to run on our 12-node cluster at the 1TB scale has proved challenging. Initially we ran Spark 2.0 with the default “out of the box” configuration, which resulted in many queries failing to complete due to out-of-memory errors. We then made a single configuration change, increasing the spark.yarn.executor.memoryOverhead setting from the default 384MB to 1GB. With this change most of the queries executed, although 11 still would not complete. The results (seen on the right) clearly show Spark 2.0 is significantly slower than both Kognitio on Hadoop and Impala version 2.6 for the TPC-DS queries.
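For reference, a minimal sketch of that single change in PySpark, applied when the session is created (the application name is a placeholder; the 1024 value, in megabytes, mirrors the 1GB override described above):

```python
# Sketch only: raise the per-executor off-heap overhead from the
# 384MB default to 1GB on a YARN-backed Spark 2.0 installation.
# The value of spark.yarn.executor.memoryOverhead is given in MB.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tpcds-1tb")  # placeholder app name
         .config("spark.yarn.executor.memoryOverhead", "1024")
         .getOrCreate())
```

The same override can equally be passed on the spark-submit command line with `--conf`; the important point is that it must be set before the executors are launched.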

We must acknowledge that Spark 2.0 needs further configuration before more of the queries will complete over 1TB on our cluster, but this is an in-depth tuning exercise that needs to be carried out for our specific hardware configuration. This is made clear in similar TPC-DS benchmarks carried out by other researchers, such as IBM Fellow Berni Schiefer.



TPC-DS benchmarks for Kognitio on Hadoop, Impala and Spark

Spark is certainly some way off the “deploy and go” experience you see with Kognitio, which needs no configuration other than resource allocation via YARN and a one-off creation of memory images of the data. Even assuming we achieve similar improvements once Spark 2.0 is deeply tuned, its results are still going to be significantly slower than those of Kognitio and Impala.

While at Spark Summit Europe this month I attended a talk given by Berni, who has recently run the same TPC-DS benchmark over much larger data (100TB) on a high-performance system. Interestingly, some of the problematic queries in his benchmark are the very ones that do not complete for Spark 2.0 on our system, suggesting we are seeing comparable behaviour, with similar processing issues, on our smaller system. We plan to work through some of the deep tuning he outlined and see its impact on our benchmark.

We are also carrying out the more realistic enterprise scenario of running multiple concurrent query streams. This will be the subject of our next benchmark blog, so watch this space.



kognitio benchmark tests
Posted by: admin
At the recent Strata conference in New York we received a lot of interest in the informal benchmarking we have been carrying out comparing Kognitio on Hadoop to some other SQL-on-Hadoop technologies. We have decided to formalise the benchmarking process by producing a paper detailing our testing and results; in the meantime, we will be releasing intermediate results in this blog. Preliminary results show Kognitio comes out top on SQL support, and its single-query performance is significantly faster than Impala's. Read on for more details.

It is clear from recent conversations that many organisations struggle to support enterprise-level SQL on data in Hadoop using the tools in the standard Hadoop distributions. The problems include:

  • SQL maturity – some products cannot handle all the SQL generated by developers and/or third-party tools: they either do not support the syntax or produce very poor query plans
  • Query performance – even queries that are supported perform poorly under a single-user workload
  • Concurrency – products handle concurrent mixed workloads poorly, both in terms of performance and by giving errors under load

Bearing in mind the types of workload we have been discussing (primarily BI and complex analytics), we decided to concentrate initially on the TPC-DS benchmark. This is a well-respected, widely used query set that is representative of the type of query that seems to be most problematic, and the TPC framework is also designed for benchmarking concurrent workloads.

Currently we are testing against Hive, Impala and SparkSQL as delivered in Cloudera 5.7.1, using a 12-node cluster. We will shortly upgrade our test cluster to the most recent Cloudera release before running the main benchmarks for the paper. We have also done some initial testing of SparkSQL 2.0 on a small HortonWorks cluster, and we plan to include the Cloudera beta of SparkSQL 2.0 in the performance tests.

SQL Maturity

A common theme we’ve heard is that one of the major pain points in Hadoop adoption is the need to migrate existing SQL workloads to work on data in Hadoop. With this in mind, we initially looked at the breadth of SQL that each product will execute before moving on to performance. We have categorised each of the 99 TPC-DS queries as follows:

  • Runs “out of the box” (no changes needed)
  • Minor syntax changes – such as removing reserved words or making “grammatical” changes
  • Long running – the SQL compiles but the query does not return within one hour
  • Syntax not currently supported

If a query requires major changes to run, it is considered not supported (see the TPC-DS documentation).

Technology           Out of the Box   Minor Changes   Long Running   Not Supported
Kognitio on Hadoop               76              23              0               0
Hive [1]                         30               8              6              55
Impala                           55              18              2              24
Spark 1.6                        39              12              3              43
Spark 2.0 [2]                    72              25              1               1

The above table shows that many products still have a long way to go, and the step change in SQL support from Spark 1.6 to 2.0 shows the developers have recognised this. Kognitio and other technologies making the move from the analytical DWH space are at a distinct advantage here, as they already possess the mature SQL capability required for enterprise-level support.
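Reading the table as counts out of 99, the share of queries each product runs without major changes (out of the box plus minor changes) can be tallied quickly; a small sketch:

```python
# Counts taken from the table above: (out of the box, minor changes).
# "Long running" queries also compile, but are excluded here since
# they do not return within the one-hour limit.
runnable = {
    "Kognitio on Hadoop": (76, 23),
    "Hive": (30, 8),
    "Impala": (55, 18),
    "Spark 1.6": (39, 12),
    "Spark 2.0": (72, 25),
}
for product, (oob, minor) in runnable.items():
    print(f"{product}: {oob + minor}/99 queries runnable")
```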

Query Performance

The results shown are for a single stream executing over 1TB of data, but our goal is to look at the concurrent mixed workloads typically found in enterprise applications.

As well as supporting all 99 queries (23 with small syntax changes), initial results for a single query stream show Kognitio performs very well compared to Impala: Kognitio runs 89 of the 99 queries in under a minute, whereas only 58 run in under a minute on Impala. However, we recognise the real test comes with an increasing number of streams, so watch this space as we increase concurrency and add Hive and Spark timings too.


A bit about how we run the tests

We’ve developed a benchmarking toolkit based around the TPC framework which can be used to easily test concurrent query sets across technologies on Hadoop platforms. We designed this modular toolkit to allow testers to develop their own benchmark tests, and we plan to make it available on GitHub in the coming weeks, once we have finished some “how to use” documentation.
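The core idea behind a concurrent-stream runner of this kind can be sketched as follows. This is a hypothetical illustration, not the actual toolkit: `run_query` is a stub standing in for a real submission to an engine (e.g. over JDBC/ODBC), and the per-stream query reordering that TPC-DS prescribes is skipped.

```python
# Hypothetical sketch of a concurrent-stream runner (not the real toolkit):
# launch several query streams in parallel and time each query in each stream.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # Stub: a real runner would submit `sql` to the engine under test
    # and block until the result set is returned.
    time.sleep(0.01)

def run_stream(stream_id, queries):
    timings = []
    for name, sql in queries:
        start = time.monotonic()
        run_query(sql)
        timings.append((stream_id, name, time.monotonic() - start))
    return timings

def run_benchmark(queries, n_streams=4):
    # Each stream executes the full query set concurrently with the others.
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        futures = [pool.submit(run_stream, s, queries) for s in range(n_streams)]
        return [t for f in futures for t in f.result()]

results = run_benchmark([("query01", "SELECT 1"), ("query02", "SELECT 2")],
                        n_streams=2)
print(len(results))  # 2 streams x 2 queries = 4 timings
```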

In progress and to come

As I write this, there are a few caveats to the interim results presented here:

1. We still need to complete the syntax changes for Hive, so these figures may change in the final paper.

2. The single query that is not supported by Spark 2.0 did execute, but a Cartesian join was used, leading to incorrect results.

We are planning to move on to full concurrent workloads in the next week and will publish these and the toolkit soon.