It is clear from recent conversations that many organisations struggle to support enterprise-level SQL on data in Hadoop using the tools in the standard Hadoop distributions. This is down to a number of issues, including:
- SQL maturity – some products cannot handle all the SQL generated by developers and/or third-party tools: they either do not support the SQL or produce very poor query plans
- Query performance – queries that are supported perform poorly even under a single-user workload
- Concurrency – products handle concurrent mixed workloads badly, with degraded performance and errors under load
Bearing in mind the types of workload we have been discussing (primarily BI and complex analytics), we decided to concentrate initially on the TPC-DS benchmark. This is a well-respected, widely used query set that is representative of the type of query that seems to be most problematic. The TPC framework is also designed for benchmarking concurrent workloads.
Currently we are testing against Hive, Impala and SparkSQL as delivered in Cloudera 5.7.1, using a 12 node cluster. We will shortly be upgrading our test cluster to the most recent release of Cloudera before running the main benchmarks for the paper. We have also done some initial testing of SparkSQL 2.0 on a small HortonWorks cluster and plan to include the Cloudera beta of SparkSQL 2.0 in the performance tests.
A common theme we’ve heard is that one of the major pain points in Hadoop adoption is the need to migrate existing SQL workloads to work on data in Hadoop. With this in mind, we initially looked at the breadth of SQL that each product will execute before moving on to performance. We have categorised each of the 99 TPC-DS queries as follows:
- Runs “out of the box” (no changes needed)
- Minor syntax changes – such as removing reserved words or “grammatical” changes
- Long running – the SQL compiles but the query does not return within one hour
- Syntax not currently supported
If a query requires major changes to run, it is considered not supported (see the TPC-DS documentation).
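To make the categorisation concrete, here is an illustrative sketch of the classification logic (this is hypothetical harness code, not our actual toolkit; the `run_query` callable and its behaviour are assumptions):

```python
from enum import Enum

ONE_HOUR = 3600  # a query still running after this many seconds is abandoned

class Category(Enum):
    OUT_OF_THE_BOX = "Runs out of the box"
    MINOR_CHANGES = "Minor syntax changes"
    LONG_RUNNING = "Long running"
    NOT_SUPPORTED = "Syntax not currently supported"

def categorise(run_query, sql, rewritten_sql=None):
    """Classify one TPC-DS query for a given engine.

    run_query(sql, timeout) is a hypothetical callable: it returns True if
    the query ran successfully, False on a syntax/compile error, and raises
    TimeoutError if the one-hour limit is hit.  rewritten_sql, if supplied,
    is the minimally changed version of the query (reserved words renamed,
    small grammatical fixes).
    """
    try:
        if run_query(sql, timeout=ONE_HOUR):
            return Category.OUT_OF_THE_BOX
    except TimeoutError:
        return Category.LONG_RUNNING
    # The original text failed to compile: try the minimal rewrite, if any
    if rewritten_sql is not None:
        try:
            if run_query(rewritten_sql, timeout=ONE_HOUR):
                return Category.MINOR_CHANGES
        except TimeoutError:
            return Category.LONG_RUNNING
    return Category.NOT_SUPPORTED
```

Note that a query only counts as "long running" once it compiles; a query needing more than the minimal rewrite falls through to "not supported", matching the rule above.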
|Technology|Out of the Box|Minor Changes|Long Running|Not Supported|
|---|---|---|---|---|
|Kognitio on Hadoop|76|23|0|0|
|Spark 2.0|72|25|1|1|
The above table shows that many products have a long way to go, and the step change in SQL support in Spark 2.0 (from 1.6) shows that its developers have recognised this. Kognitio and other technologies making the move from the analytical DWH space are at a distinct advantage here, as they already possess the mature SQL capability required for enterprise-level support.
As well as supporting all 99 queries (23 with small syntax changes), initial results for a single query stream show that Kognitio performs very well compared to Impala: Kognitio runs 89 of the 99 queries in under a minute, whereas only 58 queries run in under a minute on Impala. However, we recognise that the real test comes with an increasing number of streams, so watch this space as we increase concurrency and add Hive and Spark timings too.
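The "under a minute" counts quoted above are simple summaries over per-query elapsed times. A minimal sketch of that summary (the timing numbers below are toy values, not our measured results):

```python
def under_a_minute(timings):
    """Count queries whose elapsed time is below 60 seconds.

    `timings` maps a query name to its elapsed time in seconds; a query
    that never completed can be recorded as float('inf').
    """
    return sum(1 for secs in timings.values() if secs < 60)

# Toy example only -- illustrative values, not measured figures
sample = {"query01": 4.2, "query14": 250.0, "query72": float("inf"), "query96": 1.1}
```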
A bit about how we run the tests
We’ve developed a benchmarking toolkit based around the TPC framework which can be used to easily test concurrent query sets across technologies on Hadoop platforms. We designed this modular toolkit to allow testers to develop their own benchmark tests, and we plan to make it available on GitHub in the coming weeks, once we have finished some “How to Use” documentation.
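To give a flavour of how concurrent query streams can be driven (a rough sketch only; the `execute` callable and the stream layout here are assumptions, not the toolkit's actual interface), each TPC-DS stream runs its queries serially while all streams run in parallel:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_stream(stream_id, queries, execute):
    """Run one query stream serially, timing each query.

    `queries` is a list of (name, sql) pairs; `execute` is a hypothetical
    callable that submits one SQL statement to the engine under test and
    blocks until it completes.
    """
    timings = []
    for name, sql in queries:
        start = time.monotonic()
        execute(sql)
        timings.append((stream_id, name, time.monotonic() - start))
    return timings

def run_concurrent_streams(streams, execute, max_workers=None):
    """Launch all streams at once; within a stream, queries run in order."""
    with ThreadPoolExecutor(max_workers=max_workers or len(streams)) as pool:
        futures = [pool.submit(run_stream, i, qs, execute)
                   for i, qs in enumerate(streams)]
        # Gather (stream_id, query_name, seconds) tuples from every stream
        return [t for f in futures for t in f.result()]
```

Scaling the test is then just a matter of adding streams, which is how we intend to push concurrency up in the next round of results.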
In progress and to come
As I write this, some of the results presented here are still interim:
1. We still need to complete the syntax changes for Hive, so those figures may change in the final paper.
2. The single query that Spark 2.0 does not support did execute, but a Cartesian join was used, leading to incorrect results.
We are planning to move on to full concurrent workloads in the next week and will publish these and the toolkit soon.