All I want for Christmas is – an “honest” benchmark
After a busy November at Big Data London and working with various clients, I've had a few days to catch up on some reading as I think about where to focus our analytics research in 2018. I came across a thought-provoking blog article by Dan Graham at Teradata pointing out some of the pitfalls of benchmarking SQL on Hadoop, and he raised some interesting points.
Like Teradata, until recently Kognitio didn't do any head-to-head benchmarking. In the traditional Data Warehouse (DWH) space, getting hold of full-featured competitive products was difficult, and running benchmarks on comparable hardware was impossible, as many require non-commodity infrastructure.
The release of Kognitio on Hadoop last year moved the benchmarking goalposts for us. We are now in a market (SQL on Hadoop) where products are easily accessible and readily available for evaluation. As Dan rightly points out, running any benchmark comparison on identical hardware is very important, and now we can. In fact, we can go one better and run all the benchmarks on the same Hadoop system. When we embarked on our benchmarking it wasn't because I wanted our team to spend months buried in the bowels of different Hadoop distributions and googling various configuration settings; it was simply so we could answer the most common question we ever get asked:
“How do you compare to product X though?”
Now rather than having to explain why we don’t benchmark, we can talk through our results and start conversations about our approach for getting great SQL on Hadoop performance.
Running a true benchmark
I wholeheartedly agree with Dan on being wary of modified SQL and favourable result selection in benchmarking. If you are using benchmarking papers as a first point of research when choosing a SQL on Hadoop solution, it is worth learning about the design of the benchmark itself. For example, the TPC-DS query set consists of 99 queries specifically designed to emulate a mixed analytic SQL workload. Run properly, it includes parameterization within queries and random ordering of queries within each concurrent stream. Any test using TPC-DS must include all of these features to be considered a true mixed-workload test.
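To make that concrete, here is a minimal Python sketch (not our actual harness, and the query names are placeholders) of the stream-building rule: every concurrent stream runs all 99 queries, but each stream shuffles them into its own random order. The `run_stream` function stands in for submitting queries to a real engine.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# The 99 TPC-DS queries, represented here by placeholder names.
QUERIES = [f"query{n:02d}" for n in range(1, 100)]

def build_stream(seed):
    """A stream runs all 99 queries; each stream gets its own random order."""
    order = QUERIES[:]
    random.Random(seed).shuffle(order)  # deterministic per-stream shuffle
    return order

def run_stream(order):
    # Placeholder: a real harness would submit each query to the SQL engine
    # and record its elapsed time.
    return [f"ran {q}" for q in order]

# e.g. a 4-stream concurrent throughput test
streams = [build_stream(seed) for seed in range(4)]
with ThreadPoolExecutor(max_workers=len(streams)) as pool:
    results = list(pool.map(run_stream, streams))
```

The key invariant is that every stream still executes the full query set; only the ordering (and the parameter values, in a real run) varies.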
The SQL used should not be rewritten. It is one thing to edit the SQL to use the double-pipe operator (||) instead of CONCAT (or vice versa), or to adjust date-function grammar to match the tool's syntax, but rewriting a ROLLUP query as a series of UNION statements hides the fact that the platform doesn't support OLAP SQL functionality. In the interests of honesty, all of this should be explicitly declared in the results, as every manual alteration of a query is a potential issue for any BI tool that auto-generates SQL.
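To show what such a rewrite looks like, here is a small illustration using Python's built-in sqlite3 and a made-up `sales` table. SQLite happens to lack ROLLUP support, which makes it a convenient stand-in for a platform where the rewrite is the only option: each aggregation level becomes its own GROUP BY, stitched together with UNION ALL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount INT);
    INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);
""")

# The rewritten form of GROUP BY ROLLUP(region): per-region subtotals plus
# a grand-total row (region = NULL), combined with UNION ALL. A platform
# with real OLAP support would accept the ROLLUP syntax directly.
rows = con.execute("""
    SELECT region, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, SUM(amount) FROM sales
""").fetchall()
```

With a multi-column ROLLUP the rewrite needs one UNION ALL branch per aggregation level, so the hand-edited query grows quickly, and a BI tool generating the ROLLUP form would simply fail.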
Dan also indicates that altering queries to re-order tables for join optimization should not be done either. I agree, but the statement, "Lacking an optimizer, Hadoop and Spark systems process tables in the order encountered in the SQL statement," is quite simply out of date. I know of at least one SQL on Hadoop solution that has this functionality (I'll leave you to guess which one). Well, actually, as it is the season of goodwill: many SQL on Hadoop platforms now have join order optimization. Kognitio had it for years before we could even run on Hadoop, Impala has it too, and it was introduced into Spark SQL in version 2.2 back in August.
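As a toy illustration of why this matters (all cardinalities and the selectivity below are invented numbers, not from any benchmark), a cost-based optimizer estimates intermediate-result sizes and reorders the join, rather than processing tables in the order they appear in the SQL:

```python
from itertools import permutations

# Made-up row counts for three TPC-DS-style tables, and a single assumed
# join selectivity; the point is only that different join orders produce
# very different intermediate-result sizes.
TABLES = {"store_sales": 1_000_000, "date_dim": 73_000, "item": 18_000}
SELECTIVITY = 1e-5  # assumed fraction of row pairs surviving each join

def cost(order):
    """Sum of estimated intermediate-result sizes for a left-deep join order."""
    rows, total = TABLES[order[0]], 0
    for t in order[1:]:
        rows = rows * TABLES[t] * SELECTIVITY  # estimated rows after this join
        total += rows                          # work grows with intermediates
    return total

written_order = ("store_sales", "date_dim", "item")  # as typed in the SQL
best_order = min(permutations(TABLES), key=cost)     # what an optimizer picks
```

Real optimizers use per-column statistics rather than one flat selectivity, which is one reason keeping stats up to date matters so much to query performance.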
Climbing down from my "high horse", there is no real substitute for running your own benchmark using your own data. To this end, having solutions that run on your existing Hadoop system means your data should be readily available to the platforms you want to evaluate. The first test can be how easy the software is to install, configure and get running on your system with your own data. This is an area Kognitio has improved for our upcoming 8.2.1 release. In the new year we also plan to make our benchmarking test suite available so you can build query sets of your own that can be randomized and run concurrently (watch this space).
Is tuning for your workload cheating?
This is an interesting question, and I'm afraid I have to disagree with Dan on this one. Perhaps I might be accused of moving into "bench-marketing" here, but given you have read this far, bear with me…
Is any organization not going to make the most of their chosen SQL on Hadoop solution by optimizing it for their current workload? Of course not. Although workload patterns change over time, they are unlikely to take a massive step change overnight. (Note I'm not talking about peak demand or end of quarter etc. – you should always include these in your benchmark testing, as they are part of your workload pattern.) Therefore I don't think it is cheating to make sure your system is optimized for the workload.
Let's consider an example from our benchmarks. For Kognitio we used some of the available RAM to build memory images of the data, and the rest to run queries. This is how we work and how we get performance. I don't see it as cheating; we have the same resources as the other platforms but utilize them differently, and often more effectively. For Impala we optimized the data to be stored in Parquet format for fast access when required. For all systems we ran stats updates, because that is what you would do in production. This meant Hive LLAP cached the data for faster access – that's how it gets better performance. How easy these optimizations are to complete, and how long they take to execute, should be considered too, and in future we will publish these details in our benchmarks.
We are obviously experts in our own product, but we genuinely do our best to get top performance from the other platforms we test, making use of published research and other benchmarks. Fundamentally I'm a mathematician – rigor is my thing. So if you read our benchmark papers and want to run them independently, or have suggestions for improving performance on Impala, Hive LLAP, Spark SQL or Presto, please get in touch with me. I would love to hear from you.
p.s. If my other half reads this – I honestly do not want a benchmark for Christmas. I have been hinting for months – think man, think.