Spark 1.6 comes with the CDH distribution but we also installed the Spark 2.0 beta available from the Cloudera.
In the previous blog we saw there had been a marked improvement in SQL maturity with Spark 2.0 (compared to 1.6) now supporting the SQL syntax of all 99 TPC-DS queries when functionally tested on a small 1GB data set. Therefore we are going to concentrate on Spark 2.0 performance here.
Challenges of running Spark 2.0 over 1TB
Getting the full TPC-DS benchmark to run on our 12 node cluster at the 1TB scale has proved challenging. Initially we went with running Spark 2.0 with the default “out of the box” configuration. This resulted in many queries failing to complete due to out of memory errors. We made a single configuration change increasing the spark.yarn.executor.memoryOverhead setting from the default 384MB to 1GB. This resulted in most of the queries executing although we still had 11 that would not complete. The results (seen on the right) clearly show Spark 2.0 is significantly slower than both Kognitio on Hadoop and Impala version 2.6 for the TPC-DS queries.
We must acknowledge that we need to configure Spark 2.0 further so that we can ensure more of the queries complete over 1TB within our cluster but this is an in-depth tuning exercise that needs to be carried out for our specific hardware configuration. This is made clear in similar TPC-DS benchmarks carried out by other researchers such as IBM fellow Berni Schiefer (click here).
Spark is certainly some way off the “deploy and go” experience you see with Kognitio which needs no configuration other than resource allocation via YARN and a one-off creation of memory images of the data. Assuming that we will get similar improvements when Spark 2.0 is deeply tuned the results are still going to be significantly slower than Kognitio and Impala.
While at the Spark Summit Europe this month I attended a talk given by Berni who has recently run the same TPC-DS benchmark over larger data (100TB) on a high performance system. It was interesting that some of the problematic queries in his benchmark are those that do not complete for Spark 2.0 on our system. This suggests that we are seeing comparable performance over our smaller system with similar processing issues. We plan to work through some of deep tuning he outlined and see the impact on our benchmark.
We are also currently carrying out the more realistic enterprise scenario of running multiple query stream performance benchmarks. This will be the subject of our next benchmark blog so watch this space.