Hive LLAP and Kognitio benchmarking using TPC-DS query set

SQL on Apache Hadoop benchmarks – Apache Hive LLAP and Kognitio 8.2 – tests conducted in August 2017

As Hadoop moves from evaluation into production, organizations are finding it difficult to obtain enterprise-level performance from the SQL tools available in the standard Hadoop distributions.

These issues include:

  • SQL maturity – some SQL on Hadoop tools cannot handle all of the SQL generated by developers and/or third-party tools. They might not support the SQL, or might produce very poor query plans
  • Query performance – for those queries that are supported, performance is poor even under a single-user workload
  • Concurrency – some tools have poor concurrency performance and give errors when under load
  • Mixed workload – most end users now expect “self-serve” BI. They use tools that allow them to drill into and analyse the data in an ad hoc way.

Hive LLAP

From version 2.6 onwards, Hortonworks ships Apache Hive 2 LLAP as part of its standard HDP distribution to try to address these issues. LLAP is intended to give HDP faster SQL performance than Hive on MapReduce (hereafter referred to as Hive MR).

We aimed to include Hive MR in a previous benchmarking whitepaper, but lack of SQL support and poor single thread performance meant it was removed from the testing.

With the release of Hive LLAP, we’ve now been able to include it in the TPC-DS benchmarking.

Kognitio

The latest release of Kognitio on Hadoop (version 8.2) is now available for download. This is a newer version than the one used in our previous benchmark tests.

In this whitepaper, we have performed a direct comparison of Kognitio against Hive LLAP using the full 99 query TPC-DS benchmark query set.

The TPC-DS benchmark is a well-respected, widely used query set that is representative of the types of query that tend to be most problematic. The TPC framework is also designed for benchmarking concurrent workloads.

The TPC-DS query generator was used in the benchmark to emulate a mixed workload. It randomises each query (via parameter selection) and also randomises the query submission order in each concurrent stream.
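The per-stream randomisation of submission order can be illustrated with a small sketch. This is not the TPC-DS generator itself (which also substitutes query parameters); the function name `build_streams` and the seeding scheme are assumptions made purely for illustration.

```python
import random

def build_streams(query_ids, n_streams, seed=0):
    """Return one independently shuffled copy of query_ids per stream."""
    streams = []
    for stream_no in range(n_streams):
        # A per-stream seed makes each ordering reproducible; different
        # seeds will almost certainly give different orderings.
        rng = random.Random(seed + stream_no)
        order = list(query_ids)
        rng.shuffle(order)
        streams.append(order)
    return streams

# 99 queries, 10 concurrent streams (the largest workload in this benchmark).
streams = build_streams(range(1, 100), n_streams=10)
print(streams[0][:5], streams[1][:5])
```

Every stream still contains all 99 queries exactly once; only the order differs, so at any moment the platform sees a mix of different queries rather than 10 copies of the same one.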

The platforms included in this benchmark are:

  • Apache Hive 2 LLAP (Hive 1.2.1)
  • Kognitio (version 8.2.0)

Each platform utilized the same 9 node AWS infrastructure: 1 master, 8 workers. We deployed this using the Hortonworks Data Cloud (HDC) available in Amazon Marketplace.

Hortonworks Data Cloud screenshot

By choosing the EDW:Analytics deployment option that comes with Apache Hive 2 already set up, we did not have to work through installation and configuration of HDP. Kognitio runs on any Hadoop distribution so was simply installed onto the same HDC system.

Can the platform run the queries?

For a SQL on Hadoop deployment to be successful, the ability to migrate existing workloads to run over data in Hadoop is essential.

The breadth of SQL supported by each platform was investigated. The queries supported by Hive MR are also presented for comparison. Each of the 99 TPC-DS queries was qualified as one of the following:

  • Runs ‘out of the box’ (no changes needed)
  • Minor syntax changes – such as removing reserved words and editing column aliases
  • No support – syntax not currently supported

Platform        Hive MR   Hive LLAP   Kognitio
Out of box           40          75         76
Minor changes        21          19         23
No support           38           5          0

Functional testing was carried out over 1GB of data. Hive LLAP was accessed through HiveServer2 interactive query, using beeline over a JDBC connection, and the table above shows that its syntax support is a big improvement over Hive MR. Hive LLAP supports all but 5 queries (94 of 99), compared with only 61 for Hive MR. Kognitio can execute all 99 TPC-DS queries.

Note – Hive MR is shown here only to illustrate the improved SQL support in HiveServer2. It was not assessed at scale.
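The totals quoted above follow directly from the SQL support table. The short sketch below reproduces the arithmetic; the dictionary layout is illustrative, and the zero for Kognitio under "no support" is inferred from the blank table cell and the statement that Kognitio executes all 99 queries.

```python
# Query counts taken from the SQL support table above.
support = {
    "Hive MR":   {"out_of_box": 40, "minor_changes": 21, "no_support": 38},
    "Hive LLAP": {"out_of_box": 75, "minor_changes": 19, "no_support": 5},
    # 0 is inferred: the table cell is blank and all 99 queries run.
    "Kognitio":  {"out_of_box": 76, "minor_changes": 23, "no_support": 0},
}

def supported(counts):
    """Queries that run either unchanged or after minor syntax changes."""
    return counts["out_of_box"] + counts["minor_changes"]

for platform, counts in support.items():
    total = sum(counts.values())
    print(f"{platform}: {supported(counts)} of {total} queries supported")
```

Each column sums to the full 99-query set, giving 61 supported queries for Hive MR, 94 for Hive LLAP and 99 for Kognitio.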

Can the platform perform at scale?

Running a single query stream at a 1TB scale is a starting point for assessing platform performance.

For Hive LLAP, the 1TB data set was held in Hive tables using an ORC format, with larger tables partitioned on date columns most frequently used in predicates within the queries. Bucketing was also used for join columns in the larger tables. For Kognitio, the data was held in view images with larger tables hashed on the main join columns.
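As a rough illustration of the Hive layout described above, the sketch below renders a HiveQL `CREATE TABLE` statement for an ORC table that is partitioned on a date key and bucketed on a join key. The helper `orc_table_ddl`, the reduced column list, the choice of `ss_sold_date_sk` and `ss_item_sk`, and the bucket count of 64 are all assumptions for illustration; the actual DDL used in the benchmark is not given in this whitepaper.

```python
def orc_table_ddl(table, columns, partition, bucket_col, buckets):
    """Render HiveQL for an ORC table that is partitioned and bucketed.

    columns   -- list of (name, type) pairs, excluding the partition column
    partition -- (name, type) of the partition column
    """
    # Hive declares the partition column in PARTITIONED BY,
    # not in the main column list.
    body = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {table} (\n  {body}\n)\n"
        f"PARTITIONED BY ({partition[0]} {partition[1]})\n"
        f"CLUSTERED BY ({bucket_col}) INTO {buckets} BUCKETS\n"
        f"STORED AS ORC"
    )

# Hypothetical, cut-down version of the TPC-DS store_sales fact table.
ddl = orc_table_ddl(
    table="store_sales",
    columns=[
        ("ss_item_sk", "BIGINT"),
        ("ss_customer_sk", "BIGINT"),
        ("ss_net_paid", "DECIMAL(7,2)"),
    ],
    partition=("ss_sold_date_sk", "BIGINT"),
    bucket_col="ss_item_sk",   # assumed join column
    buckets=64,                # assumed bucket count
)
print(ddl)
```

Partitioning on the date key lets Hive skip whole partitions when a query filters on date, while bucketing on the join key co-locates rows that will be joined together.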

For a single query stream Kognitio outperforms Hive LLAP as can be seen from the overview table below and the speed comparisons in figure 1.

Platform              Hive LLAP   Kognitio
Queries run                  92         99
Long / error                  2          0
No support                    5          0
Fastest query count          11         88
Query overview – single stream at 1TB

At 1TB scale there were 2 queries that failed to complete in Hive LLAP. Logs suggest that this was due to a lack of memory resource available to complete the query. Kognitio runs all 99 queries without issue at this scale and is faster than Hive LLAP for 88 out of 99 queries.

Figure 1: speed comparison single stream at 1TB

Figure 1 gives a more detailed breakdown of query performance, showing the relative speed of Kognitio (blue) and Hive LLAP (yellow). Each query is represented by a horizontal block, and the faster platform for a given query gets the larger proportion of the block. The more one color dominates, the greater that platform's performance advantage.
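The whitepaper does not state the exact formula behind the block proportions. One plausible way to compute such a split, shown here purely as an assumption, is to give each platform a share proportional to the other platform's elapsed time. The function name and the timings below are hypothetical.

```python
def speed_shares(kognitio_secs, hive_secs):
    """Split one block between two platforms; the faster gets the larger share."""
    total = kognitio_secs + hive_secs
    # Each platform's share is proportional to the *other* platform's run time,
    # so a shorter run time produces a larger share.
    return hive_secs / total, kognitio_secs / total

# Hypothetical timings: Kognitio 10s, Hive LLAP 30s (3x slower).
k_share, h_share = speed_shares(kognitio_secs=10.0, hive_secs=30.0)
print(f"Kognitio {k_share:.0%}, Hive LLAP {h_share:.0%}")  # Kognitio 75%, Hive LLAP 25%
```

Under this scheme a query that runs equally fast on both platforms yields a 50/50 block, and a 10x speed advantage yields roughly a 91/9 split.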

The solid blue block at the top of figure 1 represents the 7 queries that Kognitio can run but Hive LLAP does not support or errored at 1TB. Kognitio runs 81 out of the remaining 92 queries faster than Hive LLAP with 12 queries over 10x faster.
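The single-stream figures quoted in the text and in the overview table can be reconciled with a few lines of arithmetic. The variable names are illustrative; the numbers are those reported above.

```python
TOTAL_QUERIES = 99

# Hive LLAP at 1TB, single stream.
hive_run, hive_failed, hive_unsupported = 92, 2, 5

# Queries only Kognitio completes (the solid blue block in figure 1).
kognitio_only = hive_failed + hive_unsupported

# Head-to-head results over the queries both platforms complete.
kognitio_faster_head_to_head = 81
hive_fastest = hive_run - kognitio_faster_head_to_head

kognitio_fastest = kognitio_only + kognitio_faster_head_to_head

print(kognitio_only, kognitio_fastest, hive_fastest)  # 7 88 11
```

This matches the "fastest query count" row of the overview table: 88 for Kognitio and 11 for Hive LLAP.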

How does the platform perform under load?

To support enterprise level “self-serve” BI it is essential that SQL on Hadoop platforms can perform under mixed concurrent workloads.

The TPC-DS 1TB benchmark was run under increasing workloads up to 10 query streams as defined in the TPC-DS documentation. An overview of results is given in the table below.

Platform              Hive LLAP   Kognitio
Queries run                  89         98
Long / error                  5          1
No support                    5          0
Fastest query count          15         83
Query overview – 10 streams at 1TB

When concurrency was increased to 10 streams, the memory resource on the edge node was insufficient to support the 10 instances of beeline running the Hive LLAP query scripts. The size of the edge node was increased to work around this. More investigation is required to resolve the utilization of memory on the edge node during the Hive LLAP benchmark.

The longest running queries (>10 mins for a single stream or >1hr for 10 streams) had to be removed from the 10 stream benchmark as these severely skewed all the concurrent performance. For Kognitio this was just a single query leaving 98 in the benchmark. For Hive LLAP, 3 long-running queries were removed leaving 89 queries in total.
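The exclusion rule above can be expressed as a simple filter. This is a minimal sketch: the thresholds come from the text, but the function name, query labels and timings are hypothetical.

```python
SINGLE_STREAM_LIMIT_SECS = 10 * 60   # more than 10 minutes in a single stream
TEN_STREAM_LIMIT_SECS = 60 * 60      # more than 1 hour under 10 streams

def is_long_running(single_secs, ten_stream_secs):
    """True if a query exceeds either threshold and should be excluded."""
    return (single_secs > SINGLE_STREAM_LIMIT_SECS
            or ten_stream_secs > TEN_STREAM_LIMIT_SECS)

# Hypothetical timings: (single-stream seconds, 10-stream seconds).
timings = {
    "q_a": (45.0, 300.0),     # within both limits, kept
    "q_b": (720.0, 2000.0),   # over the single-stream limit
    "q_c": (200.0, 4000.0),   # over the 10-stream limit
}

kept = [q for q, (single, ten) in timings.items()
        if not is_long_running(single, ten)]
print(kept)  # ['q_a']
```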

From the results table it is clear that Kognitio outperforms Hive LLAP under concurrent load, being the faster platform for 83 of the 99 queries.

Figure 2: speed comparisons 10 streams at 1TB

Figure 2 shows the speed comparisons between Hive LLAP and Kognitio for the 10 stream benchmark. It shows that Kognitio is faster in 74 out of the 89 queries that Hive LLAP can run. There are 5 queries where Kognitio is over 10x faster. Hive LLAP is faster in 15 queries.
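The 10-stream counts in the text and the overview table fit together as follows. The sketch assumes that the 89 queries Hive LLAP ran are all among the 98 that Kognitio ran; the whitepaper does not state this explicitly, but the totals are consistent with it.

```python
# Queries included in the 10-stream benchmark for each platform.
kognitio_run, hive_run = 98, 89

# Head-to-head results over the 89 queries Hive LLAP ran.
kognitio_faster_shared = 74
hive_faster = hive_run - kognitio_faster_shared

# Assumption: Hive LLAP's 89 queries are a subset of Kognitio's 98,
# so the remainder were run by Kognitio alone.
kognitio_only = kognitio_run - hive_run

kognitio_fastest = kognitio_faster_shared + kognitio_only

print(hive_faster, kognitio_only, kognitio_fastest)  # 15 9 83
```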

Conclusion

Kognitio outperforms Hive LLAP in both single and multi-stream mixed SQL workloads. Mature SQL optimization, use of machine code generation and efficient use of memory and CPU resource mean Kognitio continues to be the most performant SQL on Hadoop offering for enterprise level mixed workloads.

For this benchmark our testing was carried out on AWS instances. Our previous benchmark of Kognitio, Impala and Spark was run on in-house infrastructure, so the 2 sets of timings are not directly comparable. However, the results indicate that Hive LLAP has a performance level similar to Impala, with Kognitio outperforming both platforms by a similar margin in each benchmark.

Technical Details

Details of the infrastructure utilized for this benchmark, along with timings for individual queries, can be found here.

Future Testing

We will be updating the benchmarks for Impala and Spark to utilize the same AWS infrastructure and latest software versions so that direct comparisons can be made between all platforms. We also plan to include Presto.

Evaluating higher concurrency is also in the future plans. It will be investigated in 2 ways: a smaller data set (300GB) on the same system and the same 1TB data on a larger system.
