The DataWorks Summit in San Jose was held on June 13-15, and this blog post summarises interesting talks at the event.

Keynote section

Sumeet Singh (Yahoo)

Sumeet talked about Yahoo’s migration from MapReduce jobs to those running on Tez on the 39K+ nodes that they use for Hadoop processing with over 1M jobs per day. In the last year, MapReduce jobs have dropped from about 70% of the workload to around 10%, with Tez moving in the opposite direction (Spark job level remaining consistent). This has also allowed improved utilisation of the compute and storage infrastructure.

In addition, updating the version of Hive in use has led to a significant lowering of latency for analytics jobs. The following slide shows how the most common query runtime is now in the 5-30 second range (for about 1.2 million queries per month, out of a total of 2 million per month), although you can see how this increases with the number of records involved – on the right-hand side of the chart are the jobs which take over 5 minutes as the average number of records involved rises to around 15 billion.

Girish Mundada (HPE)

Girish had previously been a Postgres developer, and highlighted a number of lessons learned from this career:

  • databases never finish (in terms of development, not individual query runtime)
  • there a multiple ways to solve the same problem
  • use the right set of tools for the job (which seems relevant given the number of SQL on Hadoop alternatives that exist, and the possibility of deploying multiple of these solutions on your Hadoop cluster)

The real crux of his talk was explaining the complexity of Hadoop for many companies, and hence the benefit of using HPE or some other vendor (or indeed cloud infrastructure company) to simplify the process for these companies.

Hadoop Query Performance Smackdown (Michael Fagan, Comcast)

Michael and his co-speaker talked about their benchmarking of SQL on Hadoop solutions, using 66 of the TPC-DS queries and a variety of file formats.

The platform used was 11 worker nodes with 128GB RAM and 32 cores each, plus 5 master nodes with 90GB  RAM and 32 cores each. The system had HDP 2.6 installed.

They chose to impose a 10 minute penalty for any failing queries, and all the engines used in their test had failures (from 1 for Hive LLAP, to 8 for Hive on MapReduce). They had issues with the Spark Thrift Server which led to very inconsistent timings for SparkSQL and long garbage-collection pauses – their feedback on this was to wait for improvements rather than rule this out permanently based on current performance.

From their timings, LLAP came out best, just ahead of Presto for the SQL engines (the latter having issues with date-related casting which was a big part of its 5 failing queries). Of the 66 queries, LLAP was fastest for 44, Presto for 16, and Tez for 6. They viewed LLAP and Presto as solid, with no major issues in their 3 months of testing.

On file formats, Bzip compressed text and sequence files performed the worst, which was a caveat against just dumping existing data into Hadoop using these formats. ORC Zlib was the benchmark winner, just ahead of Parquet.

In response to questions from the audience, the lack of concurrency in the tests was mentioned, as was the subset of TPC-DS queries run. Kognitio’s own benchmarking using TPC-DS queries did use concurrency, and did use all the queries – more information can be found at

Tez Shuffle Handler: Shuffling At Scale with Apache Hadoop (John Eagles, Yahoo)

John used a Yahoo quote early in his talk, “When you run at scale, there are no corner cases”. He then discussed a number of rare performance issues seen as Yahoo increased adoption of Tez internally, where the sophistication in Tez had outgrown the MapReduce shuffle handler.

For example, the slide on the right shows a problem with auto-reduce when an operation was expected to use 999 reducers but ended up with just 1. This ended up having to retrieve data from 999 partitions on each of 4300 mappers, for a meagre total of 450MB. Due to the number of operations, this shuffle took 20 minutes. So Yahoo introduced a composite or ranged fetch to allow multiple partitions to be retrieved in one operation, reducing the shuffle time to 60 seconds.

Similar issues were also seen with e.g. an auto-reduce merge – this time the composite fetch only sped up the operation from 50 minutes to 20 minutes as the real problem was an inefficiency in the way merges with a massive number of inputs (17 million for the query in question) were handled, and fixing this reduced the shuffle time to 90 seconds.

ORC File – Optimizing Your Big Data (Owen O’Malley, Hortonworks)

Owen discussed the importance of stripe size with the default being 64MB. The setting is a trade-off with larger stripes giving large, more efficient reads, but smaller stripes requiring less memory and giving more granular processing splits. When writing multiple files concurrently the strip size is automatically shrunk, but sorting dynamic partitions means only one writer is active at a time.

He also covered HDFS block padding settings to align stripes with HDFS blocks, which gives a performance win at the cost of some storage inefficiency.

Predicate push down was covered, allowing parts of files to be skipped which cannot contain valid rows. ORC indexes at the file, stripe and row group (10K rows) level, to allow push down at various granularities.

Sorting data within a file rather than creating lots of partitions allows row pruning, and bloom filters can be used to improve scan performance. Again, there is a trade-off between the space used for these filters and their efficiency, which can be controlled via a parameter. A good example of row pruning occurs in TPC-DS with a literal predicate on the line item table – using no filters, 6M rows are read, with just the min/max metadata this is reduced to 540K rows, and with bloom filters it drops to 10K rows.

Column encryption is supported, allowing some columns of a file to be encrypted (both data and index). The user can specify how data is treated when the user does not have access (nullify / redact / SHA256).

An Overview On Optimization In Apache Hive: Past, Present, Future (Hortonworks)

This talk mentioned the use of multiple execution engines (Tez, Spark), vectorized query execution to integrate with columnar storage (ORC, Parquet), LLAP for low latency queries.

It covered the need for a query optimizer, and the challenge between plan generation latency and optimality. From 0.14.0, Hive has used Calcite for its logical optimiser, and gradually shifted logic from Hive to Calcite. The slide on the right shows some of the optimizer improvements made in this period.

For logical optimization there are rule-based and cost-based phases, with over 40 different rewriting rules (including pushdown filter predicates, pushdown project expressions, inference or new filter predicates, expression simplification, …). The rules also allow queries Hive could not otherwise execute to be transformed into an executable representation – e.g. queries with INTERSECT, EXCEPT, … will be rewritten to use JOIN, GROUP BY, …

Calcite’s join reordering also allows bush query plan to be generated (e.g. join table A and B, then C and D, then join the results together, rather than just adding a table to the already-joined results each time).

Work is in progress on materialized view support, and future plans include collecting column statistics automatically, making better estimates of number of distinct values, and speeding up compilation. There should be an update on Hive performance on the Hortonworks blog in the next few weeks.

Running A Container Cloud On Yarn (Hortonworks)

Hortonworks builds, tests and releases open source software. As such, it does dozens of releases a year, with tens of thousands of tests per release across over a dozen Linux versions and multiple back-end databases. Therefore, they are looking to reduce overhead, and achieve greater density and improved hardware utilization.

Using a container cloud eliminates the bulk of virtualization overhead, improving density per node. Containers also help reduce image variance through composition. Startup time is fast, as there is no real boot sequence to run.

The building blocks for this cloud are:

  • YARN Container Runtimes – enable additional container types to make it easier to onboard new applications/services.
  • YARN Service Discovery – allow services running on YARN to easily discover one another.
  • YARN Native Services – enable long running YARN services.

For service discovery, the YARN Service Registry allows applications to register themselves, allowing discovery by other applications. Entries are stored in Zookeeper. The registry entries are exposed via the YARN DNS server, which watches the registry for changes and creates the corresponding DNS entry at the service level and container level.


The Columnar Roadmap: Apache Parquet and Apache Arrow (Julien Le Dem, Dremio)

Parquet and Arrow provide columnar storage for on-disk and in-memory respectively. The former has a focus on reading, with the expectation that data will be written once and read many times, whereas the latter is often for transient data and aims for maximisation of CPU throughput via efficient use of CPU pipelining, SIMD, and cache locality (which columnar structures support given that all the values for a given column are adjacent rather than interleaved with other columns).

The trade-off can be seen with e.g. Arrow having data in fixed positions rather than saving the space for NULL values, which gives better CPU throughput at the cost of some storage overhead.

The goal is for projects to adopt e.g. Arrow as a hub for interoperability, removing the need for duplicating functionality in many different projects, and also avoiding CPU costs of serialising/deserialising when moving data between different projects.

Exhibition Hall

As usual, a lot of people came to the Kognitio stand to talk about their practical problems in running SQL on Hadoop. Typically these revolve around connecting a lot of business users with their favourite tool (e.g. Tableau, MicroStrategy) to data stored in Hadoop. With a small number of users they tend to see issues, and they are then looking for a product which can give query performance on Hadoop with much higher levels of concurrency.

As mentioned in the query performance smackdown commentary above, this whitepaper has good information on benchmarking Impala, SparkSQL and Kognitio for industry-standard queries, including running with concurrency rather than a single stream, so if you read this post and have an SQL on Hadoop issue, that is a good reference point to start with.