The Strata Data Conference was held at the ExCeL in London this week. Sessions thatRead More
A Review of Strata Data Conference, London 2017
What Kaggle has learned from almost a million data scientists (Anthony Goldbloom, Kaggle)
This was part of the keynote on the first day. Kaggle is a platform for data science competitions, and have had almost 1 million users participate. Over 4 million models have been submitted to competitions, and this presentation covered off the traits Kaggle have seen for successful entrants.
In particular, for structured data the trend is for competitors to initially explore the data via histograms etc. to get a better understanding of it, then create and select features for use in their approach, which typically involves a classifier. The work on features is more important than the choice of classifier, and successful competitors are usually very creating in choosing features (e.g. car colour type for predicting car resale value), and persistent as most intuitions around feature selection/creation prove to have little correlation with the end goal. Finally, the best competitors tend to use version control for their models, to make it easier to track the success/failure of each approach.
Creating real-time, data-centric applications with Impala and Kudu (Marcel Kornacker, Cloudera)
The room was packed for this session as Marcel gave some background on Kudu (a relational store that can be used as an alternative to HDFS) and Impala. He explained that Kudu avoided the limitations on delete, update and streaming inserts seen with HDFS, and the poor full table scan performance of HBase. As Kudu does not use HDFS for persistence, it handles its own replication, although this means it can’t e.g. benefit from the reduced storage overhead planned for HDFS in Hadoop 3. One workaround would be to only keep e.g. the latest 12 months worth of data in Kudu, and push older data into HDFS to benefit from its reduced storage overhead when that is available.
Kudu has been tested up to 275 nodes in a 3PB cluster, and internally uses columnar format when writing to disk, having collected records in RAM prior to this transposition. It allows range and hash partitioning to be combined. For example, you could use range partitioning by date, but then hash within a date to keep some level of parallelism when dealing with data for one particular date. Currently it only supports single-row transactions but the roadmap includes support for multi-row. From the examples given it appears there are some local predicates it cannot handle (e.g. LIKE with a regular expression), and batch inserts are reportedly slow. Multi-versioning is used as with Kognitio and many other products.
Impala can use all three of those storage options (Kudu, HDFS, HBase), has over 1 million downloads, and is reportedly in use at 60% of Cloudera’s customers.
Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop (Marcel Kornacker, Cloudera)
Marcel started by going through some performance benchmarks involving Impala, and highlighted the importance of moving beyond single user benchmarks.
He then moved onto some techniques for improving Impala performance including:
Partitioning: partitions can be eliminated by join lookup to generate run-time filters (what Kognitio used to call spider joins) – so if joining fact and date tables on a date key and having a local predicate on the data table, then that predicate can be used to generate a list of relevant date keys, and that filter can be applied to the fact table before the join. This appeared to be the biggest factor in Impala’s TPC-DS performance improvements in early 2016. Marcel advised regularly compacting tables to keep file and partition sizes as small as possible, and gave general advice to stick with less than 20,000 partitions (too few and you don’t eliminate enough data with filters, too many and you lose parallelism and put extra load on name node etc.). As in the example above, partition on join keys to get benefit from run-time filters.
Sorting: this will be added in the next release. It is particularly useful as Parquet can store stats on e.g. min and max values within a page, so sorting can help eliminate some of those pages when there are too many column values for partitioning.
Use appropriate data types: some operations are a lot more expensive with different data types (e.g. strings), so try to avoid using these expensive data types. Kognitio used to offer similar advice to customers before modifying their product to make operations like string aggregation as efficient as integer aggregation.
Complex schemas: parent-child relationships with nested collections offer physical colocation of data, giving a natural optimization. Need to use columnar storage for this approach as resulting tables are wide.
Statistics: it takes a long time to collect these, so customers often ask if they can get the same effect with using e.g. optimiser hints to determine the order of joins. That is not the case, as statistics are used for far more than determining join ordering – e.g. scan predicates are order by selectivity and cost, the selectivity of scans is computed, join sides need to be determined for efficiency, join type needs to be decided, a decision on whether to apply run-time filters needs to be made (as presumably the cost of generating and applying these can be significant). The ability to collect statistics on samples is being added, which would speed up the stats collections.
A deep dive into Spark SQL’s Catalyst optimizer (Herman van Hovell tot Westerflier, Databricks)
In an early slide entitled “Why structure” Herman showed the performance benefit of using SQL for a simple aggregation tasks rather than working directly on RDDs with code. He then outlined the approach for query optimization used by Catalyst, from ensuring the query was valid syntactically and semantically, generating a logical plan, optimizing that plan, then generating physical plans which have their cost compared until a final physical plan is selected.
He discussed the use of partial functions to specify transformations of plans (e.g. constant folding), and then showed how it was possible to write your own planner rules to be incorporated into optimization.
It wasn’t clear to me how the optimizer deals with having a vast number of potential rules to apply, with some being inapplicable at one stage in optimization, but then being valid later on in the process after other rules have been applied.
Journey to AWS: Straddling two worlds (Calum Murray, Intuit)
The key takeaways were:
- use a tip-toe approach rather than big-bang. Be in a position where you can flip back and forth from cloud for an application, then swap to cloud initially for a small number of minutes, gradually increasing to 24 hours or more.
- swim-lane applications first, if possible, to allow for this approach (so you are only exposing a small subset of users to the change initially).
- consider security implications – in their case, with over 10 million customers, they had to put extra encryption in place, use different encryption keys for on-premise and cloud, etc.
You can find speaker slides and videos from the conference at https://conferences.oreilly.com/strata/strata-eu/public/schedule/proceedings
Conversations at the Kognitio stand in the exhibition hall reflected an increasing interest in how to address SQL on Hadoop problems with company’s installed Hadoop systems. This was a marked step forward from last year’s conference where a lot of the conversations were with companies that were just starting out with Hadoop, and hadn’t done enough with Hive/Impala/Spark SQL to have encountered significant problems that they needed to solve.