
BI on Hadoop is possible, if you select the right SQL technology

So you have Hadoop up and running and data loaded. Your data scientists and software developers are happily writing programs to analyze the data, but you now want to provide access to a much wider audience, so that people across your business can ask ad-hoc questions of the data whenever they need to. These people probably already use a BI visualization tool that they really like, such as Tableau, Qlik or MicroStrategy to name a few, so ideally you’d want to be able to connect that tool directly to Hadoop.

Depending on where you are on this path, you will probably fall into one of two camps:

1. The “can’t be done” camp
or
2. The “my Hadoop distribution comes with an SQL interface so I will just use that” camp

If you are in camp 1, then you will probably have already tried to do this and failed, or had significant problems and given up, or maybe you’ll have read one of the many articles that claim that Hadoop cannot be used as a BI platform.

If you are in camp 2, then you either have very understanding BI users, or have yet to try in earnest with high volumes of concurrent users.

The perception that Hadoop can’t be used for BI, and the belief that the SQL interface provided with the distribution will be sufficient, both stem from a lack of understanding of where the problem lies. The fault is not with Hadoop itself. SQL on Hadoop technologies vary widely in their ability to support a BI workload; not all of them are equal. Some are good at running SQL queries against huge data sets but are generally slow and best suited to batch operations. Others are designed specifically for very high-speed interactive querying at high concurrency levels, and can absolutely support a pervasive self-service BI workload.

In my opinion, these are the things you should consider when selecting an SQL interface for your Hadoop cluster:

1. You don’t actually have to choose

Now that all Hadoop distributions include the YARN resource manager, multiple SQL on Hadoop technologies can easily be deployed and used on the same cluster, allowing the best solution to be adopted for each use case.

2. SQL Coverage

SQL is a very large and complicated standard that covers a huge amount of query functionality. It first appeared as an ANSI standard in 1986 and was revised and extended in 1989, 1992, 1999, 2003, 2006, 2008 and 2011. Few products support everything, as quite a few of the revisions contain functionality that was never widely adopted, and this makes it difficult to compare products on ANSI standard compliance alone. Supporting only ANSI SQL-89, however, is generally insufficient, because important data analytics functionality, such as window functions, was not introduced until the later standards. Also, standards compliance only tells you that a technology will accept a particular syntax, not how well or efficiently it will execute the actual query.

Instead, I would recommend looking at a technology’s ability to execute the 99 queries of the industry-standard TPC-DS benchmark. Between them, those 99 queries cover most of the functionality that matters in data analytics, and running them also provides an indication of performance. Couple this with a more subjective look at a technology’s SQL maturity, and in particular its parallelization pedigree, and you get a better view. Hadoop is, after all, a massively parallel platform, and SQL queries need to use that parallelism efficiently to have any chance of supporting BI on Big Data.
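As a concrete illustration of why coverage of the later standards matters, the sketch below uses a window function, the kind of construct that TPC-DS leans on heavily. The store_sales table and its columns here are hypothetical, simplified stand-ins rather than the benchmark’s actual schema; an engine whose SQL support stops at the early standards will typically reject the OVER clause outright.

    -- Rank each store's quarters by revenue using a window function
    -- (illustrative schema, not the real TPC-DS tables).
    SELECT store_id,
           sale_quarter,
           SUM(net_sales) AS revenue,
           RANK() OVER (PARTITION BY store_id
                        ORDER BY SUM(net_sales) DESC) AS quarter_rank
    FROM   store_sales
    GROUP  BY store_id, sale_quarter;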

3. Query Performance

Although SQL on Hadoop technologies are all trying to implement a standard language for accessing data stored in Hadoop, under the covers they use very different techniques to provide that access, and this leads to dramatically different performance profiles. Some will query data directly from disk, some will dynamically cache portions of the data in memory, and some will “pin” images of data directly into memory. Some will re-structure the data to improve certain types of query, and some will use advanced techniques to improve code efficiency. While it is possible to broadly classify a technology as faster or slower, bear in mind that relative performance varies with query type: a technology may perform well for simple “select” queries, for example, but really struggle as query complexity increases.

Visualization tools such as Tableau, however, tend to produce fairly complex queries. Unfortunately there is no easy way to ascertain which technology will perform best for your BI workload. Again, the TPC-DS benchmark can provide a good initial guide, but ultimately you will need to test with a workload representative of your own. Bear in mind point 1 above: consider installing multiple SQL technologies and letting time show you which works best for each of your workloads. (One more thing to mention here: use of memory. It is now widely accepted that if you want to run SQL queries at a speed that can support interactive BI, the queries need to execute against data held in memory; disks are just way too slow, even SSDs. I have previously written in detail about in-memory here, but suffice it to say, simply moving data from disk to memory does not make a system “in-memory”, and a cache does not give the same results as data “pinned” in memory.)
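To make the cache-versus-pinned distinction concrete, here is a minimal sketch using Spark SQL’s CACHE TABLE syntax; the table and column names are hypothetical. CACHE TABLE materializes data in memory eagerly, but the result is still a cache: under memory pressure, partitions can be evicted and transparently re-read from disk, so query times can still fluctuate in a way that a pinned image is designed to avoid.

    -- Eagerly materialize a working set in memory before a burst of
    -- interactive queries (Spark SQL; names are illustrative).
    CACHE TABLE sales_2017 AS
    SELECT * FROM sales WHERE sale_year = 2017;

    -- Interactive queries now read the in-memory copy, provided it
    -- has not been evicted in the meantime.
    SELECT region, SUM(net_sales) FROM sales_2017 GROUP BY region;

    -- Release the memory once the session is finished with it.
    UNCACHE TABLE sales_2017;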

4. Concurrency

To support pervasive self-service BI, the SQL on Hadoop technology must be able to maintain performance even when asked to run many different queries at the same time. Good single-query performance does not mean a technology will cope well with concurrent workloads; concurrency support is very hard to achieve and tends to come with product maturity. The TPC-DS benchmark does include concurrency tests (called multi-stream), but while vendors have published plenty of single-stream results, very few show multi-stream results. Kognitio has published both single-stream and multi-stream results for the major SQL on Hadoop technologies here.

5. Ease of adoption

As most SQL on Hadoop technologies are either open-source or free-to-use, there are now few financial barriers to adoption. However, previous poor experiences with new technologies on existing Hadoop installations have made organisations cautious about trying new stuff. Development or test clusters can obviously be used, but these are often small and unrepresentative. In these circumstances, cloud-based testing can work very well. For temporary test systems, Amazon spot pricing is great and allows significant clusters to be built at surprisingly low cost. All of the main Hadoop distributions now support cloud-based installations.

The following table lists the main SQL on Hadoop technologies available today, with some indication of how I believe they compare against the above criteria. The comparison is mostly based on the comprehensive benchmarking exercises we have undertaken, but also on some more subjective analysis. If you have a contrary opinion, I’d be happy to hear your input, so go ahead and comment on this blog.

| Technology | Free to adopt | Hadoop distro | High concurrency support | SQL coverage & maturity | Performance | Memory layer | Comments |
|---|---|---|---|---|---|---|---|
| Kognitio | Yes | All | High | High | Very high | Pinned images | Kognitio Analytical Platform ported to Hadoop |
| Hive | Yes | All | Low | Med | Very low | No | The metadata provider |
| Impala | Yes | CDH | Low | Med | Med | Block cache | Not a YARN app |
| Hive LLAP | Yes | HDP | Low | Med | Med | Cache | Needs to warm up |
| Presto | Yes | All | Low | Low | Low | Some caching | Not a YARN app. Lots of connectors |
| SparkSQL | Yes | All | Low | Med | Low | Inefficient images | Go-to app for analytics |
| BigSQL | No | All (via Hortonworks partnership) | Med | High | Med | No | Spark, HBase and IBM DB2 |
| Drill | Yes | All | Low | Low | Low | No | Federation layer |



