BI on Hadoop is possible, if you select the right SQL technology
So you have Hadoop up and running and data loaded. Your data scientists and software developers are happily writing programs to analyze the data, but you now want to provide access to a much wider audience so that people across your business can ask ad-hoc questions of the data whenever they need to. These people probably already use a BI visualization tool that they really like, e.g. Tableau, Qlik or MicroStrategy to name a few, so ideally you’d want to be able to connect that tool directly to Hadoop.
Depending on where you are on this path you will probably fall into one of two camps.
1. The “can’t be done” camp
2. The “my Hadoop distribution comes with an SQL interface so I will just use that” camp
If you are in camp 1, then you will probably have already tried to do this and failed, or had significant problems and given up, or maybe you’ll have read one of the many articles that claim that Hadoop cannot be used as a BI platform.
If you are in camp 2, then you either have very understanding BI users, or have yet to try in earnest with high volumes of concurrent users.
The perception that Hadoop can’t be used for BI, and the belief that the SQL interface provided with the distribution will be sufficient, both stem from a lack of understanding of where the problem lies. The fault is not with Hadoop: different SQL on Hadoop technologies vary widely in their ability to support a BI workload, and not all of them are equal. Some are good at running SQL queries against huge data sets but are generally slow, making them best suited to batch operations. Others, however, are designed specifically for very high-speed interactive querying at high concurrency levels, and can absolutely support a pervasive self-service BI workload.
In my opinion, these are the things you should consider when selecting a SQL interface for your Hadoop cluster:
1. You don’t actually have to choose
Now that all Hadoop distributions include the YARN resource manager, multiple SQL on Hadoop technologies can easily be deployed and used on the same cluster, allowing the best solution to be adopted for each use case.
2. SQL Coverage
SQL is a very large and complicated standard that covers a huge amount of query functionality. The standard first appeared in 1989 as an ANSI standard, and it was revised and extended in 1992, 1999, 2003, 2006, 2008 and 2011. Few products support everything, as quite a few of the revisions contain functionality that nobody ever adopted. This makes it difficult to compare products using ANSI standard compliance. Supporting ANSI 89, however, is generally insufficient, as important data analytics functionality was not introduced until the later standards. Also, standards compliance only tells you that a technology will accept a particular syntax, not how well or efficiently it will execute the actual query.
Instead I would recommend looking at a technology’s ability to execute the 99 queries of the industry standard TPC-DS benchmark. Within its 99 queries, this benchmark covers most of the important stuff used in data analytics, as well as providing an indication of performance. Couple this with a more subjective look at a technology’s SQL maturity, and in particular its parallelization pedigree, and you get a better view of performance. Hadoop is, after all, a massively parallel platform and SQL queries need to efficiently use that parallelization to have any chance of supporting BI on Big Data.
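As a concrete illustration of why ANSI 89 support is not enough: window functions, standardised in SQL:2003, appear throughout TPC-DS-style analytic queries but simply do not exist in ANSI 89. The sketch below runs one against Python’s built-in sqlite3 engine (the `sales` table and its data are hypothetical, purely for demonstration):

```python
import sqlite3

# In-memory SQLite database with a small, hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "widget", 100.0), ("EMEA", "gadget", 250.0),
     ("APAC", "widget", 300.0), ("APAC", "gadget", 150.0)],
)

# RANK() OVER (...) is a window function: standard since SQL:2003,
# absent from ANSI 89, and ubiquitous in TPC-DS-style analytics.
query = """
    SELECT region, item, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
"""
ranks = {(region, item): rnk
         for region, item, amount, rnk in conn.execute(query)}
print(ranks)  # the top seller in each region gets rank 1
```

A SQL-on-Hadoop engine that cannot parallelize this kind of partitioned ranking efficiently will struggle with the queries BI tools actually generate, whatever its headline compliance claims.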
3. Query Performance
Although SQL on Hadoop technologies are all trying to implement a standard language for accessing data stored in Hadoop, under the covers they use very different techniques to provide that access, and this leads to dramatically different performance profiles. Some will query data directly from disk, some will dynamically cache portions of the data in memory, and some will “pin” images of data directly into memory. Some will re-structure the data to improve certain types of query, and some will use advanced techniques to improve code efficiency. While it is possible to broadly classify a technology as faster or slower, it is important to bear in mind that relative performance will vary depending on query type; for example, a technology may perform well for simple “select” queries, but really struggle as the query complexity increases.
Visualization tools such as Tableau, however, tend to produce fairly complex queries. Unfortunately there is no easy way to ascertain which technology will perform best for your BI workload. Again, the TPC-DS benchmark can provide a good initial guide, but ultimately you will need to test it yourself using a representative workload. Bear in mind point 1 above: consider installing multiple SQL technologies and letting time show you which works best for each of your workloads. One more thing to mention here is the use of memory. It’s now widely accepted that if you want to run SQL queries at a speed that can support interactive BI, then the queries will need to be executed from data held in memory; disks are just way too slow, even SSDs. I have previously written in detail about in-memory here, but suffice it to say, simply moving data from disk to memory does not make a system “in-memory”, and a cache does not give the same results as data “pinned” in memory.
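The gap between a trivial scan and the join-and-aggregate shapes that BI tools emit is easy to see even on a toy engine. This sketch times both against the same data using Python’s sqlite3 (the `orders`/`customers` schema is hypothetical; absolute numbers are meaningless, only the principle of timing representative query shapes matters):

```python
import sqlite3
import time

# Hypothetical star-schema fragment: a fact table and a dimension table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, cust INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, float(i % 7)) for i in range(20000)])
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, "EMEA" if i % 2 else "APAC") for i in range(100)])

def timed(sql):
    """Run a query and return (elapsed seconds, row count)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, len(rows)

# A simple "select"-style scan versus a BI-style join + aggregate.
simple_t, _ = timed("SELECT COUNT(*) FROM orders")
complex_t, n_groups = timed("""
    SELECT c.region, COUNT(*), SUM(o.amount)
    FROM orders o JOIN customers c ON o.cust = c.id
    GROUP BY c.region
    ORDER BY SUM(o.amount) DESC
""")
print(f"simple scan:      {simple_t:.4f}s")
print(f"join + aggregate: {complex_t:.4f}s over {n_groups} groups")
```

Swap the connection and queries for your own engine and workload and the same harness gives you a first-order picture of how performance degrades with query complexity.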
4. Concurrency
To support pervasive self-service BI, the SQL on Hadoop technology must be able to maintain performance even when it is being asked to run many different queries at the same time. Good single-query performance does not mean it will necessarily cope well with concurrent workloads. Concurrency support is very hard to achieve and tends to come with product maturity. The TPC-DS benchmark does have tests for concurrency (called multi-stream), but while there are lots of published results from vendors for a single query stream, there are very few showing multi-stream results. Kognitio has published both single-stream and multi-stream test results for the major SQL on Hadoop technologies here.
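A rough way to approximate a multi-stream test against your own endpoint is to run the same query set from several concurrent streams and compare the per-stream timings with a single-stream run. This is a minimal sketch using Python’s stdlib, with in-process SQLite standing in for a real SQL-on-Hadoop connection (all table names and query counts are illustrative):

```python
import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

def run_stream(stream_id, n_queries=50):
    """One query 'stream': its own connection, timed end to end."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    start = time.perf_counter()
    for _ in range(n_queries):
        conn.execute("SELECT COUNT(*), AVG(x) FROM t").fetchone()
    return stream_id, time.perf_counter() - start

# Single stream first, then several concurrent streams, mirroring the
# shape of TPC-DS single-stream vs multi-stream runs.
single = run_stream(0)[1]
with ThreadPoolExecutor(max_workers=4) as pool:
    multi = dict(pool.map(run_stream, range(4)))

print(f"single-stream: {single:.4f}s")
for sid, elapsed in sorted(multi.items()):
    print(f"stream {sid}:  {elapsed:.4f}s")
```

If per-stream times balloon as streams are added, the engine is serializing work rather than genuinely supporting concurrency; a BI-ready engine should degrade far more gracefully.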
5. Ease of adoption
As most SQL on Hadoop technologies are either open-source or free-to-use, there are now few financial barriers to adoption. However, previous poor experiences with new technologies on existing Hadoop installations have made organisations cautious about trying new stuff. Development or test clusters can obviously be used, but these are often small and unrepresentative. In these circumstances, cloud-based testing can work very well. For temporary test systems, Amazon spot pricing is great and allows significant clusters to be built at surprisingly low cost. All of the main Hadoop distributions now support cloud-based installations.
The following chart lists the main SQL on Hadoop technologies available today, with some indication of how I believe they compare based on the above criteria. The comparison is mostly based on the comprehensive benchmarking exercises we have undertaken, but also on some more subjective analysis. If you have a contrary opinion, I’d be happy to hear your input so go ahead and comment on this blog.
| Technology | Free to adopt | Hadoop distro | Support for high concurrency | SQL coverage & maturity | Performance | Memory layer | Comments |
|---|---|---|---|---|---|---|---|
| Kognitio | Yes | All | High | High | Very high | Pinned images | Kognitio Analytical Platform ported to Hadoop |
| Hive | Yes | All | Low | Med | Very low | No | The metadata provider |
| Impala | Yes | CDH | Low | Med | Med | Block cache | Not a YARN app |
| Hive LLAP | Yes | HDP | Low | Med | Med | Cache | Needs to warm up |
| Presto | Yes | All | Low | Low | Low | Some caching | Not a YARN app. Lots of connectors |
| SparkSQL | Yes | All | Low | Med | Low | Inefficient images | Go-to app for analytics |
| BigSQL | No | All (with Hortonworks partnership) | Med | High | Med | No | Spark, HBase and IBM DB2 |