Marriage counsellors needed: Hadoop and BI
We’ve reached the age of pervasive BI, where business insights are placed at the fingertips of those who make the key strategic decisions for success.
Leading businesses are data-driven, focused on extracting as much insight and understanding as possible from the data they consume. In the main, Hadoop has been a technology that has facilitated the collection, processing and storage of data in many formats, shapes and sizes.
But one of the key issues with Hadoop implementations has been how to monetize all the data in Hadoop and enable business users to query this valuable big data in the manner they’ve become used to, with their own choice of commercial, off-the-shelf analytics products, like Tableau, Qlik or similar.
These off-the-shelf analytics products translate their queries into SQL. But taking serial SQL operations and “parallelizing” them for a parallel platform like Hadoop is highly complex. So the result of trying to connect these tools to Hadoop is, at best, painfully slow queries and, at worst, complete failure to run at all.
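To give a feel for why parallelizing a serial SQL operation is not trivial, here is a minimal Python sketch of a single aggregate, the equivalent of SELECT AVG(amount), computed over data split across partitions. Everything in it is an illustrative assumption: the function names, the sample values and the use of threads to stand in for cluster nodes are hypothetical, and it is not how any particular SQL on Hadoop engine is implemented. It shows that each partition has to return a partial result (a sum and a count) that is merged afterwards, because simply averaging the per-partition averages gives the wrong answer when partitions differ in size.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_agg(rows):
    """Per-partition step: return (sum, count) rather than a finished average."""
    return (sum(rows), len(rows))


def parallel_avg(partitions):
    """Compute AVG across partitions by merging partial (sum, count) pairs."""
    # Threads stand in for worker nodes here; this is only an illustration.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_agg, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


def naive_avg_of_avgs(partitions):
    """Incorrect shortcut: average each partition, then average the averages."""
    per_partition = [sum(p) / len(p) for p in partitions]
    return sum(per_partition) / len(per_partition)


# Hypothetical sample data: three unevenly sized partitions of one column.
partitions = [[10, 20, 30], [40], [50, 60]]

# What a serial engine would compute over the whole column.
serial = sum(x for p in partitions for x in p) / sum(len(p) for p in partitions)

print(serial)                         # serial answer
print(parallel_avg(partitions))       # matches the serial answer
print(naive_avg_of_avgs(partitions))  # differs, because partition sizes differ
```

Even this one aggregate needs to be rewritten into a partial step and a merge step. Joins, window functions and nested subqueries generated by BI tools need far more involved rewrites, which is where much of the complexity comes from.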
Counselling with SQL on Hadoop
As part of the Hadoop stack, there are SQL engines that purport to improve those operations. But the absence of engines that can truly work interactively between Hadoop and off-the-shelf analytics products has meant that businesses have kept Hadoop-based data in the domain of data scientists, skilled in coding, and the business user is then reliant on aggregated data sets that are moved to standard databases. For the business user, those data sets are limited in size and scope, and they start ageing from the moment of birth.
SQL on Hadoop tools have typically failed to address the challenges of query performance, even under just a single user’s workload. The problems range from a lack of ANSI SQL compliance to slow query speeds and poor concurrency (the need to offer multiple users access to the same data sources at the same time, while still maintaining adequate query speeds).
Many of the currently available SQL query engines are disk-based, and significantly slower in response than in-memory engines. This point is particularly pertinent to the speed equation, as the ability to process massive numbers of parallel queries (over billions of rows of data) will inevitably be compromised by hard disks. Even the latest SSDs on the market are way behind the performance on offer from RAM.
Evaluating the performance of SQL on Hadoop
There’s no getting away from SQL, no matter how far NoSQL has come along. It’s still the most widely used query language in the world. The right SQL query engine can transform your experience of Hadoop as a BI platform and enable your business to break free from the confines of data science laboratories and allow self-service BI on Hadoop to enter the mainstream.
Want to know your options for getting SQL to work on Hadoop? We’ve used the TPC-DS query set to measure the performance of the various SQL on Hadoop tools. In one blog, Sharon Kirkham explains why TPC-DS is an effective means of testing SQL on Hadoop. And in a second blog, Sharon outlines all of the results of the tests.