Transforming Hadoop into your best BI platform:

An expert how-to guide

Exploring the nature of the problem

What we’ll cover

  • The challenge of ensuring everyone works with the same data set
  • The problem of speed – and high-concurrency workloads
  • The ineffective nature of most SQL-on-Hadoop engines

Let’s look at these problems in more detail…starting with issue one.

Problem one: inability to perform BI tasks on Hadoop

An inability to perform BI tasks directly on data in Hadoop is causing organizations to aggregate individual data sets and push them out to business users.

Over time, Hadoop has developed into a very useful platform for data storage and batch processing. That said, when it comes to providing interactive, ad hoc data analysis we hit a problem: the data has to be moved onto another platform. This is partly due to the batch-processing model Hadoop was designed for, which can lead to long response times for such impromptu analysis, and partly due to the complexities involved in taking serial SQL operations and ‘parallelizing’ them from scratch.

The tendency is therefore to rely on data aggregation, with data scientists creating data subsets that are loaded directly into existing BI tool sets. This in turn creates a range of performance and governance issues: users work on their own personal subsets, which reduces the potential for creating a ‘single version of the truth’. In addition, these data sets are by their very nature limited in the data they contain, and are based on assumptions made by their creators about the type and spread of data needed to complete a specific task.
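To make the limitation of pre-aggregated subsets concrete, here is a minimal Python sketch. Everything in it is hypothetical and purely illustrative: the sample rows, the column names (region, channel, revenue) and the helper functions are assumptions, not part of any Hadoop or BI tool API. It shows how an extract built around one assumed question (revenue by region) simply cannot answer a different one (revenue by channel), while the detail data in the central store can.

```python
from collections import defaultdict

# Hypothetical detail rows, standing in for data held in the central store.
RAW_ROWS = [
    {"region": "EMEA", "channel": "web", "revenue": 120.0},
    {"region": "EMEA", "channel": "store", "revenue": 80.0},
    {"region": "APAC", "channel": "web", "revenue": 200.0},
    {"region": "APAC", "channel": "store", "revenue": 50.0},
]


def sum_revenue_by(rows, dimension):
    """Total the revenue for each distinct value of the given column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["revenue"]
    return dict(totals)


def build_extract(rows, dimension):
    """Build an aggregated subset that keeps only one dimension."""
    return [
        {dimension: value, "revenue": total}
        for value, total in sum_revenue_by(rows, dimension).items()
    ]


def try_question(rows, dimension):
    """Answer 'revenue by <dimension>', or return None if the column is gone."""
    try:
        return sum_revenue_by(rows, dimension)
    except KeyError:
        return None


# The extract's creator assumed users would only ever ask about regions.
regional_extract = build_extract(RAW_ROWS, "region")

if __name__ == "__main__":
    print("By region, from the extract: ", try_question(regional_extract, "region"))
    print("By channel, from the extract:", try_question(regional_extract, "channel"))
    print("By channel, from the raw data:", try_question(RAW_ROWS, "channel"))
```

The extract answers the question it was built for, but the channel column was discarded at aggregation time, so any new question means going back to the source and building yet another subset.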

As a result, with Hadoop approaching the ‘trough of disillusionment’ on the Gartner hype cycle, the experience for many users is that it’s easy to put data into the platform, but inefficient and cumbersome to get insights out. The consequences are constant (and unmanageable) data movement, poor ROI and platform utilization, and the loss of the more subtle insights that come from interrogating one central data ‘lake’.

Problem two: analytics on Hadoop is too slow

Running data analysis directly in Hadoop is too slow, and is made worse by high-concurrency workloads.

Part of what gives Hadoop the wow factor is its ability to store vast sets of structured and unstructured data. Yet extracting meaningful value from these resources quickly and effectively, and working in real time with advanced analytics, remains out of reach for many organizations.

But is it fair to lay the blame solely at the feet of Hadoop? Or should we instead charge the tools being deployed on top of it with failing to make the most of the platform? These tools have been elevated beyond their ability to deliver by the hype surrounding Hadoop and the expectation of a solution to the big data ‘crisis’, while poor interfaces add to the overall complexity.

Talk of failure to date only helps illustrate that many current adopters have implemented Hadoop without properly understanding it – and then failed to bring together the right tools, data, and expertise to get it working properly. The platform may stand accused of being a poor BI platform, but it can point to a variety of extenuating circumstances!

Problem three: SQL engines are too slow to be effective

The SQL products on the market that are dedicated to solving these problems are themselves too slow to be effective.

In a way, this problem reflects the immaturity and limited capabilities of SQL-on-Hadoop products and services. These tools have typically failed to address the challenge of query performance (even under a single user’s workload) as well as concurrency, where the need to offer multiple users access to the same data sources at the same time leads to errors and snail-like interactivity.

Yet the problem also stems from the fact that many of the currently available SQL query engines are disk-based, and significantly slower to respond than in-memory engines. This is particularly pertinent to the speed equation, as the ability to process massive numbers of parallel queries (over billions of rows of data) will inevitably be compromised by hard disks. Even the latest SSDs on the market fall well behind the performance on offer from RAM.

Exploring the solution

In this section we’ll cover:

  • The importance of SQL as THE data query tool
  • How ultra-fast, high-concurrency analysis is possible on Hadoop
  • The ability to say ‘no’ to any more data aggregation
