Transforming Hadoop into your best BI platform:
An expert how-to guide

Marriage counsellors needed: Hadoop and BI

What we’ll cover

  • The promise of Hadoop for managing big data environments
  • The limitations that have emerged to tarnish that reputation
  • The problem of querying data in Hadoop

This paper comes with a big, bold claim and one we’re prepared to stand by.

Yes, Hadoop. And yes, we mean making it the very foundation for delivering the best business intelligence platform your organization has ever experienced.

But where to start? Well, logically it makes sense to begin by taking a look at Hadoop itself, and to ask the question: “Why isn’t it currently the best BI platform you have?”

Hadoop is of course synonymous with big data. Indeed, the growing need inside organizations to process and analyze structured/unstructured data – alongside the associated challenges of data quality, storage, and security – was instrumental in Hadoop’s development as a data processing platform.

Hence the massive hype surrounding the technology, and why organizations including Facebook – famous ‘crunchers’ of big data – have introduced Hadoop into their data analytics infrastructure.

Keeping in touch with reality

Despite Hadoop being considered essential to the spread of big data, enterprise adoption has not always delivered on the promises surrounding it.

Why? Well, to put it simply, limitations have emerged. Not, it must be said, when handling terabytes and petabytes of data through batch processing and ETL (Hadoop’s sweet spot). Rather, the challenge lies in enabling your business users to query the data in the manner they’ve become used to with commercial, off-the-shelf analytics products.
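To put that in concrete terms, here is a rough sketch of the kind of ad-hoc, slice-and-dice question a business user expects answered in seconds. It is written against PySpark’s SQL interface purely as an illustration; the table and column names are our own assumptions, not anything prescriptive.

    # A minimal sketch of an ad-hoc, BI-style query business users expect to run
    # interactively. Assumes a PySpark environment with Hive support and a
    # hypothetical `sales` table; all names are illustrative only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("adhoc-bi-query")
        .enableHiveSupport()  # read tables registered in the Hive metastore
        .getOrCreate()
    )

    # The sort of question an analyst fires off many times a day
    result = spark.sql("""
        SELECT region,
               product_category,
               SUM(revenue)                AS total_revenue,
               COUNT(DISTINCT customer_id) AS customers
        FROM   sales
        WHERE  order_date >= '2016-01-01'
        GROUP BY region, product_category
        ORDER BY total_revenue DESC
    """)

    result.show(20)  # on a batch-oriented stack this can take minutes, not seconds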

Such a situation places the spotlight on the speed of interactive SQL queries. What does this mean?

Keeping it interactive

Hadoop was first conceived as a NoSQL platform. Despite still being used by over 50% of professional developers, SQL was seen as an overly complex, ‘old world’ standard for querying data.

Moving away from it was therefore seen as a good thing: positioning Hadoop as a virtual playground full of possibilities for engineers to reinvent a business’ relationship with its data resources.

All good in theory, yet in practice the absence of SQL has led to blockages in the system. Data Scientist-sized blockages. That’s because it took a seasoned Data Scientist to extract value from Hadoop.

Business users were (and still are) excluded from most activities, and reliant on the experts to construct aggregated data sets in Hadoop – before moving them into standard databases for exploitation. Data sets that are by their very nature limited in terms of size and scope – and rapidly ageing from their moment of birth. Such challenges have caused businesses to reassess Hadoop.
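To illustrate that workflow (and only to illustrate it), the sketch below shows how a specialist might build a pre-aggregated extract in Hadoop with PySpark and then push it out over JDBC to a conventional reporting database. Every table name, column name and connection detail here is hypothetical.

    # A sketch of the "aggregate in Hadoop, export to a standard database" pattern
    # described above. Assumes PySpark with Hive support, a PostgreSQL JDBC driver
    # on the classpath, and a reachable reporting database; all names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("aggregate-and-export")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Step 1: the specialist builds a pre-aggregated, pre-filtered extract in Hadoop
    summary = (
        spark.table("clickstream_events")            # raw events stored in Hadoop
        .where(F.col("event_date") >= "2016-01-01")
        .groupBy("event_date", "channel")
        .agg(
            F.countDistinct("user_id").alias("unique_users"),
            F.count("*").alias("events"),
        )
    )

    # Step 2: the extract is pushed out to a conventional database for BI tools.
    # The aggregate is frozen the moment it lands, and starts ageing immediately.
    (
        summary.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://reporting-db:5432/bi")
        .option("dbtable", "daily_channel_summary")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("overwrite")
        .save()
    )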

Keeping a lid on it

“Great for serving as a cheap storage repository, and for processing ETL workloads” is a consensus view that’s now matched in part by more negative considerations: “the complexity of running in-memory, parallel workloads limits its ability to support interactive, user-facing apps”.

What does this mean for the world of BI? The challenges here fall into three principal areas:

1. An inability to perform BI tasks directly on Hadoop is causing organizations to aggregate individual data sets and push them out to business users

2. Running data analysis directly in Hadoop is too slow, and the problem is made worse by highly concurrent workloads

3. The SQL products on the market that are dedicated to solving these problems are themselves too slow to be effective

Equally, these problems aren’t occurring in isolation from wider business trends. Rather, they’re happening against a backdrop of digital transformation agendas and the need for real-time intelligence to be made available at the ‘sharp end’, placed at the fingertips of those users able to respond to the resulting insights. In other words, we’ve reached the age of pervasive BI, where organizations can have hundreds if not thousands of users demanding access to the same data resources at the same time.

Exploring the nature of the problem

In this section we’ll cover:

  • The challenge of ensuring everyone works with the same data set
  • The problem of speed and highly concurrent workloads
  • The ineffective nature of most SQL-on-Hadoop engines

Keep reading