Transforming Hadoop into your best BI platform:

An expert how-to guide

Exploring the solution

What we’ll cover

  • The importance of SQL as THE data query tool
  • How ultra-fast, high-concurrency analysis is possible on Hadoop
  • The ability to say ‘no’ to any more data aggregation

There’s no escaping it: SQL rocks

What the above tells us, as far as a solution is concerned at least, is that SQL needs to be involved – and involved heavily.

It is, after all, the most widely used query language in the world. People are familiar with it, and it remains the language that popular data visualization tools use to ask questions of a relational DBMS.

The challenge now is to extend this to non-relational data stores: to use SQL as a common language that simplifies access to data held in multiple stores – without having to switch between different APIs to make it all work.

Step forward SQL-on-Hadoop

That’s right: SQL on the ultimate NoSQL platform – SQL query engines for Hadoop in big data systems that can transform your experience of BI platforms. Tools that return IT to a state of ease and familiarity when it comes to programming analytics apps and integration tasks. Tools that enable developers to use their SQL skills within extensive Hadoop data lakes, and to overcome the ‘weak’ relational functions within the platform. Capabilities that, once and for all, help Hadoop break free from the confines of the data scientists’ laboratories and enter the BI mainstream.

Five core supporting reasons for SQL on Hadoop

1. SQL is the language of data query, is proven to work in big data environments (eBay uses it to process 50 petabytes each day), and is used by all modern data visualization tools for accessing data

2. SQL is the preferred language of the data management community, and sits naturally with their existing tool sets

3. SQL offers immediate returns – most businesses are familiar with it, and make use of it on a daily basis

4. Fast, efficient ad hoc exploration of Hadoop data, enabled by SQL, is a top priority and essential for justifying long-term investment in the platform

5. Self-service analytics is increasingly seen as business-critical, and without SQL tools it will be restricted to a narrow range of users able to extract value from Hadoop data

The best platform for analyzing billions of transactions

The world’s largest payment card issuer has 10,000 active Tableau users analyzing data held in a nine-petabyte Hadoop cluster in near real-time.

Their aim is to identify spending patterns across a data set that covers 12 billion transactions. To do this, the company had two options:

1. Create and maintain 10,000 near real-time data extracts for individual clients – which operationally would most likely prove to be impossible

2. Use a query engine for Hadoop capable of handling hundreds of concurrent complex SQL queries over the entire data set – and returning the results in near real-time

Unsurprisingly, they went with option 2: Kognitio.
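To make option 2 a little more concrete, here is a minimal sketch of the pattern: many users sending the same kind of aggregate SQL query to one shared data set at the same time, rather than each working from a private extract. It is an illustration only. Python's built-in sqlite3 module stands in for a SQL-on-Hadoop engine, and the transactions table, its columns, the sample rows and the function names are all hypothetical; none of this reflects how Kognitio or the card issuer's system is implemented.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: (card_id, category, amount).
ROWS = [
    ("c1", "grocery", 20.0),
    ("c1", "travel", 300.0),
    ("c2", "grocery", 35.5),
    ("c2", "grocery", 14.5),
    ("c3", "travel", 120.0),
]

# The kind of ad hoc question a BI tool might send: spend per category.
SPEND_BY_CATEGORY = """
    SELECT category, COUNT(*) AS txn_count, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY category
    ORDER BY category
"""


def build_store(path):
    """Create the single shared data set (sqlite3 is only a stand-in)."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE transactions (card_id TEXT, category TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", ROWS)
        conn.commit()
    finally:
        conn.close()


def run_query(path):
    """One simulated user: open a connection and query the full data set."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(SPEND_BY_CATEGORY).fetchall()
    finally:
        conn.close()


def simulate_users(path, users=8):
    """Run the same query from several simulated users at once."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(run_query, [path] * users))


def demo():
    """Build the store in a temporary directory and query it concurrently."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "transactions.db")
        build_store(path)
        return simulate_users(path)


if __name__ == "__main__":
    for user, rows in enumerate(demo(), start=1):
        print(f"user {user}: {rows}")
```

Every simulated user gets the same answer from the same table, with no per-user copies to create or refresh. The sketch says nothing about performance at petabyte scale; that is the job of the query engine itself.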

Addressing the three problems of BI on Hadoop

In this section we’ll talk what-ifs. Let’s talk ultra-fast, high-concurrency SQL for Hadoop and data warehousing, and let’s talk scale-out, in-memory software that lets modern BI and visualization tools maintain their performance – even when the data volume is large and the user count high.
