Four ways to get it wrong with SQL on Hadoop

More than half of professional developers work with SQL. And for all their early revolutionary fervor, even NoSQL databases are increasingly turning to SQL-like query languages. Despite its imperfections, SQL is how the world queries data.

Hadoop started from a fundamentally different position. Hadoop was about solving the problems of big data: variety, volume and velocity. This wasn’t like the neatly ordered world of relational databases. Here the issue was less about queryability and more, “Oh my god, what do we do with all this stuff?” Google’s MapReduce paper pointed one way to query massive volumes of semi-structured data and became, with HDFS, one of the cornerstones of Hadoop.

Then Hadoop got massively popular. Developers used to the declarative ease of SQL queries now had to think at a far lower level about how the system should go find the answers they wanted. Outside of its specific use cases at Google, MapReduce was slow, hard to work with and inflexible.

Fourteen years on from the first Hadoop release, several projects have attempted to apply SQL to Hadoop in an effort to make querying data easier and applicable to a broader set of use cases. Too often, though, they see SQL through the lens of the Hadoop ecosystem and, in the process, lose some of what makes SQL special.

Here are four of the ways that SQL on Hadoop projects tend to get it wrong.

1. SQL compatibility is not a box-ticking exercise

While the SELECT query is SQL’s workhorse, there’s far more to the language. Over the 43 years since SQL’s initial release, there have been eight major ISO revisions to SQL, each adding new functionality and complexity.

So, SQL is a moving target for anyone hoping to implement it. As SQL on Hadoop projects and vendors compete to cover more of the language, it’s important to go beyond the simple laundry-list of SQL features that each claims to support.
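As a concrete illustration of "far more than SELECT", consider window functions, added in SQL:2003 and a staple of modern analytics. They are exactly the kind of feature a checklist can tick without saying anything about how well they run. A minimal sketch using Python's bundled sqlite3 module (window functions require SQLite 3.25 or later; the table and data here are invented for illustration):

```python
import sqlite3

# Requires SQLite 3.25+ for window-function support.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 100), ('north', 250), ('south', 300), ('south', 50);
""")

# A SQL:2003 window function: a running total per region.
# Trivial to state declaratively, but non-trivial for an engine to
# parallelise well, since each row's result depends on the rows
# before it within its partition.
rows = conn.execute("""
    SELECT region, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY amount)
               AS running_total
    FROM sales
    ORDER BY region, amount
""").fetchall()

for row in rows:
    print(row)
```

A one-line SELECT like this hides a lot of engine machinery: partitioning, sorting and ordered aggregation, each of which an MPP engine has to distribute across the cluster.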

Here’s the problem: Hadoop is all about massively parallel processing. If your SQL on Hadoop tool implements the trickier SQL functions as serial jobs, it’s a waste of infrastructure. Like never taking a sports car out of second gear, you might as well not be using Hadoop if some of your queries run serially. However, implementing those same functions as parallel processes immediately increases the implementation complexity by a factor of ten. Even then, there are different ways of parallelising queries: some are more efficient but harder to implement, while others are cheaper to build but burn more CPU cycles than they should.
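The scatter/gather shape that an MPP engine uses for even a simple aggregate can be sketched in miniature with Python's multiprocessing module. This is a toy model, not how any particular engine is built: each worker aggregates its own partition of the data, and the partial results are combined at the end.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker aggregates its own partition of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Split the data into roughly equal partitions, aggregate each
    # partition in parallel, then combine the partial results --
    # the same scatter/gather shape an MPP engine uses for SUM().
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    data = list(range(100_000))
    # The parallel plan must agree with the serial answer.
    assert parallel_sum(data) == sum(data)
```

SUM happens to combine cleanly; the point of the article's paragraph is that many SQL features (ordered window functions, percentiles, correlated subqueries) do not decompose this neatly, which is where implementations diverge.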

SQL on Hadoop vendors have made progress in how much of SQL they support, but not all implementations are equal.

2. It has to be open source

There’s no doubt that open source has taken its place in the mainstream of software. But that’s not to say it’s always the best way to publish and distribute software.

One of the complexities of the Hadoop ecosystem is the lack of a single leader. Before anyone cries “FUD!”, this isn’t about bashing open source software. This is about the peculiarities of how Hadoop is developed and sold.

Each of the major Hadoop vendors has their own particular modules that they sponsor or, perhaps, develop entirely in-house. Across the whole set-up, though, the code running your Hadoop cluster comes from multiple open source projects, most of which are not authored by your particular vendor. When there’s a high priority problem that affects many Hadoop users, the open source model comes into its own and the itch is scratched. However, when you have a bug that affects maybe just you, then you need your vendor to step up and issue a patch for code their team may never have previously seen. Experiences vary.

So, open source in the Hadoop world can complicate matters. What’s more important than being open source is that you can try a tool without having to make a financial commitment first. If you can use and deploy, say, a SQL on Hadoop solution free of charge, then it lives or dies on its technical suitability to your problem.

3. Stuck on disk

You might expect that older SQL query engines for Hadoop, such as Hive, would be disk-bound. Surprisingly, newer alternatives – including Cloudera’s Impala – are also disk-based.

In many of the cases where SQL is most useful, data is perishable: the longer you leave it before querying, the less useful it becomes. Similarly, many analytical use cases require real-time responses; consider business intelligence dashboards.

Disk-based query engines are necessarily slower than in-memory engines. When you’re processing massive numbers of parallel queries over the equivalent of billions of rows of data, the last thing you need is to be slowed down by hard disks. Even the fastest SSDs are one or two orders of magnitude slower than RAM.

Now, this is where some vendors will tell you that caching is the answer. But caching is not the same as in-memory.

A cache is a copy of the most recently used disk data. The query optimiser does not know the data is in memory, so it can’t optimise for execution in memory. Even the execution engine running the query does not know the data is in memory, so it has to check by running expensive cache validation queries.

In a true in-memory system, every part knows that the data is in RAM. The optimiser optimises for memory operations and the execution engine never wastes CPU cycles asking the “cached, not cached?” question.

Don’t waste time and CPU cycles on disk-bound query engines.
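The structural difference described above can be sketched as two toy read paths. The class and method names here are invented for illustration; no real engine is this simple, but the extra "cached or not?" branch on every read is the point.

```python
class DiskEngineWithCache:
    """Toy model of a disk-based engine with a block cache: every
    read must first check whether the block is already cached."""

    def __init__(self, blocks):
        self.disk = list(blocks)   # authoritative copy "on disk"
        self.cache = {}            # most recently used blocks

    def read(self, block_id):
        # The engine cannot assume the block is in memory, so every
        # access pays for a validation check before the data is used.
        if block_id in self.cache:
            return self.cache[block_id]
        value = self.disk[block_id]        # simulated disk fetch
        self.cache[block_id] = value
        return value

class InMemoryEngine:
    """Toy model of a true in-memory engine: the data is known to
    live in RAM, so reads are direct with no validation step."""

    def __init__(self, blocks):
        self.ram = list(blocks)

    def read(self, block_id):
        return self.ram[block_id]

cached = DiskEngineWithCache([10, 20, 30])
direct = InMemoryEngine([10, 20, 30])
assert cached.read(1) == direct.read(1) == 20
```

Both return the same answer; the difference is that the in-memory path has no branch, and a real optimiser can plan around that certainty.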

4. Stick with what the vendor tells you

There isn’t really a single thing called Hadoop any more. Just like Linux, there are distributions that bring together different aspects of the Hadoop ecosystem and slap a support package on top.

Hortonworks’ Hadoop is not Cloudera’s Hadoop, and neither looks like MapR’s spin on Hadoop.

So your vendor gives you an approved list of tools, and many customers stick with it. But if you wanted a cosseted, closed ecosystem, why did you choose Hadoop in the first place?

We interviewed big data developers and architects earlier this year. Almost all of them told us that they chose Hadoop because it is open source and because it offers them flexibility to choose from a growing ecosystem of competing tools.

If you’re looking to get high performance SQL query from your Hadoop cluster, then sticking with what your vendor recommends is the wrong way to go.

Kognitio on Hadoop is a massively parallel, distributed, in-memory, ANSI SQL query engine that is in use on clusters of all Hadoop flavours. It addresses each of the four mistakes that we’ve covered here and it’s free to use.

You can read more about Kognitio on Hadoop and try it for yourself. If you have questions or comments, leave them at the bottom of this post.
