Hadoop… Let’s not throw the baby out with the bathwater again!
Here we go again! Suddenly the industry seems to have turned on Hadoop. Headlines declaring “it’s hit the wall” and “it’s failed” have recently appeared, and some are suggesting that organisations look at alternative solutions. Granted, Hadoop has its limitations and has not lived up to the massive hype that surrounded it a year or two ago, but then nothing ever does.
I admit I was not a fan of Hadoop when it first appeared; it seemed like a step backwards. It was very complicated to install, unreliable, and difficult to use, but it still caught the industry’s imagination. Engineers liked it because it was “proper engineering”, not a shrink-wrapped, productionised product, and the business was seduced by the idea of free software. Pretty quickly it became an unstoppable runaway train, and the answer to “life, the universe and everything” was no longer 42 but Hadoop.
Great expectations generally lead to disappointment, and this is Hadoop’s problem. We hyped it up to such an extent that it was always going to be impossible for it to live up to the expectations, no matter how much it improved, and it has improved immeasurably! Hadoop is following the Gartner Hype Cycle (one of the cleverest and most accurate representations of how the perception of technology evolves) perfectly. It’s just that for Hadoop the curve is enormous!
So what do I mean by “let’s not throw out the baby with the bathwater again”? In Hadoop’s early days the hot topic was NoSQL. The message was that SQL was dead. The problem with SQL was that it was difficult to write the complicated mathematical algorithms required for advanced analytics and, as the name suggests, it relies on the data having structure. Advanced analytical algorithms are easier to implement, and unstructured data easier to handle, in languages such as R and Python. All perfectly true, but advanced analytics is just the tip of the data analytics triangle, and the rest of the space is very well served by traditional SQL. Traditional BI reporting and self-service data visualisation tools are still massively in demand and generally use SQL to access data. Even unstructured data is usually processed to give it structure before it is analysed. So when the NoSQL bandwagon claimed SQL was dead, it was effectively throwing out the most widespread and convenient way for business users to access Hadoop-based data, in favour of something that only developers and data scientists could use.
Of course sense eventually prevailed, NoSQL morphed into “Not Only SQL”, and now everyone and his brother is implementing SQL-on-Hadoop solutions. The delay has been costly, though, and the perceived lack of fully functional, high-performance SQL support is one of the key reasons why Hadoop is currently under pressure. I say perceived because there are already very good SQL-on-Hadoop solutions out there if people are willing to look outside the Apache box, but this is not a marketing piece, so I will say no more on that subject. My point is that the IT industry has a history of using small weaknesses as an excuse to suddenly turn on otherwise very useful technologies. There will always be those whose interests are best served by telling the industry that something is broken and we need to throw it away and start again. The IT industry’s problem is that it is often too easily led astray by these minority groups.
Hadoop has come a long way in a short time, and although it has problems, there is a large community of people working to fix them. Some point to the lack of new Apache Hadoop projects as a sign of Hadoop’s demise; I would argue that this is a positive thing, with the community now focused on making and finding things that work properly rather than constantly chasing the shiniest new project! I think Hadoop is finally maturing.
This post first appeared on LinkedIn on April 18, 2017.