How can tools like Spark give Hadoop a boost?

Next week, big data experts and information professionals from around the world are set to descend on New York for the 2014 Strata + Hadoop World conference. Taking place between October 15th and 17th, the event will see thousands of attendees gather to learn more about the latest developments in big data analytics.

And this year, one of the big developments that everyone's talking about is Apache Spark. This in-memory platform is designed to let enterprises load and query large volumes of data at exceptionally high speed, making it well suited to intensive analytics workloads such as machine learning, where huge amounts of data need to be processed very quickly.

Last month, Hortonworks announced it was extending Spark to make it work more effectively with Hadoop's YARN system, which allows multiple data processing engines to interact with data stored in a single platform.

In a blog post introducing the enhancements, Hortonworks' Vinay Shukla and Tim Hall said: "There has been unbridled excitement for Spark over the past few months because it provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques."

Indeed, co-founder of Hortonworks Arun Murthy stated in an interview with ZDNet that Spark is one of the "most interesting" developments to come out of the open-source community and is a testament to the resilience of the Hadoop ecosystem and the innovations of the community.

As memory gets cheaper, solutions such as Spark and other tools that exploit the power and speed of in-memory processing are likely to become much more widespread. Mr Murthy noted that he has customers now running 100GB of RAM on every box, meaning that by linking together a few dozen of these machines, companies can easily assemble two or three terabytes of RAM for their analytics.

"When I started off in Hadoop, our servers would have about 4GB to 8GB of RAM per box. That was state of the art at that point," he continued. "Today it's not 4GB or 8GB; it's 128GB or 256GB of memory. So Spark is the right technology at the right time."
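The arithmetic behind those figures is easy to check. The node count below is illustrative, not a Hortonworks number:

```python
# Back-of-envelope cluster memory, using the per-node figure quoted above.
ram_per_node_gb = 100          # "100GB of RAM on every box"
nodes = 25                     # illustrative cluster size

total_gb = ram_per_node_gb * nodes
total_tb = total_gb / 1000     # decimal terabytes

print(total_tb)  # 2.5 - a few dozen such nodes reach the 2-3TB range
```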

Because of this, it is now very appealing for data scientists to look at the capabilities available to them and plan for much more in-depth analytics than would previously have been possible. With large amounts of memory to assist them, solutions such as real-time or predictive analytics, and algorithms for machine learning and modelling, become much more cost-effective.

The improved management of Spark within YARN may play a major role in bringing the benefits of in-memory computing to more organisations. Mr Shukla and Mr Hall said the deeper integration will allow Spark to run as a more efficient tenant alongside other engines such as Hive, Storm and HBase, all simultaneously on a single data platform.
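In practice, running Spark as a YARN tenant comes down to pointing a job at the cluster's resource manager rather than at a standalone Spark cluster. A minimal sketch (the script name and resource figures are illustrative and would be tuned per cluster):

```shell
# Submit a Spark application to a Hadoop cluster managed by YARN.
# With --deploy-mode cluster, the driver itself runs inside a YARN
# container, so YARN arbitrates its memory and CPU alongside Hive,
# Storm, HBase and any other tenants on the same platform.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 8G \
  --executor-cores 2 \
  my_analytics_job.py
```

Because YARN hands out the executor containers, Spark's memory appetite is capped and scheduled like any other engine's, which is what makes the multi-tenant arrangement workable.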

"Fundamentally, our investment strategy continues to focus on innovating at the core of Hadoop and we look forward to continuing to support our customers and partners by contributing to a vibrant Hadoop ecosystem that includes Apache Spark as yet another data access application running in YARN," the Hortonworks employees added.