How tools like Spark are speeding up big data

When companies are analysing potential big data analytics solutions, there are several factors that need to be taken into consideration, chief among them the 'three Vs' of volume, variety and velocity.

In the past, many businesses focused mainly on volume, as they needed to overhaul their systems to cope with the huge increase in data brought about by a more digital world. But once this information has been collected, the question becomes how businesses can get the most out of it, and this is where velocity is playing an increasingly important role.

Computer Weekly noted that getting results quickly has proved to be a major challenge for many enterprises. Although tools such as Hadoop have greatly helped businesses store and analyse huge amounts of data at affordable cost, Hadoop is not renowned for speedy results.

For instance, the MapReduce programming model, which is used to analyse data stored in the Hadoop Distributed File System (HDFS), is difficult to learn and is designed primarily for batch processing. Queries can therefore take a long time to complete, which makes the model poorly suited to certain tasks.
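To make the batch model described above more concrete, the following is a minimal sketch of the map, shuffle and reduce phases applied to a word count, written in plain Python. It does not use Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values to get one total per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence over the whole input (one batch)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, standing in for a file stored in HDFS.
sample = ["the river flows", "the river listens"]
counts = word_count(sample)
print(counts)
```

In a real cluster each phase runs across many machines and intermediate results are typically written to disk between phases, which is one reason a batch job of this kind can be slow to return an answer.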

Matt Aslett, research director for data platforms and analytics at 451 Research, told the publication that while Hadoop has opened up new opportunities for applications such as fraud detection, online advertising analytics and e-commerce recommendation engines, these solutions need a more rapid turnaround from data to conclusion to be effective.

"Batch processing is OK, but if it takes an hour or two, it’s not great for these applications," he stated.

However, there are solutions emerging to address this issue, such as Apache Spark. Mr Aslett observed: "With Spark and in-memory processing, you can get the response down to seconds, allowing real-time, responsive applications."

While interest in such solutions has been building for a while, there has recently been a huge upturn in demand, which Mr Aslett said is partly a result of more Hadoop providers backing the technology. In many cases, tools like Spark are being used to complement batch processing, enabling more in-memory solutions for real-time analytics.

The benefits of Spark and its ilk have already been demonstrated in a variety of benchmark tests. For instance, an analysis by AMPLab in 2013 found Spark could perform up to 100 times faster than MapReduce for certain applications.

While not every use case will provide such extreme results, many everyday activities stand to benefit from the speed boost provided by Spark and similar offerings.

For example, high-performance computing consultants OCF loaded the Hermann Hesse novel Siddhartha into both HDFS and Spark in-memory to compare how long each took to count the words in the 700MB file. Hadoop completed the task in 686 seconds; Spark did it in 53 seconds, or about 13 times faster.
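The "13 times faster" figure can be checked from the two timings quoted above. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Timings reported for the OCF word-count comparison, in seconds.
hadoop_seconds = 686
spark_seconds = 53

# Speedup is the ratio of the slower time to the faster time.
speedup = hadoop_seconds / spark_seconds
print(f"Spark speedup: {speedup:.1f}x")  # prints "Spark speedup: 12.9x"
```

The ratio comes to roughly 12.9, which rounds to the 13 times quoted.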

Zubin Dowlaty, vice-president and head of innovation and development at big data consultancy Mu Sigma, told Computer Weekly that although IT departments can use technologies like Spark and HDFS to work on new data sets with new programming tools, the key to success will be getting data professionals to change their mindset.

"They need to shake themselves up with respect to the agility these new tools provide," he said. "They need to shake up the business to think bigger. Computation can really scale now, so you can do much more. But the leadership is not coming from the chief technology officer, it is coming from the chief marketing officer or other C-suite executives."
