How Hadoop helps scientists analyse big data

The era of big data is having an impact on all sections of society, but one area where it is proving particularly useful is scientific research.

Researchers at Princeton University noted that ever-expanding data sets are creating new challenges for numerous sectors. In genomics, for instance, more than 500 000 microarrays are now publicly available – with each array containing tens of thousands of expression values of molecules.

Elsewhere, biomedical engineering is dealing with the creation of tens of thousands of terabytes of functional magnetic resonance images, while the rise of social media, e-commerce and surveillance data also contributes to the growing volume of information.

"Expanding streams of social network data are being channeled and collected by Twitter, Facebook, LinkedIn and YouTube," Princeton stated. "This data, in turn, is being used to predict influenza epidemics, stock market trends, and box-office revenues for particular movies."

With so much data to process and analyse, traditional methods are no longer adequate. The researchers explained: "In many applications, we need to analyse internet-scale data containing billions or even trillions of data points, which makes even a linear pass of the whole dataset unaffordable."

Therefore, infrastructure that supports massively parallel processing and storage will be a must. The researchers explained that a 'divide and conquer' strategy is the best solution, with the idea being to partition a large problem into more tractable sub-problems – each of which will be handled in parallel by different processing units.
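
To make the 'divide and conquer' idea concrete, here is a minimal sketch in Python. It is an illustration only, not code from the Princeton researchers: the function names `partition` and `parallel_sum` are hypothetical, and a local thread pool stands in for the separate processing units that a real cluster would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into at most n_parts contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // n_parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, n_workers=4):
    """Sum a large list by solving one sub-problem per chunk, then combining."""
    chunks = partition(data, n_workers)
    # Assumption: threads stand in for the processing units of a cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Combine step: merge the partial results into the final answer.
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
```

The key point is that no worker ever needs to see the whole data set; each handles its own chunk, and only the small partial results are brought back together.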

The researchers also highlighted Hadoop as a key technology for making the most of this approach. They noted that its MapReduce programming model allows very large data sets to be processed in a parallel fashion.
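
The shape of a MapReduce job can be sketched with the classic word-count example. The snippet below is plain Python running on a single machine and does not use Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only to mirror the stages of the model.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run the three stages in sequence over a list of text records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big science"]))
```

Because each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key, a framework is free to spread both stages across many machines.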

Hadoop's HDFS distributed file system is also designed to host and provide high-throughput access to large datasets stored across multiple machines. The Princeton researchers added that this "ensures big data's survivability and high availability for parallel applications".