A new programming language developed by researchers at the Massachusetts Institute of Technology (MIT) could speed up big data processing by up to four times, according to its creators.
Called Milk, the language allows application developers to manage memory more efficiently in programs that deal with scattered data points in large data sets. In tests conducted using several common algorithms, programs written in the new language were shown to be four times as fast as those written in existing languages, and the researchers behind the language believe that further work will result in even larger gains.
Milk is intended to solve one of the biggest barriers to successful implementation of big data analytics processes – how efficiently programs gather the relevant data.
MIT explained that traditional memory management is based on the 'principle of locality' – that is, if a program requires a certain piece of data stored in a specific location, it is also likely to need the neighbouring data, so it will fetch this at the same time.
However, this assumption no longer applies in the era of big data, where programs frequently require scattered chunks of data that are stored arbitrarily across huge data sets. Since fetching data from main memory is the major performance bottleneck in today's computer chips, having to do so more frequently can slow programs down considerably.
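To illustrate why scattered access is so costly, the following Python sketch counts how many distinct cache lines a program touches when it reads 1,024 neighbouring items compared with 1,024 items spread at random across a large data set. The 64-byte line size, the 8-byte item size and the helper name lines_fetched are assumptions chosen for illustration, not figures from the MIT work, and the model is idealised: it ignores cache capacity and hardware prefetching.

```python
import random

LINE_SIZE = 64  # bytes fetched per memory request (assumed, a common cache-line size)
ITEM_SIZE = 8   # bytes per data item (assumed)


def lines_fetched(indices):
    """Count the distinct cache lines touched by the given item indices."""
    return len({(i * ITEM_SIZE) // LINE_SIZE for i in indices})


# 1,024 neighbouring items: each fetched line carries eight useful items.
sequential = list(range(1024))

# 1,024 items scattered across a ten-million-item data set.
rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = rng.sample(range(10_000_000), 1024)

seq_lines = lines_fetched(sequential)
scat_lines = lines_fetched(scattered)

print("sequential access fetches", seq_lines, "lines")
print("scattered access fetches", scat_lines, "lines")
```

In the sequential case every fetched line is fully used. In the scattered case nearly every item needs a fetch of its own, and most of each fetched line goes unused.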
Vladimir Kiriansky, a PhD student in electrical engineering and computer science and first author on the new paper, explained that returning to the main memory bank for each piece of information is highly inefficient. He said: "It's as if every time you want a spoonful of cereal, you open the fridge, open the milk carton, pour a spoonful of milk, close the carton, and put it back in the fridge."
Milk aims to overcome this limitation through batch processing. It adds a few commands to OpenMP, an extension of languages such as C and Fortran that makes it easier to write code for multicore processors.
A programmer using Milk adds a few lines of code around any instruction that iterates through a large data collection looking for a comparatively small number of items. Milk's compiler then works out how to manage memory accordingly.
Using Milk, when a core needs a piece of data, instead of requesting it – and any adjacent data – from main memory, it adds the data item's address to a locally stored list of addresses. When this list is long enough, all the chip's cores pool their lists, group together those addresses that are near each other and redistribute them to the cores. This means that each core requests only data items that it knows it needs and that can be retrieved efficiently.
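The pooling step described above can be sketched in a few lines of Python. This is a simplified model of the idea and not Milk's actual implementation: the function name pool_and_redistribute, the 64-byte line size and the policy of giving each core a contiguous run of lines are all assumptions made for illustration.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache-line size in bytes


def pool_and_redistribute(per_core_requests, num_cores):
    """Pool each core's deferred addresses, group neighbours, and share them out.

    per_core_requests holds one list of addresses per core, as collected
    locally before the pooling step.
    """
    # Step 1: pool every core's local list into a single collection.
    pooled = [addr for requests in per_core_requests for addr in requests]

    # Step 2: group addresses that fall on the same cache line.
    groups = defaultdict(list)
    for addr in pooled:
        groups[addr // LINE_SIZE].append(addr)

    # Step 3: give each core a contiguous run of lines, so that
    # addresses near each other end up on the same core.
    lines = sorted(groups)
    chunk = -(-len(lines) // num_cores)  # ceiling division
    assignment = []
    for core in range(num_cores):
        batch = []
        for line in lines[core * chunk:(core + 1) * chunk]:
            batch.extend(sorted(groups[line]))
        assignment.append(batch)
    return assignment


# Two cores, each holding addresses that land on three different lines.
before = [[4096, 8, 130], [16, 4100, 128]]
after = pool_and_redistribute(before, num_cores=2)
print(after)
```

Before pooling, each core in this example would touch three separate lines, six fetches in total. After redistribution the first core handles the four addresses on the two low lines and the second core handles the two addresses on the remaining line, so three fetches cover the same six items.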
Matei Zaharia, an assistant professor of computer science at Stanford University, noted that although many of today's applications are highly data-intensive, the gap in performance between memory and CPU means they are not able to fully utilise current hardware.
"Milk helps to address this gap by optimising memory access in common programming constructs. The work combines detailed knowledge about the design of memory controllers with knowledge about compilers to implement good optimisations for current hardware," he added.