This week sees one of the biggest events of the year for the big data industry, as experts from around the world converge on New York for the annual Strata + Hadoop World conference. Taking place at the city's Javits Center between October 15th and 17th, the event will see almost 200 sessions run across the three days, covering all the latest advances in Hadoop technology.
Keynote speakers at this year's event include Cloudera's chief technologist Eli Collins, who will be leading a session on how businesses can harness the immense power of big data, while avoiding the pitfalls that come along with it.
Elsewhere, Roger Magoulas, director of market research at O'Reilly Media and chair of the Strata conferences, will host a debate on the role of coding in big data. The session will ask whether it is possible to be a successful data scientist armed only with graphical tools such as Excel or Tableau to explore data, or whether coding skills are required to make the most of tools like Hadoop, R and Spark.
Throughout the event, sessions will take place featuring case studies, best practice guides and Q&A events covering all aspects of big data technology, including how to scale up Hadoop deployments, how to visualise the data firms receive from their operations, and what the future of the industry is set to look like.
This year, one of the big topics for discussion will be Apache Spark, and how businesses will look to incorporate it and other in-memory processing solutions into their Hadoop deployments. As the volume and variety of data firms collect continue to grow at an exponential rate, demand for faster processing will rise, and traditional disk-based technologies may begin to struggle under the strain.
Solutions such as Spark look to solve this by holding working data in memory rather than repeatedly reading it from disk, taking advantage of hardware advances that have made memory cheaper and faster. Last month, Hortonworks announced it had extended Spark to improve the way it integrates with Hadoop's YARN system.
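The advantage of in-memory processing is easiest to see with iterative workloads, which re-read the same input many times. The following plain-Python sketch (a conceptual illustration, not Spark code; the file layout and function names are invented for the example) mimics the difference between a disk-based engine that reloads its input on every pass and an in-memory engine that, like Spark's RDD caching, parses the data once and reuses it:

```python
import os
import tempfile

def make_dataset(path, n=200_000):
    """Write a simple line-per-record dataset to disk."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")

def load(path):
    """Read and parse the dataset from disk -- the expensive step."""
    with open(path) as f:
        return [int(line) for line in f]

def run(path, iterations=5, cache=False):
    """Sum the dataset several times, as an iterative job would.

    With cache=True the parsed records stay in memory and the disk
    read happens once; with cache=False every iteration pays the
    full load-and-parse cost, as a disk-based engine does.
    """
    cached = load(path) if cache else None
    total = 0
    for _ in range(iterations):
        data = cached if cache else load(path)
        total += sum(data)
    return total

path = os.path.join(tempfile.mkdtemp(), "data.txt")
make_dataset(path)

# Both strategies give the same answer; the cached run simply
# skips four of the five disk reads.
assert run(path, cache=False) == run(path, cache=True)
```

Both paths produce identical results; the payoff of caching is purely in avoided I/O, which is why cheaper memory makes engines like Spark attractive for repeated passes over the same data.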
Some of the capabilities of the technology have already been demonstrated by Databricks, which claimed to have broken the world record for the fastest processing of big data.
The firm, founded by the creators of Spark, completed the third-party benchmarking test GraySort – a distributed sort of 100 terabytes of on-disk data – in just 23 minutes, using 206 machines with 6,592 cores. This smashed the previous best of 70 minutes, set by Yahoo! using a large, open-source Hadoop cluster of 2,100 machines.
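At a high level, a GraySort-style distributed sort works by range-partitioning the records across machines, sorting each partition locally in parallel, and concatenating the results. The sketch below is a toy single-process illustration of that structure (the partition counts, sampling scheme and function names are illustrative assumptions, not Databricks' actual implementation):

```python
import random

def range_partition(records, boundaries):
    """Assign each record to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for r in records:
        i = sum(1 for b in boundaries if r >= b)
        parts[i].append(r)
    return parts

def distributed_sort(records, num_partitions=4):
    # Sample the data to pick partition boundaries, so each
    # "machine" receives a contiguous, roughly equal key range.
    sample = sorted(random.sample(records, min(100, len(records))))
    step = max(1, len(sample) // num_partitions)
    boundaries = sample[step::step][:num_partitions - 1]
    parts = range_partition(records, boundaries)
    # The local sorts are independent, so a real cluster runs
    # them in parallel; concatenating them in boundary order
    # yields a globally sorted output.
    out = []
    for p in parts:
        out.extend(sorted(p))
    return out

data = [random.randrange(10**6) for _ in range(5000)]
assert distributed_sort(data) == sorted(data)
```

Because the expensive local sorts scale out across machines, the wall-clock time is governed largely by how fast each node can shuffle and sort its own partition, which is where Spark's in-memory design pays off.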
Ion Stoica, chief executive of Databricks, commented: "Beating this data processing record previously set on a large Hadoop MapReduce cluster not only validates the work we've done, but also demonstrates that Spark is fulfilling its promise to serve as a faster and more scalable engine for all data processing needs."
With businesses always looking for solutions that enable them to process large amounts of data at greater speed, the achievement may well lead more companies to take a serious look at in-memory processing. No doubt discussions on how best to incorporate it into big data solutions will be particularly prominent at this year's Strata + Hadoop World event.