Harvard seeks to tackle big data storage challenges

As a growing number of companies look to expand their big data analytics operations in the coming years, one key consequence will be an explosion in the amount of data that businesses have to store.

Finding cost-effective storage solutions will therefore be essential if such initiatives are to succeed. While technologies such as cloud computing may be the answer for many businesses today, new and improved solutions may be required as data volumes continue to grow at an exponential rate.

This is why engineers at Harvard University have been building new infrastructure that can cope with this influx of information and support critical research taking place throughout the institution.

James Cuff, Harvard assistant dean and distinguished engineer for research computing, said: "People are downloading now 50 to 80 terabyte data sets from NCBI [the National Center for Biotechnology Information] and the National Library of Medicine over an evening. This is the new normal. People [are] pulling genomic data sets wider and deeper than they’ve ever been."

He added that what were once considered cutting-edge practices dependent on large volumes of data are now standard procedure.

The need for large-scale storage capabilities is therefore obvious. That's why earlier this year, Harvard received a grant of nearly $4 million from the National Science Foundation to develop a new Northeast Storage Exchange (NESE). The project is a collaboration among five universities in the region, with the Massachusetts Institute of Technology, Northeastern University, Boston University, and the University of Massachusetts also taking part.

NESE is expected not only to provide enough storage capacity for today's massive data sets, but also to give the participating institutions the high-speed infrastructure needed to retrieve data quickly for analysis.

Cuff stated that one of the key elements of NESE is its scalable architecture, which will ensure it can keep pace with growing data volumes in the coming years. He noted that by 2020, officials hope to have more than 50 petabytes of storage capacity available at the project's Massachusetts Green High Performance Computing Center (MGHPCC).

John Goodhue, MGHPCC's executive director and a co-principal investigator of NESE, added that he also expects the speed of the connection to collaborating institutions to double or triple over the next few years.

Cuff noted that while NESE could be seen as a private cloud for the collaborating institutions, he does not expect it to compete with commercial cloud services. Instead, he said, it gives researchers a range of data storage options for their big data initiatives, depending on what they hope to achieve.

"This isn't a competitor to the cloud. It’s a complementary cloud storage system," he said.