How can you keep your data lake as clean as possible?

One of the key trends in big data analytics over the last couple of years has been the concept of the 'data lake'. The idea is to place all of a business's incoming data in a single location, from which it can be studied at will.

But while this may seem simple in principle, the reality is often far different. If organisations are not careful with how they manage it, a data lake can quickly become clogged with poor-quality information, irrelevant details and inaccuracies, leaving it looking more like a swamp.

So how can this be avoided? In a new report, Constellation Research explained that many businesses fail to appreciate that a data lake should not be viewed as a replacement for a traditional data warehouse, which is able to support predictable production queries and reports against well-structured data. 

Instead, it noted: "The value in the data lake is in exploring and blending data and using the power of data at scale to find correlations, model behaviors, predict outcomes, make recommendations, and trigger smarter decisions and actions."

Many implementations fail because the business does not put in place a clear structure to organise the data within its lake. There may be an assumption that simply deploying a Hadoop framework is enough to create an effective data lake, but in reality this is not the case.

Constellation Research vice-president and principal analyst Doug Henschen, who authored the report, noted that despite its name, it would be a mistake to consider a data lake a single, monolithic repository into which data can be dumped without thought or planning.

Instead, businesses should look to split their data lake into 'zones' based on the profile of each piece of information.

"If Hadoop-based data lakes are to succeed, you'll need to ingest and retain raw data in a landing zone with enough metadata tagging to know what it is and where it's from," Mr Henschen wrote.

For instance, businesses should set up zones for refined data that has been cleansed and is ready for broad use across the business. There should also be zones for application-specific data, developed by aggregating, transforming and enriching data from multiple sources. A zone for data experimentation is also recommended.

This will not be an easy goal to achieve, Mr Henschen stated, as it will require businesses to pay much closer attention to data as it enters the company, as opposed to simply ingesting everything and then looking to categorise it later.

Although the Hadoop community has been working on a range of tools to help with the ingestion, transformation and cataloguing of data, many IT professionals are still not hugely familiar with these. However, Mr Henschen said there is good news on this front, as a broader ecosystem has emerged around Hadoop, aiming to tackle the problems associated with managing data lakes.
