Take care before diving into the data lake

With growing expectations surrounding big data analytics and what the technology can do for a business, one solution that an increasing number of companies are looking at is the concept of the 'data lake' as a repository for their information.

As both the volume of data and the variety of sources are on the rise, being able to store, access and analyse this information efficiently is a top priority for businesses, and the data lake is often identified as the solution.

Nick Heudecker, research director at Gartner, explained that in broad terms, data lakes are being presented to firms as enterprise-wide data management platforms that are able to analyse information in its native format.

"The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format", he explained. "This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organisation."

But while the principle of data lakes may appeal to many companies that are struggling to get a handle on their growing amounts of data, Gartner warned there are many pitfalls for the unwary organisation. Firms must take care not to get swept up in the hype, and should evaluate carefully what they can expect to achieve should they go down this route.

One of the biggest potential issues with such a solution is that it assumes a certain level of knowledge from users. Mr Heudecker explained that the tools expect individuals to recognise the context of the data, to know how to merge and reconcile different data sources, and to understand the incomplete nature of datasets, regardless of structure.

While this may well be true for those with the right skills and experience, such as data scientists, the majority of business users will lack this level of sophistication, and developing these capabilities can be expensive and time-consuming.

Therefore, firms need to question what capabilities they have available before they plunge head-first into a data lake solution. If such a move is made without full consideration of what skills and support structures are available – both internal and external, such as a big data analytics partner – this will increase the likelihood that a big data project will fail and not provide the expected value.

As such, a better option for many firms may be to test the waters of the data lake with a small-scale pilot, rather than moving all their data from familiar silos to a single repository in one go. This will not only give firms a better idea of what results they may be able to achieve with such a solution, but also identify where the gaps in their understanding lie that will need to be addressed.

Mr Heudecker added that while data lakes typically begin as ungoverned data stores, this is unlikely to be an end in itself. He said: "Meeting the needs of wider audiences requires curated repositories with governance, semantic consistency and access controls – elements already found in a data warehouse."