
By now, big data analytics has well and truly passed the hype stage and is becoming an essential part of many businesses’ plans. However, while the technology is maturing quickly, there are still a great many questions about how to go about implementing it.

One strategy that’s appearing more often these days is the concept of the ‘data lake’. For the uninitiated, this involves taking all the information that a business gathers – often from a large range of wildly diverse sources and data types – and placing it into a single location from which it can be accessed and analysed.

Powerful tools such as Hadoop have made this a much more practical solution for all enterprises – so it’s no longer just something limited to the largest firms. But what will it mean for businesses in practice? Understanding how to make the most of a data lake – and what pitfalls need to be avoided – is essential if businesses are to convert the potential of their data into real-world results.

One repository, one resource

One of the key features of a data lake is that it enables businesses to break free from traditional siloed approaches to data warehousing. Because the majority of the data a company possesses is available in the same place, they can ensure they have a full picture when building analytical or reporting models.

As well as giving more certainty about the accuracy of results, taking a data lake approach means businesses will find it much easier and cheaper to scale up as their data volumes and usage grow. And not only is this scalability cheap, a strong data lake is capable of holding amounts of raw data that would be unthinkable for a traditional data warehouse.

A data lake will be particularly useful if a business is dealing with large amounts of unstructured data or widely varying data, where it can be difficult to assign a specific, known set of attributes to a piece of information – such as social media or image data.

Clear results need clear waters

However, in order to make the most of this, businesses still need to be very careful about the information they pour into their data lakes. Inaccurate, vague or outdated information will end up seriously compromising the results a company sees.

In order to deliver effective insights, the water in your data lake needs to be as crystal clear as possible. The murkier it gets, the worse it will perform. Think of it this way: you wouldn’t drink the water from a swamp, so why should you trust the results from a dirty data lake?

This concept – most famously known as garbage in, garbage out – can dictate the success or failure of a data lake strategy. Therefore, while it may be tempting to simply pump everything into a data lake and worry later about what it all means, taking the time to assess and clean all the information first will pay dividends when the time comes to review analytics results. Even something as straightforward as ensuring everything has accurate metadata attached can make a big difference when it comes to wading through the lake to qualify data for use in a study.
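
As a rough illustration of the metadata point, the sketch below checks a record's metadata before it is allowed into the lake. It is only a minimal example under stated assumptions: the field names (`source`, `owner`, `collected_at`), the one-year freshness limit and the `metadata_problems` helper are all hypothetical, and a real deployment would define its own rules.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical metadata fields; a real lake would define its own schema.
REQUIRED_FIELDS = ("source", "owner", "collected_at")
# Assumed freshness limit, chosen purely for illustration.
MAX_AGE = timedelta(days=365)


def metadata_problems(metadata, now=None):
    """Return the reasons a record's metadata is not ready for the lake."""
    now = now or datetime.now(timezone.utc)
    # Flag every required field that is absent or empty.
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if not metadata.get(field)]
    # Flag records collected longer ago than the assumed limit.
    collected_at = metadata.get("collected_at")
    if collected_at and now - collected_at > MAX_AGE:
        problems.append("outdated")
    return problems


if __name__ == "__main__":
    record = {
        "source": "social_media",
        "collected_at": datetime(2020, 1, 1, tzinfo=timezone.utc),
    }
    # An empty list would mean the record is ready to load.
    print(metadata_problems(record))
```

Records that come back with an empty list can be loaded as they are. Anything else can be held back for review rather than left to muddy the water.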

Making sure you don’t drown

Resisting the urge to simply dump data into a data lake, in the hope that sheer volume will overcome any shortcomings in quality, will be one of the key factors in making a data lake strategy a success. But it’s not the only principle behind a strong solution.

With so many sources of data in the same place, it’s easy to become overwhelmed and end up drowning in all the information potential. In order to get quality results, users must make sure they are asking appropriate questions of the right combinations of data. The more focused and specific analysts can make their queries, the more likely they are to see valuable outcomes.

Meanwhile, other risks of the data lake include the varied security implications. With so much data coming together in one place, data lakes may be tempting targets for criminals, so strong protections, access controls and auditing are a must.

Data lakes are one of the most hyped technologies of the big data revolution, so many firms will be interested in what they can offer. But, as the old adage says, it’s important to look before you leap; otherwise you could be diving head-first into a world of trouble.
