Data quality: How do you keep data accurate and timely?

How damaging can an out-of-date phone number or a duplicated, misspelled name really be? As most data architects will confirm, poor data can be a very costly issue for businesses; a recent study from Gartner suggests poor-quality data costs organizations an average of USD 9.7 million per year. Meanwhile, a report by the Royal Mail highlights that 6% of UK businesses’ annual revenue is lost due to inaccurate data.

There are two key aspects to data quality: data cleansing and data governance. Both are integral to ensuring your business users can garner accurate, intelligent metrics from their data, but they serve different functions.

According to a recent Big Data Survey, three in four organizations face challenges with keeping their data lakes synchronized with ever-changing data sources, resulting in poor-quality datasets. With so many sources of data, it can be an expensive issue that is increasingly hard to solve.

So how can businesses ensure their data is valid, and avoid losing revenue to out-of-date information?

Does data integration address accuracy?

Data quality, meaning the accuracy of your stored information, is a core function of Extract, Transform, Load (ETL) services for data integration: ensuring addresses make sense and fields are unique, and cross-checking for duplicated entries and misspellings. These data quality issues can be dealt with at the point you transfer your data from one source to another.

An ETL tool can perform the majority of the data quality functions; the ‘Transform’ step modifies data to fit integrity constraints or standards of the receiving system.
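To make the ‘Transform’ step concrete, here is a minimal sketch in Python of the kind of cleansing it might perform. The field names (`name`, `phone`), the helper functions and the sample records are all hypothetical, and a real ETL tool would apply far richer rules. This sketch only normalizes formatting and removes exact duplicates; it does not attempt to correct misspellings.

```python
import re


def normalize_name(name: str) -> str:
    """Collapse repeated whitespace and use title case."""
    return " ".join(name.split()).title()


def normalize_phone(phone: str) -> str:
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)


def transform(records):
    """Normalize each record, then drop duplicates on (name, phone)."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            "name": normalize_name(record["name"]),
            "phone": normalize_phone(record["phone"]),
        }
        key = (row["name"], row["phone"])
        if key in seen:
            continue  # same person and number already loaded
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical input: the first two rows are the same customer typed differently.
raw = [
    {"name": "jane  smith", "phone": "020 7946 0000"},
    {"name": "Jane Smith", "phone": "(020) 7946-0000"},
    {"name": "Raj Patel", "phone": "0161 496 0000"},
]
cleaned = transform(raw)
```

Here the two differently typed entries for the same customer collapse into one record, so only two rows reach the receiving system.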

On top of this, however, modern organizations create data across a vast number of business functions. You may need to combine stock inventory from platforms like SAP with HR data, and this with finance and accounts data. To add more complexity, there’s now a spotlight on social media data and organic brand authority; businesses want a 360-degree view.

For this information to provide any correlative value, these data sources need to be unified and centralized.

Do you have the most current information?

Unlike data cleansing, which fixes individual records, data governance covers the overall availability, usability, integrity and security of the data used in your enterprise.

“Governance is as much about using the wisdom of the crowd to get the right data to the right person as it is locking down the data from the wrong person.”
Ellie Fields, Sr. Director of Development, Tableau

For example, say you’ve cleaned your data and you know it’s in excellent shape, with an accurate record for each field, sitting in the right format. But how do you make sure this dataset, potentially being used to make high-level strategic business decisions, is the most current?

The most common headache for data architects handling such vast datasets stored across multiple locations is the age of the data. When information is extracted from its central source, and subsequently saved into any number of personal BI tools or downloaded as Excel exports, there can be multiple layers of modifications and inconsistencies.

These datasets can go through endless iterations; it can be impossible to trace where your information was originally pulled from, and for what purpose. This can entirely invalidate your business users’ insights, and cause serious financial drain to the organization.

“Data governance is assessing where your data originated, deciding if it’s safe to use, how up-to-date it is, and whether it is certified as accurate for the company to use.”
Roger Gaskell, Chief Executive Officer, Kognitio
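The questions in that quote (where did the data originate, how up-to-date is it, is it certified) can be captured as metadata that travels with each dataset. The following Python sketch is one possible way to do that; the `DatasetRecord` class, its field names, the one-day freshness limit and the sample dates are all assumptions made for illustration, not a feature of any particular governance product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DatasetRecord:
    name: str
    source: str             # where the data originated
    extracted_at: datetime  # when it was pulled from that source
    certified: bool         # whether it has been signed off as accurate


def is_fit_for_use(record, now, max_age=timedelta(days=1)):
    """Return True only if the dataset is certified and recent enough."""
    age = now - record.extracted_at
    return record.certified and age <= max_age


# Hypothetical examples: a fresh central extract and an old Excel export.
now = datetime(2024, 1, 10, 12, 0)
fresh = DatasetRecord(
    "sales", "central_warehouse", datetime(2024, 1, 10, 6, 0), True
)
stale_export = DatasetRecord(
    "sales_copy", "excel_export", datetime(2023, 12, 1, 9, 0), False
)
```

With a record like this attached, a report can refuse to run on `stale_export`, while `fresh` passes both the certification and the age check.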

For example, multiple executives could conclude that their segment of the business is profitable, using very different means of calculating that metric. How do you ensure that this process is managed, and all branches of the organization are using the correct data?

Your starting point could be a data quality audit, to assess the current condition of your database. Then, you need to ensure your users use common analyses, or the intelligence they obtain could be meaningless.
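One way to picture a common analysis is a single, shared definition of a metric that every team calls, rather than each executive calculating profitability their own way. The sketch below is purely illustrative: the margin formula, the zero threshold and the segment figures are assumptions, and your organization would agree its own definition.

```python
def profit_margin(revenue: float, costs: float) -> float:
    """The single agreed definition: (revenue - costs) / revenue."""
    if revenue == 0:
        raise ValueError("revenue must be non-zero")
    return (revenue - costs) / revenue


def is_profitable(revenue: float, costs: float, threshold: float = 0.0) -> bool:
    """Every segment is judged against the same definition and threshold."""
    return profit_margin(revenue, costs) > threshold


# Hypothetical figures for two business segments.
segments = {
    "retail": {"revenue": 200.0, "costs": 150.0},
    "online": {"revenue": 100.0, "costs": 120.0},
}
report = {name: is_profitable(**figures) for name, figures in segments.items()}
```

Because both segments go through the same function, a disagreement about whether a segment is profitable becomes a question about the input data, not about the method.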

For this common means of business analysis, you need a central data repository, like Hadoop.

You need one version of the truth

Data governance is critical because more organizations rely on their data to make business decisions, optimize operations, create new products and services, and improve profitability.

The principal threat to data quality and governance is users acquiring and using datasets in inconsistent ways. In Hadoop, you can pull together your disparate data sources, and create one central data lake. This gives you one version of the truth.

“The beauty of Hadoop is that it brings all the data from your systems together in its raw, initial state, and stores it to a Hadoop cluster. Then, within Hadoop you can process it up to a validated, clean, and cleansed dataset that is suitable for analysis.”
Roger Gaskell, Chief Executive Officer, Kognitio

Most data integration tools allow for data cleansing directly in Hadoop. This means your users have the raw source data alongside the cleansed ‘gold layer’ of data, ready for analysis.
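The idea of keeping raw data next to a cleansed gold layer can be sketched in a few lines of Python. This example uses a temporary local directory as a stand-in for a Hadoop cluster, and the `raw/` and `gold/` folder names, the `customers.jsonl` file and the email-based validation rule are all hypothetical. The point it illustrates is that the raw data is never modified; the gold layer is derived from it.

```python
import json
import tempfile
from pathlib import Path


def build_gold_layer(lake_root: Path) -> int:
    """Read raw records, write validated, de-duplicated ones to gold/."""
    raw_file = lake_root / "raw" / "customers.jsonl"
    gold_dir = lake_root / "gold"
    gold_dir.mkdir(parents=True, exist_ok=True)

    seen = set()
    kept = []
    for line in raw_file.read_text().splitlines():
        row = json.loads(line)
        email = row.get("email", "").strip().lower()
        if "@" not in email or email in seen:
            continue  # skip invalid or duplicate entries
        seen.add(email)
        kept.append({"email": email})

    gold_file = gold_dir / "customers.jsonl"
    gold_file.write_text("\n".join(json.dumps(r) for r in kept))
    return len(kept)


# Hypothetical raw data: one valid email, one duplicate of it, one invalid.
lake = Path(tempfile.mkdtemp())
(lake / "raw").mkdir()
(lake / "raw" / "customers.jsonl").write_text(
    '{"email": "A@example.com"}\n'
    '{"email": "a@example.com "}\n'
    '{"email": "not-an-email"}\n'
)
gold_count = build_gold_layer(lake)
```

After this runs, the raw file still holds all three original lines for traceability, while the gold file holds only the single validated record.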

The missing piece of the puzzle, however, is letting your business users interact directly with this gold layer. Software like Kognitio gives your business users the flexibility and speed to get their bespoke insights directly, without needing to pull the data out of the Hadoop cluster.

Find out more about maintaining your business’s data governance by letting your users get faster answers directly from your data in Hadoop, and potentially saving your business millions.
