Businesses looking to make the most of their big data initiatives need to ensure the information entering their systems is as clean and accurate as possible; otherwise, they cannot be confident the results they get are useful.
Phys.org noted that this has become particularly important as data sets grow and analytics software draws information from an ever wider variety of sources.
Typical errors in many companies' data sets include incorrect values, missing entries, aliasing (where information about two distinct entities has been merged in error, for example because two people share the same name) and duplicate entries (where information about a single entity is split across records, for example because that person's name has been spelled in different ways).
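Two of these error types can be flagged automatically. The sketch below, using illustrative records and field names (`name`, `email`), checks for missing entries and for duplicate entries whose names differ only in spelling or punctuation; real data would need a richer normalization step.

```python
# Sketch: flagging missing values and spelling-variant duplicates in a
# list of records. Records and field names are illustrative examples.

def normalize(name):
    """Collapse case and punctuation so spelling variants collide."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def find_issues(records):
    # Missing entries: any record lacking a value for a required field.
    missing = [r for r in records if not all(r.get(f) for f in ("name", "email"))]
    # Duplicate entries: two records whose normalized names collide.
    seen = {}
    duplicates = []
    for r in records:
        key = normalize(r.get("name", ""))
        if key in seen:
            duplicates.append((seen[key], r))
        else:
            seen[key] = r
    return missing, duplicates

records = [
    {"name": "Jane O'Brien", "email": "jane@example.com"},
    {"name": "Jane OBrien", "email": "j.obrien@example.com"},  # spelling variant
    {"name": "Tom Smith", "email": ""},                        # missing entry
]
missing, duplicates = find_issues(records)
```

Normalization this crude would also merge genuinely distinct people with identical names, which is exactly the aliasing problem the article describes; in practice the two error types trade off against each other.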
Data scientists are well aware of the risks of basing decisions on unreliable or incomplete data, so many invest significant resources in data cleansing and preparation to find and fix errors before analysis – a task the era of big data has made far more difficult.
In the past, when data sets were smaller, analysts could manually examine and validate each entry, but at today's scale this is no longer feasible, so businesses typically have to rely on computer-executed algorithms.
There are many reasons why inaccuracies may creep into data sets, with human error among the most common. For instance, users often make mistakes when filling in web forms, such as typos in their addresses. This type of error can be fixed fairly easily by algorithms that verify whether postal codes match street addresses – but not all mistakes are this straightforward to remedy.
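A validator of this kind can be as simple as a consistency check against reference data. In the sketch below the lookup table is a stand-in: a real implementation would consult an authoritative postal-reference data set or address-verification service rather than a hard-coded dictionary.

```python
# Sketch: checking that a postal code is consistent with the stated city.
# POSTCODE_CITY is an illustrative stand-in for a real postal-reference
# data set; the codes and cities shown are examples only.

POSTCODE_CITY = {
    "10001": "New York",
    "94103": "San Francisco",
}

def postcode_matches(record):
    """Return True only if the postal code is known and maps to the city."""
    expected = POSTCODE_CITY.get(record["postcode"])
    return expected is not None and expected == record["city"]

good = {"postcode": "10001", "city": "New York"}
typo = {"postcode": "10001", "city": "San Francisco"}
```

The check can flag an inconsistency but not decide which field is wrong; correcting the record usually takes a further rule or human review.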
"One common place where errors arise is in linking data across data sets," Phys.org stated. "Unless both data sets use a unique identifier – such as a social security number – with each entry, it is challenging to match entries across data sets. There are likely to be entries that wind up linked even though they should be distinct, and entries that are not linked even though they correspond."
Businesses also need to be wary of data gathered from sources such as web forms, where users may feel more inclined to be dishonest. For instance, many people will enter false email addresses because they do not want to receive spam, and while some websites can get around this by using verification tools, this process can be expensive and inconvenient.
"People will provide correct and complete data only if they feel they can trust the data collection," the publication explained. It noted, for example, that the US Census Bureau is able to collect high-quality data because it is able to reassure citizens that what is reported in the census will not be used for tax collection or any other such government purpose, other than statistical reporting.
"The old truism 'garbage in, garbage out' is more apt than ever in this era of complex and gargantuan data sets – and the sometimes weighty consequences of trusting what they seem to imply," Phys.org said.