Companies performing ad-hoc big data analytics have been reminded of the importance of retaining the data used in the process after it is complete.
Speaking at an IT Leaders Forum organised by Computing.com, Alex Chen, director of file, object storage and big data flash at IBM, explained that businesses may need to refer back to this information at a later date. This may be in order to meet regulatory requirements, or simply because people want to investigate what happened and why a particular decision was taken.
At the moment, many organisations are still in the early stages of big data adoption, which means they may be performing a large number of experimental and ad-hoc analyses as they learn how to bring the technology into their everyday operations.
Mr Chen said: "It's likely that someone in a line-of-business [in many organisations] has spun up a Hadoop cluster and called it their big data analytics engine. They find a bunch of x86 servers with storage, and run HDFS."
Many organisations tend to throw this data away after it has been processed in order to keep their systems running efficiently. Mr Chen noted that even in these ad-hoc deployments it is not terabytes but petabytes of data being ingested, and the more data there is to analyse, the longer the analysis takes.
But while deleting this data may keep analytics processes running as fast as possible, it could leave businesses without answers when they need to demonstrate what led to a particular decision.
"Performing analytics generates a lot more metadata, too, and due to regulations or business requirements people may just want to see what happened and why they made certain decisions. So you will need to re-run the analytics that were run before," Mr Chen continued. "So you can't just throw away the data any more."