Data, whether it’s Big or small, is naturally row-based. That’s not a technical revelation. Quite the contrary; it’s how people think. If you’re having dinner or drinks with a group of friends after work, you see each member as an individual; you know their first and last names. You may know the names of their spouse and children, as well as other information about them. But all of that information naturally coalesces into a virtual container about that person. It’s simply a natural way of thinking about things.
That’s how source systems organize and collect information. Much ado has been made in the world of analytical engines about speeding up performance by reorganizing the data into columns, instead of its naturally occurring row-based structures. The columnar-based approach promises fast query speeds on vast amounts of data, and it delivers on that promise, but at a significant cost that is no longer worth it. Here’s why.
Column-based data is not a new concept. Mainframe systems used some element of storing data in columns back in the 1960s, on purpose-built platforms for specialized operations. It was essentially used to enable fast retrieval of data by creating an index, just like an index in a book. Using this index, again as someone would do with a book, fragments of subjects and mentions of concepts were referenced and pointed to in another logical order – e.g., alphabetically, as opposed to the chapters in which they were originally organized.
Think about any transaction of which you have been a part: buying something at a retail store, taking money from an ATM, even booking a suspected criminal (although I hope you’ve not been part of that). Every one of these transactions follows the same format: “SMITH, JOHN; 26 Queens Boulevard, Apartment F; 10129; BURGLARY; May 26, 2001; 10:30pm ET,” and so on. Again, this is row-based in its approach, referring to one individual and his complete record.
To achieve their speed, column-store database systems essentially create many, many indexes – so many indexes, in fact, that they themselves become the database. Rather than being a new and better “size” alternative in a “one-size-fits-all” world, columnar databases perform unnatural acts of indexing on row-based data stores for one primary reason: to reduce the number of query-slowing input/output (I/O) calls against spinning hard drives, which is very much a 2005-type problem. Eight years later, we submit there’s a better way to approach this challenge, especially in the era of Big Data.
Spinning disks – the venerable hard drive
At one time, hard drive spindle speeds were a serious data analysis bottleneck. While processors and CPU cores gained speed by leaps and bounds, spinning-disk hard drives lagged in their ability to quickly find and read data. Higher capacity drives only exacerbated the speed gap between disks and processor chips. Solid state disk (SSD), while much faster, offered little relief as it was low in capacity and high in price.
Because this problem persisted for years, the cumbersome data manipulations and associated IT complications of columns seemed worthwhile. But technology, like time, marches on and the calculus has changed – significantly.
Today, even low-cost commodity server platforms (aka “industry-standard servers”) have access to terabytes of lightning-fast Random Access Memory (RAM), making analytic reads from disks strictly optional. Executing through hyper-threaded, multi-core processors, a new generation of in-memory, massively parallel processing databases – architected from the ground up to take advantage of abundant RAM – enables row-based databases to deliver the query performance of column-based databases with none of the cost and complications brought on by the previously described unnatural acts of data mutilation. But that’s not the only advantage.
Limiting your options
Columnar databases impose a subtle, but very real, constraint on query parameters. To achieve their goal of limiting disk I/O, columnar databases query against what are essentially indexed summaries of the original row-based data store. Each row of data must first be split into its component column values, and each of those values must then be written to a different place within the database. These indexing schemas must be correct to work properly, which can require multiple iterations to determine formats. Even when they do work properly, a close examination of the process reveals that user queries are frequently limited by the index scheme itself. Without access to the original data in its original form, true ad hoc querying of the data is not possible. Rather, users are restricted to queries that conform to categories of comparison, called projections, anticipated by the original indexing.
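The split described above can be sketched in a few lines of Python. This is a toy illustration only, not how any particular columnar product stores data: the records, the field names and the helper functions `to_columns` and `row_at` are all hypothetical, and every record is assumed to carry the same fields.

```python
from typing import Any

# Hypothetical booking records, loosely modeled on the example row earlier.
rows: list[dict[str, Any]] = [
    {"last": "SMITH", "first": "JOHN", "zip": "10129", "charge": "BURGLARY"},
    {"last": "DOE", "first": "JANE", "zip": "10001", "charge": "FRAUD"},
    {"last": "ROE", "first": "RICHARD", "zip": "10129", "charge": "BURGLARY"},
]


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Split each row into its column values, keeping one list per column."""
    columns: dict[str, list[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_at(columns: dict[str, list[Any]], i: int) -> dict[str, Any]:
    """Reassemble one complete record by reading position i of every column."""
    return {name: values[i] for name, values in columns.items()}


columns = to_columns(rows)

# A question about a single column only has to read that one list.
burglaries = sum(1 for charge in columns["charge"] if charge == "BURGLARY")

# Getting a whole record back means visiting every column.
first_record = row_at(columns, 0)

print(burglaries)
print(first_record)
```

In this sketch, counting one attribute is cheap in the columnar layout, while recovering a person's complete record means a lookup in every column list.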
In today’s world, that simply no longer works. People think of the oddest questions, and column indexing cannot cover each and every scenario. This is especially true as petabytes of additional data are added every single day; business analysts in the “Facebook Generation” know that every click, fact, status, sensor, etc. is tracked, stored and should be available for processing. Those analysts, who are not now (nor will they ever be) Database Specialists, write ad hoc queries in their favorite Business Intelligence tools, in standard user interfaces, or even just in Excel that do not conform to a standard that is friendly to any locked-in schema.
This columnar requirement to alter the original data structure introduces other practical issues, not least of which is operational latency when attempting to conduct more complex queries to perform more sophisticated analysis. For example, imagine a retailer using a columnar data warehousing system wants to run complex market basket analysis-type queries on a large data set, say something more than 5TB. Due to multiple fact tables and complex joins, it can take days or longer to get a columnar database properly set up, since the schema must be constructed multiple times to get it working right before data can even be loaded and analyzed. In a world where insights are increasingly required on an immediate, “need it now” basis, this presents companies with an untenable situation.
Further, updating the data warehouse with fresh data is not a straightforward process, causing columnar database vendors to employ complicated tricks to do updates in a reasonable timeframe. Overall, from an IT management perspective, columnar data makes life complicated. Complexity is costly. For companies seeking the agility and advantages of near-real-time analysis, this type of latency between data collection and data analysis is a real problem, only exacerbated by the information fire hose effect that is “Big Data.”
In addition, the need to index and project can significantly diminish another one of the hyped benefits of columnar databases – compression. Because they develop indexes of the actual data, columnar databases are touted as providing exponential levels of data compression, seemingly a very attractive proposition for companies dealing with massive amounts of information. What’s less publicized, however, is the effect that creating multiple indexed projections has on this benefit. As data sets grow larger and more complex, the need to perform more complex queries scales along with them. This, in turn, multiplies the number of projections that are created. Fairly quickly, this can significantly reduce the initial compression benefit of indexing. In fact, many of the columnar database purveyors recommend having as much disk as the uncompressed data for this very reason.
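A back-of-the-envelope calculation shows how quickly the effect can add up. The figures below are purely illustrative assumptions, not measurements from any product: a 5 TB raw data set, a 5:1 compression ratio, and the simplification that every projection stores its own full, equally compressed copy of the data. The function name `projected_footprint_tb` is hypothetical.

```python
def projected_footprint_tb(raw_tb: float, compression_ratio: float,
                           projections: int) -> float:
    """Disk needed if every projection keeps its own compressed copy."""
    if compression_ratio <= 0 or projections < 1:
        raise ValueError("need a positive ratio and at least one projection")
    return raw_tb / compression_ratio * projections


RAW_TB = 5.0   # assumed raw data size
RATIO = 5.0    # assumed 5:1 compression

for n in (1, 2, 5):
    footprint = projected_footprint_tb(RAW_TB, RATIO, n)
    print(f"{n} projection(s): {footprint:.1f} TB on disk")
```

Under these assumptions, five projections bring the footprint back to the size of the uncompressed data. Real projections may cover only a subset of columns, so the break-even point will differ in practice.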
An example: online gaming analytics
A leading online gambling operator in the UK wanted to up its player analytics game. It was initially interested in columnar database technology because of the perceived uniqueness of high compression rates.
To accomplish what the operator wanted, data would have to be duplicated over and over to build so-called “projections” in the database. The benefit of a high level of compression was therefore lost, as the operator still had to have as much disk available as the total amount of uncompressed data.
Dust in the wind?
So, as spinning disks and the other limitations that columnar databases grew up with fade in the face of continued technological progress, will columnar data analysis disappear? In a word, no – it’s more likely columnar will become a feature or capability within a larger, more capable solution.
For instance, in applications where a careful effort has been made to tune the data for query performance and there is a need to repeatedly run the exact same set of real-time queries, columnar indexing may make sense. Such cases are, however, the exception and not the rule. More typical for businesses grappling with getting the most out of their Big Data investment is a scenario where a broad range of user types seek answers to ad hoc, often fairly complex questions against at least near-real-time information. For the reasons explained above, adopting a columnar architecture as the data warehouse engine in such a scenario carries significant operational cost and complexity.
So why go to all the trouble of putting data into columns if you don’t have to? Columnar databases were invented to solve yesterday’s problems. It’s time to look forward.
In-memory analytics: the cure for the common column
By comparison, an in-memory analytical platform can augment any data storage infrastructure and maintain the original row-based data structure while letting multiple users run queries at train-of-thought speed. Unbounded by the constraints of columnar indexing, users are free to explore any and all possible relations present within the data. Further, because the data structure is preserved and information passes quickly and easily from collection point to data warehouse, users are assured they’re working against near real-time data that accurately depicts the current lay of the land. This is often referred to as “performance at the glass,” and reflects the immediacy that drives many companies’ analytical needs today.
The future is always arriving
Even now, industry leaders Intel and AMD are reportedly working on new CPU technology that would enable 46-bit address spaces, overcoming a longstanding limit and thus allowing up to 64 terabytes of addressable RAM on a single server.
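The 64-terabyte figure follows directly from the address width, as a quick check shows. The sketch assumes byte-addressable memory and reads “terabytes” in the binary sense (one terabyte = 2^40 bytes).

```python
ADDRESS_BITS = 46
BYTES_PER_TIB = 2 ** 40  # one binary terabyte (tebibyte)

# Each distinct address names one byte, so 46 bits reach 2**46 bytes.
addressable_bytes = 2 ** ADDRESS_BITS
addressable_tib = addressable_bytes // BYTES_PER_TIB

print(addressable_tib)  # 2**46 / 2**40 = 2**6 = 64
```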
Memory technology itself is being pushed toward exciting new advances with direct application in and benefit for data warehousing and analytics. Dynamic RAM today is fast and getting faster on a regular basis, but it’s a volatile medium. If you lose your power, you lose your in-flight data, making disks a necessary safe harbor for persistence. But that present-day reality looks set for a slide into the rearview mirror, as years of research into different forms of non-volatile RAM (NVRAM) appear poised to alter the commercial landscape for fast, persistent, enterprise-class memory. The NVRAM just ahead in the commercialization pipeline will be a significant leap beyond the flash memory of today, which, though offering faster performance than spinning disks, is not up to DRAM speeds and suffers from data reliability issues under constant, heavy use. The next-generation technologies for NVRAM, such as phase-change memory (PRAM), are promising to deliver something very close to universal memory, offering performance that eclipses both RAM in speed and spinning disks in data durability.
Make no mistake; these technologies will not arrive via special delivery next week – or even next year – shrink-wrapped and ready for deployment at scale at fire sale prices. The trend, however, is clear and inexorable toward more and persistent memory, more efficient use of increasingly capable multi-core CPUs and increased bandwidth tying these platforms together. For these reasons, loading data in-memory on an analytical platform that augments an existing infrastructure represents a far superior solution for today – and tomorrow.
Find out more about how the Kognitio Analytical Platform provides scalable in-memory processing for advanced analytics at www.kognitio.com/analyticalplatform