(Powered by TfL Open Data. Contains OS data © Crown copyright and database rights 2016)
A few weeks back one of my favourite analysts, Merv Adrian, tweeted the following:
“‘Just move it to memory and it will speed up.’ Not so fast (pun intended.) Serious engineering required – even for a KV store.”
I could not help but smile when I saw this. I’ve spent years telling anyone who would listen that putting data into memory doesn’t instantly turn software originally written for disk-based data into “in-memory” software.
In 1988, at White Cross Systems (a pioneer in MPP in-memory systems, which later evolved into Kognitio), we set out to use the concept of MPP (massively parallel processing) to build a database that would support what we now call data analytics, but was then called decision support. Most databases at that time were designed and optimised for transaction processing rather than decision support, so we were effectively starting from scratch. We wanted to build a system that was fast enough to support train-of-thought analysis and could scale linearly to support large and growing data volumes.
We never set out to build an in-memory system, but it became clear to us early on that if we wanted to exploit massive parallelisation, we could not be limited by disk IO speeds. Reading data from slow physical disks seriously limits the amount of parallelisation you can effectively apply to any task, because the CPUs (processors) very quickly become starved of data and everything becomes disk IO bound.
This is the most basic and important point, and it is often missed when talking about in-memory. It’s not putting the data “in memory” that makes things faster. Memory, like disk, is just another place to park the data. It’s the processors, or CPUs, that run the actual data analysis code. Keeping the data in memory gives the CPUs fast access to it, keeping them fed with data and making parallelisation worthwhile.
For this reason we decided to build a system which kept the data of interest in fast computer memory, or RAM (Random Access Memory). In retrospect this was a brave decision to make in the late 80s. Memory was still very expensive, but because we were rather young and naïve, we believed that the price would fall relatively quickly, making it economical to hold large data sets in memory. Ultimately we were right, even if it did take a couple of decades longer than we thought!
The point I’m making is this. When we took the decision to go in-memory, it dramatically changed our code design philosophy. Not being disk IO bound meant we became CPU bound, so code efficiency became hugely important. Every CPU cycle was precious and needed to be used as effectively as possible. For example, in the mid-90s we incorporated “dynamic code generation” into the software, a technique that dynamically turns the execution phase of any query into low-level machine code, which is then distributed across all of the CPUs in the system. This technique reduced code path lengths by a factor of 10 to 100. I am not saying that advanced techniques like machine code generation are essential components of an in-memory system, but I am saying that using an efficient programming language is important when machine cycles matter. So probably not Java.
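To give a flavour of what query-specific code generation means, here is a toy sketch. It is not Kognitio's implementation: Python source stands in for machine code, and the names `interpret_filter` and `generate_filter` are made up for this illustration. The generic path looks up the operator and column for every row; the generated path builds a function specialised to one query and compiles it once.

```python
import operator

# Operators supported by this toy example (an assumption, not a real SQL subset).
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def interpret_filter(rows, column, op, value):
    """Generic path: the operator is looked up again for every single row."""
    return [row for row in rows if OPS[op](row[column], value)]

def generate_filter(column, op, value):
    """Code-generation path: emit source specialised to one query, compile it once."""
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    source = (
        "def _filter(rows):\n"
        f"    return [row for row in rows if row[{column!r}] {op} {value!r}]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["_filter"]

rows = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 900},
]

# Compile once, then reuse the specialised function for every batch of rows.
big_spenders = generate_filter("amount", ">", 100)
matching_ids = [row["id"] for row in big_spenders(rows)]
```

Both paths return the same rows; the difference is how much work is repeated per row. A real engine would emit native code and ship it to every CPU, which this sketch does not attempt.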
Designing code specifically for in-memory has another important benefit: besides being faster, RAM is accessed in a different way to disk. Disks are block access devices; data is accessed in large fixed-size blocks, and moving between blocks is very time consuming. RAM, as the name suggests, allows data to be accessed a single word at a time from anywhere in the address space. Software algorithms that are written specifically to perform database operations on data held in RAM (joins, sorts, aggregations etc.) can exploit this more flexible access method to produce much more efficient code.
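As a small illustration of an algorithm that leans on random access, here is a minimal hash join sketch. It assumes both tables fit in memory; the table and column names are invented for the example. Each probe jumps straight to the matching rows wherever they sit in memory, which is exactly the access pattern a block device handles badly.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Join two lists of dict rows on `key`, assuming both fit in RAM."""
    # Build phase: index the (ideally smaller) left table by join key.
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)

    # Probe phase: one direct lookup per right-hand row, no block scanning.
    joined = []
    for right_row in right:
        for left_row in index.get(right_row[key], []):
            joined.append({**left_row, **right_row})
    return joined

# Hypothetical sample data.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"cust_id": 2, "total": 30},
    {"cust_id": 1, "total": 75},
    {"cust_id": 2, "total": 12},
]

result = hash_join(customers, orders, "cust_id")
```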
Then we come to caching. I can’t even begin to remember the number of times that, having explained Kognitio to someone, they have replied, “Oh, so it’s basically a big cache.” No! Caching is NOT in-memory. A cache is a copy of the last-used disk data held in memory. The query optimiser does not know the data is in memory, so it can’t optimise for execution in memory. Even the execution engine running the query does not know the data is in memory, so it has to check by running expensive (in terms of time) code to ask the question, “Is the data I need cached?” All components of an in-memory database know that the data being queried is in memory. The optimiser optimises for memory operations, and the execution engine never wastes CPU cycles asking the “cached, not cached?” question. The Kognitio optimiser goes a step further and also uses the speed of in-memory to dynamically test the selectivity of complex predicates during query optimisation.
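The extra work a cache imposes can be shown with a deliberately simplified model. Both classes below are hypothetical and do not describe any real engine's internals: `CachedTable` has to ask the “is it cached?” question on every block access, while `InMemoryTable` simply knows where its data is.

```python
class CachedTable:
    """Toy disk-oriented table with a block cache in front of it."""

    def __init__(self, disk_blocks):
        self.disk_blocks = disk_blocks  # stand-in for blocks stored on disk
        self.cache = {}                 # block number -> rows held in memory
        self.cache_checks = 0

    def read_block(self, block_no):
        self.cache_checks += 1          # asked on every access, hit or miss
        if block_no not in self.cache:
            self.cache[block_no] = self.disk_blocks[block_no]  # simulated disk read
        return self.cache[block_no]

    def total(self):
        return sum(
            value
            for block_no in range(len(self.disk_blocks))
            for value in self.read_block(block_no)
        )

class InMemoryTable:
    """Toy in-memory table: the data is known to be resident, so no checks."""

    def __init__(self, values):
        self.values = values

    def total(self):
        return sum(self.values)

cached = CachedTable([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
in_memory = InMemoryTable([1, 2, 3, 4, 5, 6, 7, 8, 9])

cached_total = cached.total()
memory_total = in_memory.total()
```

Both give the same answer, but the cached version performs one residency check per block on every query, even once everything is warm.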
Another difference between in-memory and a cache is the degree to which the user or administrator can control what data is in memory. For example, to optimise for different workloads, in-memory systems generally support a variety of strategies for distributing the data across the available processors. A cache does not have this flexibility. A cache will also automatically load the most recently used data into memory and, in the process, remove other data. This leads to unpredictable query performance. An in-memory system, on the other hand, allows specific data to be “pinned” into memory, thereby providing predictable query response times.
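The predictability point can be sketched with a toy least-recently-used cache. This is a simplified model rather than how any particular product works: “pinning” is modelled here as an exemption from eviction, and the class name, table names and capacity are all invented for the example.

```python
from collections import OrderedDict

class LruCache:
    """Toy LRU cache of table names, with an optional set of pinned tables."""

    def __init__(self, capacity, pinned=()):
        self.capacity = capacity
        self.pinned = set(pinned)
        self.entries = OrderedDict()  # oldest entry first

    def access(self, name):
        """Return True if `name` was already in memory, False if it had to be loaded."""
        if name in self.entries:
            self.entries.move_to_end(name)
            return True
        self.entries[name] = True
        while len(self.entries) > self.capacity:
            # Evict the least recently used entry that is not pinned.
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break  # everything resident is pinned; nothing can be evicted
            del self.entries[victim]
        return False

workload = ["sales", "customers", "logs", "sales"]

plain = LruCache(capacity=2)
plain_hits = [plain.access(table) for table in workload]

with_pin = LruCache(capacity=2, pinned={"sales"})
pinned_hits = [with_pin.access(table) for table in workload]
```

In the plain cache, the intervening query on `logs` pushes `sales` out, so the second `sales` query pays the load cost again. With `sales` pinned, it stays resident regardless of what else runs in between.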
Using the term in-memory for caching is, I suppose, somewhat understandable, but I have, on the odd occasion, actually heard people say that something is in-memory because it is using solid state disks. I have real difficulty with this claim. Obviously my earlier points about code efficiency etc. apply equally to solid state disk drives, but the biggest issue with these devices is that, although they are built from fast solid state memory (typically flash rather than RAM), they are designed to behave like traditional disk drives. To make them compatible with existing operating systems, solid state disk drives use the same interfaces as traditional disk drives, and so they are block access devices. Although a solid state disk is considerably faster than a traditional spinning disk, it is nowhere near the speed of RAM and cannot support the more efficient RAM-based algorithms mentioned above.
In conclusion, software needs to be designed for in-memory operation to get the full benefit from this drastically different medium. Porting code that was originally designed for disk-based operation can undoubtedly provide improved speed, but with analytical platforms experiencing ever-growing data volumes and increased workloads, it is crucial that the available compute resource is used as efficiently as possible. When in-memory, code efficiency matters.