About Kognitio

Kognitio evolved out of two businesses, Kognitio and White Cross Systems, which merged in 2005. The company released the world’s first fully functional, scaled-out, in-memory analytical platform in 1992.

Our vision

We set out with a simple vision: that scale-out computing, or Massively Parallel Processing (MPP), was the answer to the then-emerging problem of how to analyze rapidly growing data sets in a timely manner. We realized early on that massive parallelization required the data to be held in fast computer memory.

Historically, the high upfront cost of installing a dedicated hardware infrastructure meant that Kognitio remained a niche product for many years. However, the emergence of Hadoop has provided us with a ready-made platform that can be shared with other data processing and analysis technologies.

We already had a fully functional, fully parallelized, mature analytical platform, so we evolved it specifically for Hadoop. Parallel SQL is something we have been developing for years.

What sets Kognitio apart?

There are many SQL on Hadoop technologies out there, so you might well ask what’s special about Kognitio. Well, put simply, there are large variations in the performance, flexibility and maturity of available SQL engines.

Hive, Impala and SparkSQL, for example, are new SQL implementations that were developed from scratch for Hadoop. Yet SQL is a very large, complex standard. It’s difficult enough to implement on a serial platform; implementing it in parallel is mind-bogglingly difficult and time-consuming. Full parallel execution of SQL functionality is important because it allows products to scale.

For the past 25 years, Kognitio has been deployed on clusters of industry-standard servers, exactly the kind of infrastructure Hadoop runs on. So we simply evolved the established Kognitio platform to work on Hadoop. What’s more, a mature SQL implementation is more likely to have already solved the complex issues around concurrent workloads.

Today our focus is on providing ultra-fast access to big data, especially for users of BI tools such as Tableau or MicroStrategy who need to make fast-paced decisions for their business. We enable these BI tools to maintain interactive performance even when data volumes are large and user counts are high.

Our innovation

Early to mid 90s

v3 / v4

  • First in-memory database, delivered as a custom-made hardware appliance
  • Uses code generation for expression evaluations and some block operations such as projection (see the sketch after this list)
  • Cost-based optimizer
  • Compressed data maps to optimize block fetches from disk
  • ‘Range records’ for dealing with large groupby/sorts where the intermediate result overflows RAM
  • Full ACID multi-statement transaction support with fully updatable memory images mirrored in disk storage
  • ‘View image’ concept allows creation of memory-only query snapshots which can be defined and queried as views
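
To give a flavour of what code generation for expression evaluation means in practice, here is a minimal, hypothetical C++ sketch. It is not Kognitio's implementation and every name in it is invented; it simply contrasts an interpreted expression tree with the kind of flat, specialised function a code generator would emit at query time.

```cpp
// Illustration of the general technique only: interpreted expression
// tree versus a specialised "generated" function. Hypothetical code.
#include <cstdio>
#include <memory>
#include <vector>

// A tiny interpreted expression tree: every row pays for virtual
// dispatch and tree traversal.
struct Expr {
    virtual ~Expr() = default;
    virtual double eval(const double* row) const = 0;
};
struct Col : Expr {
    int idx;
    explicit Col(int i) : idx(i) {}
    double eval(const double* row) const override { return row[idx]; }
};
struct Mul : Expr {
    std::unique_ptr<Expr> l, r;
    Mul(std::unique_ptr<Expr> a, std::unique_ptr<Expr> b)
        : l(std::move(a)), r(std::move(b)) {}
    double eval(const double* row) const override {
        return l->eval(row) * r->eval(row);
    }
};

// What a code generator might emit for "col0 * col1": a single flat
// function with no per-row interpretation overhead. A real engine
// would produce this at query time rather than hard-code it.
static double generated_col0_times_col1(const double* row) {
    return row[0] * row[1];
}

int main() {
    std::vector<double> rows = {2, 3, 4, 5, 6, 7};  // three 2-column rows
    auto tree = std::make_unique<Mul>(std::make_unique<Col>(0),
                                      std::make_unique<Col>(1));
    for (std::size_t i = 0; i < rows.size(); i += 2) {
        double interpreted = tree->eval(&rows[i]);
        double compiled = generated_col0_times_col1(&rows[i]);
        std::printf("row %zu: interpreted=%g generated=%g\n",
                    i / 2, interpreted, compiled);
    }
    return 0;
}
```
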
Late 90s

v5

  • Moves to an x86-based, custom-made hardware appliance
  • Adds code generation for larger block operations (joins, sorts), resulting in a 2-5x speedup of common operations
  • Adds plugin UDF capability
  • Adds code joins for efficient handling of lookup tables
  • Adopts Linux kernel
Early 2000s

v6

  • Moves to run on appliances made from third-party x86-based hardware with Linux
  • Switches to a recomputation-based query streaming engine to deal with queries that don’t fit into RAM
  • Intermediate results stored in discardable memory caches to eliminate unnecessary recomputations (see the sketch after this list)
  • Dynamically resizing buffers improve the ability of long-running queries to adapt to changing workloads on the fly
  • Adds ‘full operation’ code generation to generate entire operations as single efficient functions. Doubles performance for most operations
  • Adds ‘bushy query plans’ to optimizer
  • Adds query queues to improve management of heavy concurrent workloads
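
The pairing of recomputation-based streaming with discardable caches can be pictured with a small, hypothetical C++ sketch (not Kognitio's code; the class and names below are invented): an intermediate result lives in a cache the engine may throw away under memory pressure, and on the next use it is recomputed from its inputs rather than spilled to and read back from disk.

```cpp
// Hypothetical illustration of "recompute instead of spill".
#include <cstdio>
#include <functional>
#include <optional>
#include <vector>

class DiscardableIntermediate {
public:
    explicit DiscardableIntermediate(std::function<std::vector<int>()> compute)
        : compute_(std::move(compute)) {}

    // Return the cached result, recomputing it if it has been discarded.
    const std::vector<int>& get() {
        if (!cached_) {
            std::puts("(re)computing intermediate result");
            cached_ = compute_();
        }
        return *cached_;
    }

    // Called when the engine needs to reclaim memory.
    void discard() { cached_.reset(); }

private:
    std::function<std::vector<int>()> compute_;
    std::optional<std::vector<int>> cached_;
};

int main() {
    DiscardableIntermediate step([] {
        std::vector<int> v;
        for (int i = 0; i < 5; ++i) v.push_back(i * i);  // stand-in for a query step
        return v;
    });

    std::printf("first use, size=%zu\n", step.get().size());      // computes
    std::printf("second use, size=%zu\n", step.get().size());     // cache hit
    step.discard();                                                // memory pressure
    std::printf("after discard, size=%zu\n", step.get().size());  // recomputes
    return 0;
}
```
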
Mid 2000s

v7

  • Expands to offer a software-only product for installation on third-party clusters
  • Adds support for window functions and grouping sets/rollup, powered by the core code generation engine instead of by query transformations
  • Adds high-speed data load/unload capabilities supported by code generation to speed data conversion. Parallel splitting of CSV streams
  • Parallel streaming full-function-code-generated merge-sort operation speeds 2-step grouping operations, with hint records to eliminate stalling, reduce memory overhead and reduce recomputations
  • Implements a high-speed UDP-based message-passing fabric for better scalability and performance
  • Slab-based memory management to speed allocation and eliminate memory fragmentation under heavy loads (see the sketch after this list)
  • N-Way joins as single code-generated functions
  • Is capable of running 96 of the 99 TPC-DS queries well and the other 3 slowly
  • Improves multiple distinct operations by transforming into grouping set operations
  • Disk repacker removes the need for offline maintenance windows to recover space from deleted records or add new storage
  • Adds arbitrary character set support
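
For readers unfamiliar with slab allocation, the hypothetical C++ sketch below illustrates the general technique (it is not Kognitio's allocator): a large slab is carved into fixed-size cells handed out from a free list, so allocation and freeing are constant-time and the slab cannot fragment however the calls interleave.

```cpp
// Hypothetical slab allocator sketch for fixed-size cells.
#include <cstddef>
#include <cstdio>
#include <vector>

class Slab {
public:
    Slab(std::size_t cell_size, std::size_t cells)
        : cell_size_(cell_size), storage_(cell_size * cells) {
        // Initially, every cell is on the free list.
        for (std::size_t i = 0; i < cells; ++i)
            free_list_.push_back(storage_.data() + i * cell_size_);
    }

    void* alloc() {
        if (free_list_.empty()) return nullptr;  // caller would grab a new slab
        void* cell = free_list_.back();
        free_list_.pop_back();
        return cell;
    }

    void free(void* cell) {
        // Returning a cell never splits or coalesces blocks, so the
        // slab cannot fragment no matter how allocations interleave.
        free_list_.push_back(static_cast<char*>(cell));
    }

private:
    std::size_t cell_size_;
    std::vector<char> storage_;
    std::vector<char*> free_list_;
};

int main() {
    Slab rows(/*cell_size=*/64, /*cells=*/4);  // e.g. 64-byte row buffers
    void* a = rows.alloc();
    void* b = rows.alloc();
    std::printf("a=%p b=%p\n", a, b);
    rows.free(a);
    void* c = rows.alloc();  // reuses a's cell
    std::printf("c=%p (reused)\n", c);
    rows.free(b);
    rows.free(c);
    return 0;
}
```
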
Early 2010s

v8

  • Adds memory image re-use
  • Adds parallel external table connectivity
  • Adds parallel external scripting interface for high speed ‘Not Only SQL’ tasks
  • Adds first Hadoop connectivity with HDFS and a push-down-capable map-reduce connector
  • Adds high-speed backup/restore and migration features with parallel compression/decompression powered by code generation
  • Incremental backup feature which can also be used for asynchronous data replication between servers
  • Data sampling improves runtime statistics gathering and speeds compilation
  • NUMA awareness, cache prefetching and other related techniques
  • String tokenization speeds groupby operations (see the sketch after this list)
  • Adds variable-format records for more efficient grouping set operations
  • Performance tune-up and optimization, doubling the performance of most workloads
  • Partial groupby improves performance and allows the optimizer to add extra duplicate eliminations without worst-case problems
  • Pre-joins feature introduces memory images which contain pointers to related rows in other tables, for super-fast joins
  • ‘Disk train’ optimization arranges disk scans to eliminate repeat disk block reads
  • Partitioned and compressed memory images
  • Is capable of running all 99 TPC-DS queries well
  • Optimized paths for short queries and parallelism. Short queries can run in under 0.1s
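
‘String tokenization’ here refers to the dictionary-encoding idea sketched below in hypothetical C++ (an illustration of the general technique, not Kognitio's code): each distinct string is assigned a small integer token once, so the grouping hash table hashes and compares integers instead of full strings.

```cpp
// Hypothetical sketch of tokenized (dictionary-encoded) groupby.
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::string> city = {"London", "Paris", "London",
                                     "Oslo", "Paris", "London"};
    std::vector<double> sales = {10, 20, 5, 7, 3, 8};

    // Pass 1: build the dictionary and rewrite the column as tokens.
    std::unordered_map<std::string, int> dict;
    std::vector<int> tokens(city.size());
    for (std::size_t i = 0; i < city.size(); ++i) {
        auto it = dict.try_emplace(city[i], static_cast<int>(dict.size())).first;
        tokens[i] = it->second;
    }

    // Pass 2: group by the integer tokens (cheap hashing and comparison).
    std::vector<double> total(dict.size(), 0.0);
    for (std::size_t i = 0; i < tokens.size(); ++i)
        total[tokens[i]] += sales[i];

    for (const auto& [name, tok] : dict)
        std::printf("%-6s -> %g\n", name.c_str(), total[tok]);
    return 0;
}
```
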
2016

v8.2

  • Releases Kognitio on Hadoop
  • Adds support for JSON and Avro file formats, plus connectors for ORC and Parquet
  • Intelligent parallelism
  • Concurrency optimizations improve scale-out for heavily concurrent workloads, handling hundreds of sessions and thousands of queries per second