To explain what we mean by “complex”, it helps to think of most database queries as having two distinct steps:
Step 1: The “filter” step – find the subset of data required to fulfill the query.
Step 2: The “crunching” step – process the filtered data to calculate and present the required result.
In general, simple queries are dominated by the filter step, with the crunching step being a trivial operation, e.g. a simple count of widgets for a time period. For complex queries the crunching step predominates: large amounts of CPU time are consumed by joining, calculating, aggregating and filtering aggregates across many steps.
Because of this inherent ability to perform well for complex queries, Kognitio is an ideal platform for Advanced Analytics. Advanced Analytics involves applying advanced algorithms to data to try to gain more meaningful insight, including predictions about what might happen in the future, as opposed to simply using the data to report on what has happened in the past. These algorithms involve lots of heavy-duty data crunching and are CPU-intensive operations.
The problem with Advanced Analytics is that many of the algorithms used are difficult or impossible to express in SQL (a set-based rather than a procedural language). These operations have therefore typically been performed externally to the data source or database. Extracting the data from the database to run it through an external algorithm is a painful process at best, but with large data volumes it becomes so expensive in I/O terms as to be totally impractical. The biggest challenge of all is that processing large data volumes in a timely manner often requires multiple copies of the algorithm to be run in parallel on carefully defined chunks of the data, a very complex and difficult exercise for any organization to undertake.
The solution adopted by some is to throw away SQL altogether and invent another way of querying data, e.g. NoSQL and MapReduce. This is very much an engineering-led approach to the problem and ignores the fact that SQL has many important strengths, as well as being the de facto standard for most BI tools and data-driven applications. Not having SQL access to data severely restricts the people within an organization who can freely interact with this valuable information resource.
Other vendors have chosen to embed specific analytical algorithms directly into their database. Some have done this in a way that allows parallel execution of the limited algorithm set. Others have simply added a way of calling out to an external process, but that approach becomes I/O-bound for larger data sets. In some cases vendors restrict the languages that can be used to create these algorithms.
Kognitio has taken a different approach, allowing any script or binary that can run in a Linux environment to be used by the platform, e.g. R, Python, Perl, Java, SAS script, C, Fortran, custom scripts, etc. This feature is called “External Scripts”. As long as the code can read data on standard input (stdin) and write results to standard output (stdout), Kognitio can execute it in place, within the platform, invoked inline within the user’s or application’s SQL query. The code is automatically executed in a massively parallel context, with, by default, one distinct copy of the code running on every CPU core, each processing a subset of the data controlled by partition statements in the query.
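A worker of this kind is just an ordinary filter over stdin and stdout. The sketch below (plain Python; the CSV column layout and the aggregate computed are illustrative assumptions, not Kognitio specifics) shows the shape such a script might take – each parallel copy would receive its own subset of rows and emit its own result record:

```python
import sys

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical worker script: reads CSV rows of the assumed form
    "widget_id,quantity" from stdin, and writes one summary record
    "count,total" to stdout. In the external-script model, the platform
    runs one copy per CPU core and merges every copy's output."""
    total = 0
    count = 0
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        _, quantity = line.split(",")
        total += int(quantity)
        count += 1
    # One output record per script invocation.
    stdout.write(f"{count},{total}\n")
```

A real deployment would call `main()` at module level so the interpreter acts as a pure stdin-to-stdout filter; it is kept as a function here so it can be exercised with in-memory streams.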
With some simple additions to the SQL syntax, Kognitio allows users to easily control the number of parallel code executions, data partitioning, data sequencing and break points in execution.
Output from all script invocations is merged into a single virtual table that feeds into the next stage of the controlling SQL query’s execution plan.
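The overall execution model – partition the data, run one copy of the code per partition, then merge the outputs into a single result set – can be sketched outside the database as follows. The helper names are hypothetical and the threading is only a stand-in for the platform’s native per-core parallelism:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key):
    """Group rows into partitions, mirroring the effect of a
    partition clause in the controlling SQL query."""
    parts = defaultdict(list)
    for row in rows:
        parts[key(row)].append(row)
    return parts

def run_partitioned(rows, key, script):
    """Run one copy of `script` per partition in parallel and flatten
    the per-copy outputs into one list, standing in for the merged
    virtual table that feeds the rest of the query plan."""
    parts = partition(rows, key)
    with ThreadPoolExecutor() as pool:
        results = pool.map(script, parts.values())
    return [row for result in results for row in result]
```

For example, summing widget quantities per region would pass `key=lambda r: r[0]` and a `script` that reduces each partition to a single `(region, total)` row; each “script” copy sees only its own partition, just as each parallel external-script process sees only its slice of the data.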
By taking this approach, Kognitio does not limit the analytics that can be run in-platform to those specifically supported by Kognitio, opening up an amazing freedom of choice and capability. By using SQL as the management wrapper, business users can easily control the process and visualize the results using standard BI applications and tools through traditional interfaces such as ODBC/JDBC.