With the release of Kognitio version 8.1, the Kognitio Center of Excellence (KACE) team have been looking at exploiting the new capabilities on some interesting analytic problems.
As analysis becomes more complex and the associated data sets become larger, the computational workload required is increasing at least as quickly as hardware capability, if not more quickly.
The overriding theme of our research is: how do we break down the workload so that we can parallelize the computation and exploit the full power of the available system?
There are many ways to exploit parallel computation to solve large problems, but for KACE there is the additional important requirement that our approach should be accessible to users of Kognitio, so that they can apply it to their own specific requirements.
Below is a quick overview of some of the interesting problems we’ve been working on. Let me know what you think.
In the area of web analytics, Ross has been working with an existing client to investigate the automated production of analytical output that can be used in some complex visualizations. This involves a number of steps. The first is data prep, which is straightforward to parallelize as it consists of row-based operations. However, parsing URLs is not well suited to SQL, so Ross has done this using Python. The parsing script is then executed in parallel across the system. Once the data is clean, he has carried out some analytics looking at traffic and journeys. Ross is now working on the parallel production of the JSON files that drive the visualization. We are utilizing examples from d3.js.
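To give a flavour of why the data prep step parallelizes so easily, here is a minimal sketch of row-based URL parsing in Python. It is not Ross's script: the row layout, the output field names and the "q" search parameter are assumptions made for illustration, and a thread pool simply stands in for the parallel invocations that the platform would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, parse_qs


def parse_row(row):
    """Split one (visitor_id, url) row into host, path and query fields."""
    visitor_id, url = row
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "visitor_id": visitor_id,
        "host": parts.netloc.lower(),
        "path": parts.path or "/",
        # Hypothetical field of interest: first value of a "q" parameter.
        "search_term": query.get("q", [None])[0],
    }


# Made-up rows standing in for raw web log records.
rows = [
    ("v1", "http://Example.com/products/shoes?q=red+boots&page=2"),
    ("v2", "https://example.com"),
]

# Each row is independent of every other row, so the work can be split
# across any number of workers without coordination.
with ThreadPoolExecutor(max_workers=2) as pool:
    parsed = list(pool.map(parse_row, rows))

for record in parsed:
    print(record)
```

Because no row depends on another, the same function could be applied to any partition of the data, which is what makes this step a natural fit for parallel execution.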
Peak Forecast Analysis
Zoe has been implementing work done by academics from Reading University on forecasting methods for volatile data where predicting peaks is essential. This has applications in smart metering and in forecasting household-level electricity demand. Traditional forecasting techniques, employing error measures such as mean absolute error or root mean square error, penalize forecasts where peaks are of the correct size and amplitude but displaced in time. Clearly this is not desirable when forecasting electricity demand, and the approach employed by Reading overcomes it. Increasing the complexity of the forecast to account for time displacement requires more computation than traditional approaches to produce a single forecast. However, the need to repeat the forecasting over thousands of households lends itself perfectly to parallelization. This is where Zoe (and the Kognitio Analytical Platform) comes in. She has implemented the approach using MPP in-memory R and is running comparisons with more traditional time-series techniques such as Holt-Winters. We may also look at a C++ implementation to compare performance, and we plan to investigate how performance changes as we scale out over more households.
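The problem with pointwise error measures can be seen in a toy example. In the sketch below, a forecast with the right peak one time step late scores worse on both mean absolute error and root mean square error than a flat forecast that misses the peak entirely. The series are made up, and the `windowed_mae` function is only a simplified illustration of tolerating small time shifts; it is not the measure used by the Reading researchers.

```python
from math import sqrt


def mae(actual, forecast):
    """Mean absolute error, compared point by point."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error, compared point by point."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def windowed_mae(actual, forecast, w=1):
    """Illustrative only: match each actual value to the closest forecast
    value within w time steps, so small displacements are not penalized."""
    n = len(actual)
    total = 0.0
    for t, a in enumerate(actual):
        lo, hi = max(0, t - w), min(n, t + w + 1)
        total += min(abs(a - f) for f in forecast[lo:hi])
    return total / n


actual = [1, 1, 1, 5, 1, 1]   # one peak at t = 3
shifted = [1, 1, 1, 1, 5, 1]  # correct peak, one step late
flat = [1, 1, 1, 1, 1, 1]     # no peak at all

for name, forecast in [("shifted", shifted), ("flat", flat)]:
    print(name, mae(actual, forecast), rmse(actual, forecast),
          windowed_mae(actual, forecast))
```

The shifted forecast is penalized twice by the pointwise measures, once for the missing peak and once for the misplaced one, which is why the flat forecast looks better on those scores.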
Super-charging “traditional” analytical models
Clustering and regression analysis are the workhorses of the analytical arena. Scoring existing models within databases has been possible for some time and Kognitio is no exception here, but KACE members Chak, Sunil and Tasneef are looking to bring the power of MPP processing to the modeling phase of the problem.
If you have many models to run on a regular basis then the productivity gain from having the ability to parallelize them is clear, but how can Kognitio help for a single larger problem?
The first issue often encountered in regression and clustering, if not in all modeling techniques, is data prep. Clearly having turbo-charged SQL query performance helps here, but we want to take things further. Tasneef is working on code to automate data transformation tasks, while Chak is developing some additional tools around attribute selection: coding MPP coarse classification in order to allow you to investigate which of your hundreds of attributes you should employ in your models. He plans on investigating the implementation of principal component analysis next.
For clustering, there are some obvious areas where MPP can help. Assessing the robustness of clusters produced can be a slow process as you have to run the cluster generation process many times with random starting points. Clearly this is something that can be sped up with multiple iterations of the model running in parallel. Fast, easy access to these results allows models to be assessed efficiently by the modeler.
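As a rough sketch of the idea, the code below runs a plain k-means from several random starting points at once and then checks how many of the runs agree with the best result. Everything here is an assumption for illustration: the data is synthetic, the k-means is a bare-bones pure Python version, and a thread pool stands in for parallel invocations on the platform.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(group) for c in zip(*group))


def kmeans(points, k, seed, iterations=20):
    """Lloyd's algorithm from one random start; returns (inertia, centres)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centres[i]))
            groups[nearest].append(p)
        # Keep the old centre if a cluster ends up empty.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    inertia = sum(min(dist2(p, c) for c in centres) for p in points)
    return inertia, centres


# Synthetic data: three blobs around made-up centres.
data_rng = random.Random(42)
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]

seeds = range(8)
# Each restart is independent, so all of them can run at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: kmeans(points, 3, s), seeds))

best_inertia = min(inertia for inertia, _ in results)
# Count the restarts that land within 1% of the best solution found.
stable = sum(1 for inertia, _ in results if inertia <= best_inertia * 1.01)
print(f"{stable} of {len(results)} restarts agree with the best solution")
```

If most restarts converge to the same solution, the clusters are probably robust; a wide spread of results suggests the structure is weak or the number of clusters is wrong.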
Kognitio’s ability to easily control the partition of data into different code invocations means we could easily build clusters of different samples of the data set in parallel, which is another way to test the robustness of the cluster model.
There is also the question of how many clusters to use in the model. However you approach this decision, you often need to run your clustering algorithm for different numbers of clusters and compare the results. KACE plan to bring MPP to bear here, allowing modelers to assess their options more expediently. Tasneef will be looking to parallelize this using Python and C.
For regression, it is more difficult to see how to implement an MPP approach.
If you are using a stepwise process, then within each step the candidate additions (or removals) of attributes can be evaluated in parallel, yielding some improvement in performance for each step. That is, starting from the model chosen in the previous step, each remaining candidate variable can be tested in parallel.
If you are trying to carry out regression over a data set that is large in terms of the number of observations, then creating multiple regression models using different samples (i.e. bootstrapping) can be used to build confidence intervals for various regression statistics or coefficients. If you have a large number of attributes to choose from, then parallelism can also be brought to bear. KACE will be looking to implement some of these techniques exploiting Kognitio’s MPP.
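The bootstrapping idea can be sketched in a few lines of Python. The example below fits a simple one-variable regression on many resamples of a synthetic data set and reads a percentile confidence interval for the slope off the results. The data, the number of replicates and the use of a thread pool are all assumptions for illustration; on an MPP system each replicate would be a separate parallel invocation.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fit_line(sample):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample)
    sxy = sum((x - mx) * (y - my) for x, y in sample)
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic data with a true slope of 2 and unit-variance noise.
data_rng = random.Random(1)
data = [(i / 10, 2 * (i / 10) + 1 + data_rng.gauss(0, 1)) for i in range(200)]


def bootstrap_slope(seed):
    """Refit the model on one resample drawn with replacement."""
    rng = random.Random(seed)
    resample = [rng.choice(data) for _ in data]
    return fit_line(resample)[0]


replicates = 400
# Every replicate is independent, so they can all run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    slopes = sorted(pool.map(bootstrap_slope, range(replicates)))

# 95% percentile interval: drop the lowest and highest 2.5% of slopes.
lower = slopes[int(0.025 * replicates)]
upper = slopes[int(0.975 * replicates) - 1]
print(f"slope estimate {fit_line(data)[0]:.3f}, "
      f"95% interval ({lower:.3f}, {upper:.3f})")
```

The same pattern works for any regression statistic: compute it on each resample in parallel, then summarize the distribution of results.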
With the massive increase in the use of social media data, such as Twitter and Facebook, network analysis is of increasing interest to business. Who is talking about your product/brand/advert? Who are they interacting with? Who should I market to in order to get market penetration? These are the same traditional marketing questions, but with more data to hand and therefore more information about social relationships. The starting point is the ability to build meaningful networks. Kaustubh has been working on utilizing the power of Kognitio’s in-memory MPP to help do this more efficiently. This is actually a difficult problem to solve, as potentially everyone is linked to everyone else in one large network. What is the most efficient way to build smaller networks in parallel and then combine them? How do I divide the data for parallel processing to minimize the number of iterations I have to go through to obtain the full network picture?
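One simple way to picture the "build small networks, then combine" question is with connected components. The sketch below is not Kaustubh's approach: it is a toy illustration in which each partition of an interaction list builds its own local groups with a union-find structure, and a final pass merges the partial results. The names and the way the data is split are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def find(parent, x):
    """Return the representative of x, registering x if it is new."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x


def union(parent, a, b):
    """Join the groups containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def local_components(edges):
    """Build groups from one partition; return (node, representative) pairs."""
    parent = {}
    for a, b in edges:
        union(parent, a, b)
    return [(node, find(parent, node)) for node in list(parent)]


def merge(partials):
    """Combine partial results by treating each pair as an edge."""
    parent = {}
    for pairs in partials:
        for node, rep in pairs:
            union(parent, node, rep)
    groups = {}
    for node in list(parent):
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


# Made-up interactions between users.
edges = [("ann", "bob"), ("bob", "cat"), ("dan", "eve"),
         ("cat", "dan"), ("fay", "gus")]
partitions = [edges[0::2], edges[1::2]]  # arbitrary split into two chunks

# Each partition is processed independently, then the results are merged.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(local_components, partitions))

networks = merge(partials)
print(networks)
```

Here a single merge pass is enough because each partition hands back a compact summary of its groups. How the data is divided determines how much work is left for that final step, which is exactly the trade-off raised above.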
In closing, we hope you enjoyed learning a bit more about the KACE team’s current parallel computation problems surrounding data visualization, peak forecast analysis, supercharging “traditional” analytical models and network analysis. Please contact Dr. Sharon Kirkham and the KACE team at KACE@kognitio.com should you have any questions.
Reference: Lyn C. Thomas, David B. Edelman and Jonathan N. Crook, Credit Scoring and Its Applications, 2002, p. 131.