Chak Leung – KACE data scientist
There are a plethora of mathematical techniques openly available to us with open source code easily available and packaged, ready for us to exploit. However, as Data Scientists, we often have favoured algorithms and it is easy for us to become stagnated. Additionally common approaches to data parallelisation can have their drawbacks and if our choice of mathematical technique doesn’t mesh well with this, what can we do about it?
At Kognitio we believe it is essential to incubate new ideas and like to reach out to academia where new techniques are being developed everyday but may be underutilised due insufficient data/infrastructure, something we have plenty of.
Venue and partner
In late August the KACE team went to the STOR-i centre for doctoral training at Lancaster University for a one day workshop event where we presented some of these problems.STOR-i (Statistics and Operational Research) is a joint venture between the departments of Mathematics and Statistics and Management Science offering four year programmes with industry partner involvement.
The workshop was an excellent introduction to practical “big data” problems for the attendees including undergraduates, PhD students and post-doc researchers and gave Kognitio an opportunity to get fresh input into existing clustering problems including initialising clusters and amalgamating cluster models produced in parallel.
When data parallelisation isn’t enough?
Why is parallelising algorithms a problem? Take, for example, the K-means clustering algorithm where we might need to ask ourselves:
- Can I fit all the data within the memory of one single model?
- Would the extended run time be significant?
- Could the initial clustering start points be better placed for fewer iterations reducing the run times?
- Is there a way to combine all the parallel models into one at the end to combat the extended run time?
Data parallelisation answers the first two questions; we can duplicate the algorithm across threads and bypass the requirement of fitting all the data into memory on one node by fanning the data out and with parallel platforms, such as Kognitio, this also massively reduces execution times as well.
But this causes problems: how can we implement synchronised smarter start points and how do we combine the results from all the threads?
In the example above notice that if we simply split the data up and create a model in each thread we generate three models but really what we are after is one model. Here the K-means problem becomes much more complicated as the movement of clusters (from the initial start points) in one thread are not known to the other threads. This means there is the possibility of very different final cluster results for each thread in our parallel algorithm.
There are existing methods of model aggregation such as bootstrap aggregation or even taking smaller random samples to avoid parallelism altogether but these have their own issues they are based on estimations with no certain guarantees.
Thus, in search of a new and better solution, the challenges that we proposed to the attendees at STOR-I were:
- Can we define some metrics or aggregation procedures so that the parallelism can be kept intact and results combined?
- Could the steps of the clustering algorithm be parallelised so that every node can be simultaneously exploited to contribute to the results of each stage?
The main focus points were initialisation: better starting points than just random ones and the aggregation of cluster models to produce one large one.
Every group recognised that initialisation of the cluster starting points were important leading to many efforts in this area. Their ideas included
- Initialising clusters based on variations from global means uniformly distributed on a hypersphere. This involved setting the initial clusters closer to the global mean of the data and projecting to areas where they would fit best. It was acknowledged distributing clusters would be difficult at higher dimensions than the presented 2-dimensions.
- Genetic algorithm – split up the data initially where each centroid candidate is compared with the current selection, removing the lower quality ones in favour of potentially higher quality centroids. The problem here is that this might be computationally expensive for an initialisation step as with many heuristics.
- Produce more clusters than needed – overfitting. Split the data into N threads trying to put closer data together, i.e. in the same thread then on each thread to over fit the data. Later iterations or aggregation could then bring similar clusters together.
There were some interesting attempts to define characteristics of parallelised clusters such as size, variability, inter and inner cluster distances. The notion of cluster density to compare clusters was defined as:
From here the problem is reduced by re-clustering any data that does not fit a defined density criteria. This is repeated to produce the final model. This is an interesting metric which may be useful in comparing the characteristics of clusters produced in parallel by different node threads to determine whether they are suitable to be joined together.
A MapReduce approach was also suggested whereby the data is initially split and broadcast to all mappers with distances being computed in parallel. This allows the construction of global variant centers and assigns each data observation to a cluster. Finally in the reducing/combine stage, collect the sum of observations assigned to each cluster for the distribution across all clusters. This implementation works very well with Kognitio’s MPP platform and we have already tested an implementation of this.
It was acknowledged that re-joining the clusters after parallel iterations is not a trivial process. A super K-means process where the clusters from the parallel processes are themselves clustered was suggested by a few groups and this would be interesting if the overfitting approach was used in the initial parallel processes.
Another suggestion that was interesting was to carry out secondary passes where similar clusters are forced into the same processing thread to see if the clusters are then brought together.
A final suggestion was made for a different approach to the clustering algorithm itself – expectation maximisation (EM), an unsupervised learning algorithm which lifts the limits of spherical clusters for more diversity using maximum likelihood estimates.
The KACE team are looking to implement some these ideas into our existing k-means clustering and investigate using the EM algorithm too. Our top priority will be robust amalgamation of the results from each thread before attempting to initialise better starting points.