Enhancing MPP performance with intelligent parallelism

Companies have long accepted that, with the explosion in data volumes described in this Analytics Week article, this Inside Big Data article and elsewhere, Massively Parallel Processing (MPP) is the only way to deal with today’s data processing requirements. This Microsoft tech blog post outlines the reasons for moving to MPP for data processing. MPP systems solve the problem of scalability by bringing the processing power of all processors in the system to bear. A single query can use the entire platform to crunch your data, and surely all is well with the world.

MPP performance limitations, and their impact on businesses

There are some general rules about MPP processing which indicate the performance gain available from parallelization.

For example, Amdahl’s law considers the parallelizable and non-parallelizable parts of a problem, and gives an upper limit on the performance gain available. If a problem takes 10 minutes with 1 processor, and 5 minutes of that work cannot be parallelized, then with 5 processors the total time could drop to 6 minutes (5 minutes for the non-parallelized part, and 1 minute for the parallelized part). We can also see that however many processors you have, the time cannot drop below the 5 minutes required for the non-parallelizable portion.
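The arithmetic above generalizes to any processor count. A minimal sketch, using the made-up 10-minute example rather than real measurements:

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup for a given number of processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# The example above: 10 minutes of work, 5 of which cannot be parallelized.
serial_fraction = 5 / 10
for p in (1, 5, 100, 1000):
    total = 10 * (serial_fraction + (1 - serial_fraction) / p)
    speedup = amdahl_speedup(serial_fraction, p)
    print(f"{p:>4} processors: {total:.2f} min, speedup {speedup:.2f}x")
```

Even with 1000 processors the total time only approaches the 5-minute floor, and the speedup never reaches 2x.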

Note that Amdahl’s law gives an upper limit on the performance gain from parallelization, because extra work is required to parallelize (e.g. coordination of the processors involved, including providing them with the details of the problem at the start, and collating their results at the end).

Other metrics, such as the Karp-Flatt metric, take into account the cost of parallelization by determining the serial fraction of processing empirically, to ascertain whether an observed speedup falls short because of limited opportunities for parallelism, or because overhead increases as the number of processors increases.
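The metric itself is simple to compute from observed speedups. A sketch, with invented measurements purely for illustration:

```python
def karp_flatt(speedup, processors):
    """Experimentally determined serial fraction e for an observed speedup."""
    return (1 / speedup - 1 / processors) / (1 - 1 / processors)

# Hypothetical measurements: observed speedups as processors are added.
measurements = [(2, 1.9), (4, 3.3), (8, 5.0), (16, 6.5)]
for p, s in measurements:
    # If e stays roughly constant, the limit is the serial fraction itself;
    # if e rises with p (as it does here), parallelization overhead is growing.
    print(f"p={p:>2}: e = {karp_flatt(s, p):.3f}")
```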

In practice, this means businesses see the following issues:

Some operations do not benefit from an MPP platform of increased size, and can even suffer worse performance than on a smaller platform

The extra work involved in parallelization increases with the number of processors available – fundamentally, telling 1000 processors to start working on a problem is more work, and will take longer, than telling 10 processors to do the same thing. This is true whether you have the source processor tell all the other processors directly, or have a tree or other structure to pass messages through, so e.g. the source processor tells 10 other processors, each of which tells 10 more, each of which tells up to 10 more.

Other operations are more complex still, such as replicating data that is currently distributed randomly across the processors to all of them (this is common for e.g. joining dimension tables to a fact table in databases) – rather than having 1 processor distribute a few KB to every other processor, as in the case of the instruction to start work above, you have 1000 processors each trying to distribute a few KB to every other processor.
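The difference in communication costs can be sketched with a toy message-counting model (the fan-out of 10 matches the tree example above; all figures are illustrative):

```python
import math

def broadcast_messages(processors):
    """One source informing everyone: each processor is told exactly once."""
    return processors - 1

def broadcast_rounds(processors, fan_out):
    """Rounds needed when each informed processor tells fan_out others per
    round, so after r rounds (fan_out + 1) ** r processors are informed."""
    return math.ceil(math.log(processors, fan_out + 1))

def replicate_messages(processors):
    """Every processor sending its slice of data to every other processor."""
    return processors * (processors - 1)

for n in (10, 100, 1000):
    print(f"{n:>4} processors: broadcast {broadcast_messages(n)} msgs "
          f"({broadcast_rounds(n, 10)} rounds), "
          f"replication {replicate_messages(n)} msgs")
```

Broadcast cost grows linearly with the processor count, but all-to-all replication grows quadratically: at 1000 processors it is nearly a million messages.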

Note that this extra work has a disproportionately large impact on operations that have little work to do (e.g. they are operating on a small amount of data); the overhead of communication increases with the number of processors in use, but very little time is gained from parallelization when the parallelizable work doesn’t take long even on a single processor. Counter-intuitively, the time for operations with little work to do will tend to increase as resources increase. For example, if you have a billion rows to spread over 1000 processors, that seems reasonable. But if you have 100,000 rows and take the same approach, the execution time is dominated by setting up the operation including spreading the 100,000 rows over the processors, and relatively little time is taken processing the resulting 100 rows on each processor.
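A toy cost model makes this concrete. The setup and per-row costs below are invented, but with any such model the optimal processor count shrinks with the data volume:

```python
def run_time(rows, processors, setup_per_proc=0.001, per_row=1e-6):
    """Toy cost model: coordination cost grows with the processor count,
    while processing cost shrinks as rows are spread across processors."""
    return processors * setup_per_proc + (rows / processors) * per_row

for rows in (1_000_000_000, 100_000):
    best = min(range(1, 1001), key=lambda p: run_time(rows, p))
    print(f"{rows:>13} rows: best degree {best}, "
          f"time {run_time(rows, best):.3f}s vs "
          f"{run_time(rows, 1000):.3f}s at 1000 processors")
```

With these assumed costs, a billion rows really does want all 1000 processors, but the 100,000-row job runs best on about 10 of them; forcing it onto 1000 makes setup dominate.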

Often a whole task, such as a database query, involves a lot of small operations and relatively few large operations. In this scenario, the increase in time for running all those small operations can match or even exceed the gains from parallelization of the large operations. We’d like to avoid the extra work of parallelization to all processors in cases where we won’t recoup that cost by running the parallelizable part of the operation much more quickly.

All operations contend with each other, which hampers the achievable levels of concurrency

We also need to think about concurrency – it might make sense to split processing into 1000 processors if I have 1 job running, but if I have 1000 jobs running concurrently, does it pay to split each of them over 1000 processors, and hence have 1000 concurrent tasks per processor? Instead, I’d like to divide each job into e.g. 10 tasks, so each processor only ends up with 10 tasks rather than 1000. That way the processors spend more time doing useful work, and less time arbitrating between a large number of concurrent tasks. So ideally I would determine the level of parallelism for individual tasks with some weighting for the level of concurrent activity – I will be happier to parallelize to a larger number of processors when I have relatively low concurrency, but reduce the level of parallelization as concurrency increases.
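One simple heuristic along these lines (a sketch of the idea, not Kognitio's actual policy) divides the platform by the number of concurrently running jobs:

```python
def parallelism_degree(total_processors, concurrent_jobs, min_degree=1):
    """Heuristic: share the platform across concurrent jobs, shrinking the
    per-job parallelism as concurrency rises, but never below min_degree."""
    return max(min_degree, total_processors // concurrent_jobs)

for jobs in (1, 10, 100, 1000, 5000):
    print(f"{jobs:>4} concurrent jobs: "
          f"{parallelism_degree(1000, jobs)} processors per job")
```

A lone job gets all 1000 processors; at 100 concurrent jobs each gets 10, so each processor still handles only around one task per job it participates in, rather than 100.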

This is particularly troublesome for the increasingly common Self Service applications that companies are rolling out, allowing suppliers, customers or internal staff to analyse data. These drive much higher concurrency requirements than were typical before this level of data democratization.

To resolve these problems we need Intelligent Parallelism – we want all the power and scalability of MPP systems, but they should scale the resource brought to bear on each piece of work appropriately. The balance between processing power and parallelization overhead must always be found for each piece of work – the amount of resource needed will vary with the data, and the system must dynamically allocate appropriate resources.

The benefits of this are:

  • an individual operation can run quicker if a small subset of resource is adequate for its processing, as the parallelization cost will be lower than that of using all the resource available
  • more concurrent operations can run at the same time, because each one can use a subset of resource, so you see scale-out benefits (coping with more users)
  • the full power of all resource is still available for operations that need it, so you see scale-up benefits (coping with more data)

Intelligent Parallelism

The rest of this post talks about an approach to scale the number of servers brought to bear on a problem in line with the expected performance benefit, which Kognitio calls Intelligent Parallelism. This reflects the limits on performance improvement available by increasing the number of servers, and even the possibility of slowdown with increased servers when the cost of parallelization exceeds the gain from processing a fixed amount of data with more servers.

An initial approach taken by Kognitio to facilitate Self Service users with a well-known BI front-end was to identify instances where a single server held all the data required for a query. In this scenario, the MPP system had data distributed on some attribute (e.g. the customer id), so customer-centric analysis looking at a particular customer could have all its processing done on the server holding that customer’s data. Removing the parallelization for these steps made the performance of a single report comparable to running the report across all processors with a different data distribution; this was a good example of the parallelization effort of sending code to hundreds of processors and co-ordinating their processing far outweighing any potential speedup from processing the data on hundreds of processors rather than a single one.

However, concurrent performance was improved by more than two orders of magnitude, as now each report was essentially able to run on a different processor. In fact, the bottleneck became the front-end tool’s ability to render the results, rather than the database performance.

This opportunistic improvement has now matured to allow the same optimization when the data of interest is held on a subset of servers, and to apply the same principles to data which is not distributed in a helpful way to still scale processing appropriately. Note that sizing the resource used for a piece of work is dependent on the work to be performed AND the amount of load currently on the system.

Many queries will still utilize the whole platform initially (e.g. in a large table scan which looks for data of interest on any ad-hoc filter, rather than filtering on something predictable like customer or product), and reduce the data volume of interest in doing this initial scan, but can then run the remaining query steps (joining, sorting, aggregating, etc) on a suitably-sized subset of the platform. This gives better performance for an individual query, but critically also gives massive boosts for concurrent activity.

You can read more about intelligent parallelism, specifically Asymmetric Query Processing here and Location Hash Values here.
