Partition strategies for Kognitio external scripts

This section introduces the different partition strategies for Kognitio external scripts, along with guidance on when to use each. If you are new to external scripts, start by creating and invoking some simple scripts before moving on to more advanced configuration such as the partition strategies covered here.

To control how partitions are fed to the external script, include a partition strategy clause in the external script interface SQL:

{DEFAULT | SEPARATE | ISOLATE | MIX} PARTITIONS

Each of these partition strategies behaves slightly differently:

  • DEFAULT PARTITIONS: each script invocation receives the data for the first combination of part-column(s) values present in the ramstore. Once the script has received all the data associated with the first combination, Kognitio automatically streams the data rows for the next combination into the same script invocation, without any break in the data stream. It is therefore essential that any external script code running under the DEFAULT PARTITIONS strategy handles multiple combinations of part-column(s) values one after the other. Typically the script tracks the current and previous part-column(s) value and, on detecting a change, sends any output for the completed value, records the new value as current and resets any per-partition state. For more details see the section on default behaviour, and for an example see python script using default partitions to calculate averages
  • SEPARATE PARTITIONS: identical to DEFAULT PARTITIONS except that all the data for one partition is sent, followed by a blank-line separator, before the data for the next partition is sent, and so on. The last partition may or may not be followed by a blank-line separator. For an example see python script using separate partitions to calculate averages
  • ISOLATE PARTITIONS: starts a new invocation of the script for each different combination of part-column(s) values. All the rows for one partition are sent to the same isolated script invocation. This is the easiest partition strategy to implement because the external script code does not have to track the part-column(s) values and detect changes. However, there is significant overhead from starting many script invocations, especially if there are a large number of small partitions or the start-up time for the external environment is high. This needs to be weighed against the processing time for the script itself. For an example see python script using isolate partitions to calculate averages
  • MIX PARTITIONS: although all the rows for a given partition will be sent to the same script invocation, this strategy allows rows from different partitions to be mixed together as they are received by the invocation. This is the most efficient way to send data into external scripts, as rows can be streamed directly into the appropriate script invocation without having to wait for a particular partition to be processed, which reduces the memory footprint within Kognitio. Within the script itself the complexity (and memory requirement) increases, because the code must track output for all combinations of part-column(s) values simultaneously and ensure each input row is assigned to and processed as the correct combination. For an example see python script using mixed partitions to calculate averages
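As a minimal sketch of the DEFAULT PARTITIONS pattern described above, the script below tracks the previous part-column value and emits a result on each change. It assumes tab-separated rows with the part-column first and a numeric value second (mirroring the averaging examples referenced in this section); the actual column layout depends on your script interface definition.

```python
import sys

def average_per_partition(lines):
    """Yield (partition, average) pairs from tab-separated (partition, value)
    rows that arrive grouped by partition with no separator between groups,
    as DEFAULT PARTITIONS delivers them."""
    prev = None          # part-column value currently being accumulated
    total = 0.0
    count = 0
    for line in lines:
        part, value = line.rstrip("\n").split("\t")
        if part != prev:
            if prev is not None:
                yield prev, total / count   # partition changed: emit result
            prev, total, count = part, 0.0, 0
        total += float(value)
        count += 1
    if prev is not None:
        yield prev, total / count           # flush the final partition

if __name__ == "__main__":
    for part, avg in average_per_partition(sys.stdin):
        sys.stdout.write("%s\t%f\n" % (part, avg))
```

The same function works unchanged under ISOLATE PARTITIONS, where each invocation simply sees a single partition value on stdin.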
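Under SEPARATE PARTITIONS the blank-line separator lets the script loop over whole partitions instead of tracking value changes. A sketch under the same assumed tab-separated layout, allowing for the possibly missing separator after the last partition:

```python
import sys

def partitions(lines):
    """Group rows into per-partition lists, splitting on the blank-line
    separator that SEPARATE PARTITIONS inserts between partitions."""
    group = []
    for line in lines:
        if line.strip() == "":      # blank separator: current partition done
            if group:
                yield group
            group = []
        else:
            group.append(line.rstrip("\n"))
    if group:                        # last partition may lack a trailing blank
        yield group

def average(group):
    """Average the numeric second column of one partition's rows."""
    values = [float(row.split("\t")[1]) for row in group]
    part = group[0].split("\t")[0]
    return part, sum(values) / len(values)

if __name__ == "__main__":
    for group in partitions(sys.stdin):
        part, avg = average(group)
        sys.stdout.write("%s\t%f\n" % (part, avg))
```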

When to use the different partition strategies

Kognitio chose DEFAULT PARTITIONS as the default because it is a good starting point for the majority of external script work: handling partition changes is relatively straightforward in most code, data streams efficiently through the script invocations, and the script stays simple by handling one partition value at a time. However, there are exceptions where other strategies should be considered from the outset:

  • Complex Processing Environments Environments such as R are typically used to create very complex analytical processes, and in these cases the processing time for the analytics can be much longer than the start-up cost of the environment. Using ISOLATE PARTITIONS then makes coding easier, as you do not have to handle multiple partitions in each script invocation. In general Kognitio recommends ISOLATE PARTITIONS for all R external scripts, unless the processing is extremely straightforward and the number of partitions is very large compared to the number of ram stores on the system; in those cases it is likely that another software product is better suited to the task than R.
  • Third party data import strategy Many third-party software products prefer to import all their data at the outset, then process it before outputting results. Some, such as R, are not particularly efficient at receiving data row by row or as subsets (i.e. by partition). Handling data input should not be the focus of effort when using external scripting in Kognitio: streaming or importing the data should be straightforward. If the default behaviour of streaming one partition after another is causing too much complexity, consider SEPARATE PARTITIONS for software that can loop easily on a blank row, or ISOLATE PARTITIONS to remove the problem of partition streaming entirely.
  • Dense Aggregation Processes A large input dataset with a small result set, typically an aggregation at the partition level where the number of partition values is relatively small, can lead to a large memory requirement within Kognitio. If DEFAULT PARTITIONS is used, the results of the query forming the input dataset must be held in memory while each partition value is processed by the script invocation; for large input datasets this overhead can use too much RAM. In this case consider MIX PARTITIONS: input data is streamed directly into the script invocations and the Kognitio RAM footprint is reduced. Within the script, output for all partition values must be stored simultaneously, and the code must be written so that each input row is processed against the correct partition as it is encountered. This is typically achieved by creating an array keyed by the partition value; once all input has been streamed into the script and processed, the whole array is output in one step.
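The accumulate-then-emit pattern described for MIX PARTITIONS can be sketched as follows, again assuming tab-separated (part-column, numeric value) rows; here a dictionary keyed by the partition value plays the role of the output array:

```python
import sys
from collections import defaultdict

def mixed_averages(lines):
    """Accumulate per-partition sums and counts as interleaved rows arrive
    (MIX PARTITIONS gives no ordering guarantee between partitions), then
    return one average per partition once input is exhausted."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        part, value = line.rstrip("\n").split("\t")
        totals[part] += float(value)
        counts[part] += 1
    return {part: totals[part] / counts[part] for part in totals}

if __name__ == "__main__":
    # All partition results are held in memory and emitted in one step
    # after stdin is exhausted.
    for part, avg in sorted(mixed_averages(sys.stdin).items()):
        sys.stdout.write("%s\t%f\n" % (part, avg))
```

Note the trade-off the section describes: the script's memory use grows with the number of distinct partition values, in exchange for a smaller RAM footprint inside Kognitio.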