Default behaviour for external scripts

When a Kognitio external script is created without the optional configuration settings (introduced in the script creation section), any equivalent optional configuration defined in the script environment is applied to the script. Setting the optional configuration on the script environment can therefore be regarded as setting a default configuration for all external scripts that use the environment. If these settings are also omitted from the script environment, Kognitio's default external script behaviour is applied. This is discussed in detail below.

Default Invocations

Kognitio is designed for massively parallel processing and the default behaviour is set to exploit parallelism as much as possible. All nodes are utilized fully and therefore the number of script invocations is controlled by the resource scheduler. This number varies depending on the system resources available when the invocations are started.

By default there is no limit on the number of invocations, so if enough resource is available your external script call results in one script invocation per ramstore on your system. This maximises the parallelism of your code. If you need to limit the number of invocations, use LIMIT N THREADS.

Some invocations may not receive data

The script invocations allocated by the resource scheduler are started regardless of the amount of data being received and processed by the script. In some circumstances, such as when the input data set is small or the number of partitions is less than the number of invocations, some of the invocations may not receive data. This scenario must always be handled within the external script to avoid errors.
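
The following is a minimal Python sketch of how a script might guard against receiving no rows. It assumes the script reads CSV rows from an input stream and writes CSV results to an output stream; the function name run and the single numeric input column are hypothetical and chosen only for illustration. Without the guard, computing the mean of zero rows would raise a ZeroDivisionError.

```python
import csv
import io


def run(instream, outstream):
    """Write the mean of column 0; write nothing if no rows arrive.

    Hypothetical helper: assumes CSV input with one numeric column.
    Returns the number of rows that were processed.
    """
    total = 0.0
    count = 0
    for row in csv.reader(instream):
        if not row:  # ignore blank lines
            continue
        total += float(row[0])
        count += 1

    if count == 0:
        # This invocation received no data: finish cleanly with no output
        # instead of dividing by zero.
        return 0

    csv.writer(outstream, lineterminator="\n").writerow([total / count])
    return count


# An invocation with no input produces no output and raises no error.
empty_out = io.StringIO()
run(io.StringIO(""), empty_out)
print(repr(empty_out.getvalue()))

# An invocation with input emits a single result row.
out = io.StringIO()
run(io.StringIO("1.5\n2.5\n"), out)
print(out.getvalue(), end="")
```

In a deployed script the same function would be called with sys.stdin and sys.stdout rather than in-memory streams, assuming that is how the chosen script environment exchanges data.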

Examples:

Default behaviour equivalent syntax

In terms of the full external script syntax, and provided that no optional script environment configuration has been set, Kognitio's default external script behaviour means that the command

CREATE EXTERNAL SCRIPT script-name
ENVIRONMENT script-environment
[RECEIVES({input-column [data-type]}, ...)]
[PARTITION BY part-column, ... ORDER BY order-column, ...]
SENDS ({result-column [data-type]}, ...)
SCRIPT 'extscript(
  Code in language matching script-environment
)extscript';

is exactly equivalent to the command

CREATE EXTERNAL SCRIPT script-name
ENVIRONMENT script-environment
[RECEIVES({input-column [data-type]}, ...)]
[PARTITION BY part-column, ... ORDER BY order-column, ...]
SENDS ({result-column [data-type]}, ...)
SET NO THREADS LIMIT
REQUIRES 0 GB RAM
RUN ON ALL
[DEFAULT PARTITIONS]
SCRIPT 'extscript(
  Code in language matching script-environment
)extscript';

Default partitioning strategy

When a PARTITION BY clause is included in the creation statement then the parallelization strategy used is DEFAULT PARTITIONS unless another strategy is specified. See the section on partition strategy types for more information.

Under all partitioning strategies the data is hashed according to the values of the part-column(s). All rows associated with each value combination will be sent to the same ramstore in the same way as hashing is carried out for a JOIN or GROUP BY in an SQL query.
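
The effect of hashing on the part-column(s) can be illustrated with a short Python sketch. The hash function Kognitio actually uses is not described here, so CRC32 is used purely as a stand-in, and the name ramstore_for, the sample rows and the ramstore count are all hypothetical.

```python
import zlib
from collections import defaultdict


def ramstore_for(key_values, n_ramstores):
    """Map a combination of part-column values to a ramstore index.

    CRC32 is an illustrative stand-in, not Kognitio's real hash function.
    """
    joined = "\x1f".join(str(v) for v in key_values)
    return zlib.crc32(joined.encode("utf-8")) % n_ramstores


# Hypothetical input rows: (part-column value, payload)
rows = [("UK", 10), ("FR", 7), ("UK", 3), ("DE", 5), ("FR", 1)]

by_ramstore = defaultdict(list)
for country, amount in rows:
    index = ramstore_for((country,), n_ramstores=4)
    by_ramstore[index].append((country, amount))

# Every row for a given country ends up in exactly one ramstore.
for country in {c for c, _ in rows}:
    holders = [i for i, held in by_ramstore.items()
               if any(c == country for c, _ in held)]
    assert len(holders) == 1
```

A single ramstore may still hold several different value combinations, which is why the rows are sorted before they are streamed into a script invocation.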

A ramstore passes all its data to an available script invocation before another ramstore can use the invocation. If there are fewer script invocations than ramstores then the ramstores wait for a script invocation to become available.

To ensure that a script invocation receives all rows for a given partition value before the rows of the next value, Kognitio sorts the data on the part-column(s) before streaming it into the script invocation. This means that when DEFAULT PARTITIONS is used, the input data must be held and sorted in a Kognitio temporary (RAM-based) table. If the input data set is large, this can result in a significant RAM overhead. See the details on MIX PARTITIONS to avoid this issue.

When DEFAULT PARTITIONS is applied, each script invocation first receives the data for the first combination of part-column values present in the ramstore. Once the script has received all the data associated with that combination, Kognitio automatically streams the rows for the next combination into the same script invocation, without a break in the data stream. It is therefore essential that any external script code running under the DEFAULT PARTITIONS strategy detects and handles each change in the values of the part-column(s) as the combinations arrive one after the other.

Each ramstore orders its data by the part-column(s) to ensure that all data for a given value is passed into the script together.

Example: Python script using partitions to calculate averages
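
A minimal sketch of such a script is given here; it is an illustration written for this section rather than a reference implementation. It assumes that rows arrive as CSV with the partition column first and a numeric column second, already sorted by the partition column as described for DEFAULT PARTITIONS. The function name partition_averages and the sample data are hypothetical.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def partition_averages(instream, outstream):
    """Emit one (partition value, average) row per partition value.

    Assumes column 0 is the partition column, column 1 is numeric, and
    rows with the same partition value arrive consecutively.
    """
    rows = (row for row in csv.reader(instream) if row)
    writer = csv.writer(outstream, lineterminator="\n")

    # groupby starts a new group each time the partition value changes,
    # which is how the script notices the switch between partitions in
    # an unbroken stream.
    for key, group in groupby(rows, key=itemgetter(0)):
        total = 0.0
        count = 0
        for row in group:
            total += float(row[1])
            count += 1
        writer.writerow([key, total / count])


# Two partition values streamed back to back into one invocation.
sample = io.StringIO("a,1\na,3\nb,10\n")
result = io.StringIO()
partition_averages(sample, result)
print(result.getvalue(), end="")
# a,2.0
# b,10.0
```

Because groupby yields nothing for an empty stream, the same function also copes with an invocation that receives no data.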

No partitioning in external script syntax

If an external script has no PARTITION BY clause, the data is not hashed before being received by the external script invocation. The input to a script invocation is streamed directly from a ramstore, which minimizes network traffic between nodes or containers. If no local script invocation is available, a ramstore passes its portion of the data to another node or container for processing. The results sent back to Kognitio from a script invocation are likewise held in the local ramstore or streamed on to the next phase of the process.

This processing model (with no partitioning) is ideal for NoSQL-style operations applied to each row, such as complex data transformations, where there is no requirement to order or partition the data. Data movement is minimized because each input row is streamed through the external script invocation and the output result columns are generated locally where possible. Because no partitioning means the input data does not need to be sorted, the Kognitio RAM overhead is also low.
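
As an illustration of this row-at-a-time model, the following Python sketch transforms each input row independently of every other row, so no ordering or grouping is needed. It assumes CSV input with an identifier column followed by a free-text column; the function name transform_rows and the particular transformation are hypothetical.

```python
import csv
import io


def transform_rows(instream, outstream):
    """Normalise the text in column 1 and append its word count.

    Each output row depends only on the matching input row, so rows can
    be processed in whatever order they are streamed in.
    """
    writer = csv.writer(outstream, lineterminator="\n")
    for row in csv.reader(instream):
        if not row:  # ignore blank lines
            continue
        ident, text = row[0], row[1]
        words = text.split()
        writer.writerow([ident, " ".join(words).lower(), len(words)])


sample = io.StringIO('1,"  Hello   World "\n2,Kognitio\n')
result = io.StringIO()
transform_rows(sample, result)
print(result.getvalue(), end="")
# 1,hello world,2
# 2,kognitio,1
```

Nothing is buffered between rows, which matches the low RAM overhead described above.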