Creating external scripts

This section covers how to create Kognitio external scripting allowing you to run non-SQL code within Kognitio SQL queries. This allows easy access to complex analytics for the wider business audience without having to understand the underlying language. Ensure that you have script environments set up before trying these examples.

External scripts are created in the same way as other Kognitio objects such as tables and views.

Mandatory External Script Elements

The external script DDL provides full syntax but the minimum mandatory elements of an external script are:

  • A script environment to run in. This defines the programming language e.g. R, Python, Bash etc. For some examples see the scripting environment overview
  • At least one column in SENDS declaration for the results to be sent back to Kognitio
  • The SCRIPT itself containing the code to run in the selected environment

A simple Python 2.7 example to print a message can be created using

CREATE EXTERNAL SCRIPT my_schema.hello_world ENVIRONMENT python27
SENDS(result VARCHAR)
SCRIPT S'EOF(
#
#Print a message
print "Hello world"
#
)EOF';

EXTERNAL SCRIPT myschema.hello_world;

The second SQL statement runs the script. For more details on running scripts see invoking external scripts.

Optional External Script Elements

These elements aren’t necessary when creating external scripts but are heavily utlized. Use cases and links to the advanced sections for more in-depth functionality are given below.

RECEIVES - defining input

Defines column definitions for the data that is received from SQL for the external script to process. External scripts don’t always need to receive data, for example if its job is to generate data or send back information about the nodes. However, the vast majority of scripts will include a RECEIVES statment

Example 1 - print out data

This simply prints the values that it receives. Separate column definition by using commas.

CREATE EXTERNAL SCRIPT my_schema.receives_eg1 ENVIRONMENT python27
RECEIVES(value INT, value2 INT)
SENDS(value INT, value2 INT)
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Define input and output streams
input = csv.reader(sys.stdin)
output = csv.writer(sys.stdout)
#
#Write out rows from input stream to output
for num in input:
    output.writerow([num[0],num[1]])
#
)EOF';

EXTERNAL SCRIPT my_schema.receives_eg1
FROM(
    --Generate two columns of integers for input
    SELECT value, value*2 FROM VALUES BETWEEN 1 and 10
);

Example 2 - handling lots of columns

Having too many columns makes it tedious to define them all in the receives clause. If re-casting the column names from the SQL providing the data isn’t necessary, using “…” in the RECEIVES statement will take all column inputs from the SQL query. This makes the input data more dynamic. However this relies on the script to manage the input and requires more consideration of the types of data the scripts may receive.

Note

The “...” may also be used in the SENDS clause but in this case it must be used in RECEIVES clause and the number of columns and their types must match.

CREATE EXTERNAL SCRIPT my_schema.receives_eg2 ENVIRONMENT python27
RECEIVES(...)
SENDS(value INT, value2 INT)
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Define input and output streams
input = csv.reader(sys.stdin)
output = csv.writer(sys.stdout)
#
#Write out rows from input stream to output
for num in input:
    output.writerow([num[0],num[1]])
#
)EOF';

EXTERNAL SCRIPT my_schema.receives_eg2
FROM(
    --Generate two columns of integers for input
    SELECT value, value*2 FROM VALUES BETWEEN 1 and 10
);

ORDER BY - ordering input

Specifies the column(s) to order the data on. Kognitio is extremely efficient at ordering data unlike some languages such as R. Therefore using ORDER BY when creating external scripts means you don’t have to worry about ordering within the script itself. This is really useful for time series, for example.

Example - specifying order

This script orders the values before sending it to the external scripts.

Note

If used with PARTITION BY then the ordering is done within each partition similar to a analytical windowing function ordering in SQL.

For more details and advanced examples see the configuration page for ordering input.

CREATE EXTERNAL SCRIPT my_schema.order_eg1 ENVIRONMENT python27
RECEIVES(value INT) ORDER BY value
SENDS(value INT)
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Define input and output streams
input = csv.reader(sys.stdin)
output = csv.writer(sys.stdout)
#
#Write out rows from input stream to output
for num in input:
    output.writerow([num[0]])
#
)EOF';

EXTERNAL SCRIPT my_schema.order_eg1
FROM(
    --Generate a column of integers for input
    SELECT value FROM VALUES BETWEEN 1 and 10
)

PARTITION BY - how to split input

Specifies the column(s) used to split the data. It ensures that all data of a chosen partition gets sent to the same script invocation rather than allowing Kognitio to send data to any invocation (see default behaviour). Using partitions enusres that all data required for any given process is sent to the same invocation. Note that this does not guarantee the data within each partitions is ordered; use together with the ORDER BY clause to order within partitions.

Example - using partitions

This script won’t produce a different result to the example above but it groups values of the same partition together and sends them to the same invocation. When partition strategies is used, this PARTITION BY column declaration is required.

Note

Any partition column must also be specified explicitly in a RECEIVES column declaration.

CREATE EXTERNAL SCRIPT my_schema.partition_eg1 ENVIRONMENT python27
RECEIVES(value INT) PARTITION BY value
SENDS(value INT)
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Define input and output streams
input = csv.reader(sys.stdin)
output = csv.writer(sys.stdout)
#
#Set total variable
total=0
#
#Iterate through input rows, adding to total
for num in input:
    total+=int(num[0])
#
#Only write out the result if the total has increased
#i.e. if this invocation received any data
if total > 0:
    output.writerow([total])
#
    )EOF';

 EXTERNAL SCRIPT my_schema.partition_eg1
 FROM(
  --Generate integers for input
      SELECT value FROM VALUES BETWEEN 1 AND 5
      UNION ALL
      SELECT value FROM VALUES BETWEEN 1 AND 5
 );

See the partition strategy pages and examples for more details about using partitions.

PARTITIONS - how to split workload

This strategy controls how the external scripts invocations receive data partitions defined in the PARTITION BY statement and how Kognitio handles the transistion between different data partitions. If not specified then the default partition strategy is used.

Overview of PARTITIONS strategy (click on each for an example):

  • Default partitions - This sends data partitions one after another without starting up a new script invocation. This means the script itself needs to check for changes in partitions (if required).
  • Separate partitions - Same as default but inserts a blank row between partitions to help differentiate the partitions.
  • Isolate partitions - Starts up a new script invocation for each data partition so the script does not need to be aware of partition changes. This is the easiest to use if the processing of each partitions is independent of each other. However starting new invocations for each partition will introduce an increase in total run time, particularly if there are lots of distinct values for the PARTITION BY columns.
  • Mix partitions - guarantees all rows associated with a partition are sent to the same invocation but interleaves rows from different data partitions together. This is the most efficient strategy for minimizing memory overhead as rows are passed to script invocations immediately. The script needs to be able to track and process multi partitions throughout its invocation.

When developing parallel external scripts, the partition strategy should be one of the first considerations. This is explained in detail in partition strategies and when to use them.

REQUIRES K MB RAM - control memory usage

Sets the upper limit for amount of memory each invocation of the script needs. By default this is 100 MB. Increase the limit when using large packages or using very large structures. Some languages like R use more memory than others such as Python and Bash. They often need more memory to operate than 100 MB.

Follow good coding practice try to minimize memory use by dropping objects that are no longer required as the process progresses.

Note

Don’t just set REQUIRES to a high value as Kognitio uses this, along with the available free RAM, to calculate how many parallel invocations of the script to set up.

The advanced section on setting memory for external scripts has instructions on how to find out how much memory your external script is using.

Example - setting memory limit

Try adjusting the memory in this inline script to see how much memory is needed to load the Python packages. Also add in thread limitations: LIMIT n THREADS PER NODE and see how it affects available memory. A message will appear for every successful script invocation.

EXTERNAL SCRIPT USING ENVIRONMENT python27
SENDS(result VARCHAR)
REQUIRES 100 MB RAM
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Declare a large array
buckets=[0]*10000000
#
#Define output stream and success message
output = csv.writer(sys.stdout)
output.writerow(['Successfully imported packages'])
#
)EOF';

LIMIT N THREADS [PER NODE] - control limt of concurrent script invocations

This sets an upper limit for the amount of processes an external script can invoke. Depending on the amount of available resource the number of invocations will be equal or less than this limit. Whilst the resource provided to external script invocations is managed automatically, this is still useful for limiting invocations for specific tasks. For example, wanting to find the Python packages installed on each node.

Example - Single script invocation

This example will only return one row as we’ve limited it to one thread. You can also use PER NODE to limit the threads on each node or container.

CREATE EXTERNAL SCRIPT my_schema.threads_eg1 ENVIRONMENT python27
SENDS(result VARCHAR)
LIMIT 1 THREADS
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Define output stream and print a message
output = csv.writer(sys.stdout)
output.writerow(['This is a single thread'])
#
)EOF';

EXTERNAL SCRIPT my_schema.threads_eg1;

Example - Single script invocation per node

This will return up to one row for every node or container you have in your Kognitio cluster. This is useful for scripts that carry out tasks using node information such as checking installed packages

Note

This does not guarantee an invocation will definitely be run on every node. If available resources are very low some nodes may not get any invocations. It is an upper limit only.

CREATE EXTERNAL SCRIPT my_schema.threads_eg2 ENVIRONMENT python27
SENDS(result VARCHAR)
LIMIT 1 THREADS PER NODE
SCRIPT S'EOF(
#
#Import packages
import sys,csv
#
#Define output stream and print a message
output = csv.writer(sys.stdout)
output.writerow(['This is a thread'])
#
)EOF';

EXTERNAL SCRIPT my_schema.threads_eg2;

See the invocation control and examples in the advanced section for more details and use cases.

RUN ON NODES - control which node(s) to run on

Set the external script to run on spe cific nodes only. Useful for testing new packages before installing on all of the nodes and troubleshooting.

Example - specify nodes

Find a node name from here under WX2_NODE_NAME:

SELECT * FROM "SYS"."IPE_NODEINFO";

Note it down and paste it into the SQL query below. This will print out the node name which should be the same as the one you placed in the script creation and should be the only one that appears. Swap RUN ON for NOT RUN ON to get the same information from the other nodes.

CREATE EXTERNAL SCRIPT my_schema.node_name ENVIRONMENT python27
SENDS(res VARCHAR)
RUN ON <your-WX2_NODE_NAME>
SCRIPT S'EOF(
#
#Package import
import os,sys,csv
#
#Output streams
result = csv.writer(sys.stdout)
#
#Write out results
hs=os.system('hostname')
if hs!=0:
    result.writerow([hs])
#
)EOF';

EXTERNAL SCRIPT my_schema.node_name;

See the advanced section on controlling nodes for an example on finding out which Python packages are installed on each node of your system.