How to use Kognitio external scripts with SQL

On this page we’ll show you how to perform common tasks with Kognitio external scripts using examples. We’re going to be using the Star Trek data set used in the concepts examples page.

How to execute an external script inline or create an object for later use

External scripts can be run inline which is good for development or testing. However, they can also be created in Kognitio as an object in a schema. This is often preferred in production where you may want to deploy your script for use with different data sources or for use by other people.

For more details on the syntax, see the full external script DDL

To execute an external script inline we just need to tell it what script environment to use and define the input data with the FROM SQL input query after the script declaration:

EXTERNAL SCRIPT USING ENVIRONMENT local_python3
RECEIVES(...)
SENDS(...)
LIMIT 1 THREADS
SCRIPT S'EOF(
import csv, sys
input = csv.reader (sys.stdin)
output = csv.writer (sys.stdout)
data_array = []
#
# Read in data
for input_row in input:
    data_array.append(input_row)
#
# Output data
for output_row in data_array:
    output.writerow(output_row)
)EOF'
FROM(SELECT *
     FROM star_trek.character_list);
EXTERNAL SCRIPT USING ENVIRONMENT local_r
RECEIVES(...)
SENDS(...)
LIMIT 1 THREADS
REQUIRES 200 MB RAM
SCRIPT S'EOF(
require(data.table)
# Read in data
input_data<-fread(cmd='cat -', header = FALSE)
#
# Output data
fwrite(c(input_data),sep=',',row.names = FALSE, col.names = FALSE)
)EOF'
FROM(SELECT *
    FROM star_trek.character_list);

The syntax for creating an external script for later use is very similar to the inline example above. We just need to specify a schema (which defaults to the current schema) and a script name. Unlike the inline version, no input data is specified since this will be provided when the script is executed:

CREATE EXTERNAL SCRIPT star_trek.python_simple_io ENVIRONMENT local_python3
RECEIVES(...)
SENDS(...)
LIMIT 1 THREADS
SCRIPT S'EOF(
import csv, sys
input = csv.reader (sys.stdin)
output = csv.writer (sys.stdout)
data_array = []
#
# Read in data
for input_row in input:
    data_array.append(input_row)
#
# Output data
for output_row in data_array:
    output.writerow(output_row)
)EOF';

To execute the external script we use the following SQL where we provide the input data via the FROM SQL input query:

EXTERNAL SCRIPT star_trek.python_simple_io
FROM(SELECT *
     FROM star_trek.character_list);

If we want to run the same script but with a different data set, we can do so. For example, with only Vulcan characters:

EXTERNAL SCRIPT star_trek.python_simple_io
FROM(SELECT *
     FROM star_trek.character_list
     WHERE species = 'Vulcan');
CREATE EXTERNAL SCRIPT star_trek.r_simple_io ENVIRONMENT local_r
RECEIVES(...)
SENDS(...)
LIMIT 1 THREADS
REQUIRES 200 MB RAM
SCRIPT S'EOF(
require(data.table)
# Read in data
input_data<-fread(cmd='cat -', header = FALSE)
#
# Output data
fwrite(c(input_data),sep=',',row.names = FALSE, col.names = FALSE)
)EOF'
FROM(SELECT *
    FROM star_trek.character_list);

To execute the external script we use the following SQL where we provide the input data using the FROM clause in the SQL query:

EXTERNAL SCRIPT star_trek.r_simple_io
FROM(SELECT *
     FROM star_trek.character_list);

If we want to run the same script but with a different data set, we can do so. For example, with only Vulcan characters:

EXTERNAL SCRIPT star_trek.r_simple_io
FROM(SELECT *
     FROM star_trek.character_list
     WHERE species = 'Vulcan');

(Back to top)

Using external scripts as SQL derived tables

The results of an external script may be utilized in other SQL queries. This allows you to use external script output as a derived table in any SQL query. It is particular useful when you want to:

  • Join results back to look-up tables - for example when you want use keys in the script and then denormalise for presentation.

  • Return multiple result sets by tagging each line with the result set it belongs to.

  • Pipeline scripts by feeding the output of one into the input of the next one.

In the example below we’re going to use our simple read in and out script with two minor edits:

  • change the output delimiter in the script to ^. Kognitio will then interpret the comma separated output as a single string and return it in a single column.

  • define SENDS to only return a single results column.

We can then parse the output column in SQL using the delimiter and the STRCHOP function.

CREATE EXTERNAL SCRIPT star_trek.python_simple_io2 ENVIRONMENT local_python3
RECEIVES(...)
SENDS(results VARCHAR)
LIMIT 1 THREADS
SCRIPT S'EOF(
import csv, sys
input = csv.reader (sys.stdin)
output = csv.writer (sys.stdout,delimiter ='^')
data_array = []
#
# Read in data
for input_row in input:
    data_array.append(input_row)
#
# Output data
for output_row in data_array:
    output.writerow(output_row)
)EOF';

-- Use the external script as a derived table
SELECT STRCHOP(res,'^',1,1) title
      ,STRCHOP(res,'^',2,2) main_series
      ,CAST(STRCHOP(res,'^',3,3) as int) yob
      ,STRCHOP(res,'^',4,4) secies
      ,STRCHOP(res,'^',5,5) name
FROM( EXTERNAL SCRIPT star_trek.python_simple_io2
      FROM(SELECT *
           FROM star_trek.character.character_list)
)dt;
CREATE EXTERNAL SCRIPT star_trek.r_simple_io2 ENVIRONMENT local_r
RECEIVES(...)
SENDS(results varchar)
LIMIT 1 THREADS
REQUIRES 300 MB RAM
SCRIPT S'EOF(
require(data.table)
# Read in data
input_data<-fread(cmd='cat -', header = FALSE)
#
# Output data
fwrite(c(input_data),sep='^',row.names = FALSE, col.names = FALSE)
)EOF';

-- Use the external script as a deviced table
SELECT STRCHOP(res,'^',1,1) title
      ,STRCHOP(res,'^',2,2) main_series
      ,CAST(STRCHOP(res,'^',3,3) as int) yob
      ,STRCHOP(res,'^',4,4) secies
      ,STRCHOP(res,'^',5,5) name
FROM( EXTERNAL SCRIPT star_trek.r_simple_io2
      FROM(SELECT *
           FROM star_trek.character.character_list)
)dt;

Note it is possible to embed inline anonymous scripts as derived tables with the full external script code written within the SQL statement but for large scripts this can become unweildy. For production purposes we recommend that external scripts used regularly as derived tables should be created as schema objects.

(Back to top)

Using input of undefined type

We saw briefly in the previous example that we can supply different data sets to an external script. In this example we show how to use a perl script to parse text from an input column of undefined type.

The script has no declaration of data type in the RECEIVES clause so this is inferred automatically from the input data by Kognitio. This means that the script can be used with any input data type (provided the perl code can read it). We show three different inputs: a simple sentence, google search url and results of an SQL query

CREATE EXTERNAL SCRIPT myschema.word_split ENVIRONMENT perl
RECEIVES (sometext)
SENDS(words VARCHAR(200))
SCRIPT S'EOF(
while(<>)
{   #remove trailing \n
    chomp();
    #
    #replace punc. and numbers with white space
    s/[\,\.\!\_\{\}\(\)\#\=\+\:\\0-9]/ /g;
    #
    # split on whitespace of any length
    foreach $c (split(/ +/))
    {
        #print if only made up of letters
        if($c =~ /^[a-zA-Z]+$/)
        {
            print "$c\n"
        }
    }
}
)EOF';

--Query 1 simple sentence
EXTERNAL SCRIPT myschema.word_split
FROM (SELECT 'The quick brown box jumped over the lazy dog.');

--Query 2 google search url
EXTERNAL SCRIPT myschema.word_split
FROM (SELECT 'https://www.google.co.uk/?gws_rd=ssl#q=aesop%27s+fables+and+greek+myths');

--Query 3 results from query
EXTERNAL SCRIPT myschema.word_split
FROM (SELECT command FROM sys.ipe_command);

The first 10 results from each query are shown below:

../../_images/word_split.png

(Back to top)

External scripts with variable input

When the RECEIVES(...) clause is used this means that the input into the script can be variable. All column information is inferred from the SQL input query so use column aliases in the input query to define column names.

A script can find out what columns have been supplied by setting the INPUT clause to 'column_headers 1, column_header_format 0'. This tells Kognitio to send a header row containing the column names before sending the data. column_header_format 0 is just a list of the column names, see formatting input and output for details of how to get the data types as well.

In this example we use the Python json library to convert our Star Trek data into a json object:

CREATE EXTERNAL SCRIPT star_trek.to_json ENVIRONMENT local_python3
RECEIVES(...)
INPUT 'column_headers 1, column_header_format 0'
SENDS(out varchar(32000))
OUTPUT 'fmt_field_separator "^", fmt_naive_delimiting 1'
LIMIT 1 THREADS
SCRIPT S'EOF(
import csv, sys, json
input = csv.reader(sys.stdin)
output = []
firstline = True

for input_row in input:
    if firstline:
        header = input_row
        firstline = False
    else:
        rec = {}
        # iterate over the header row creating a dictionary entry for each column
        for index, col in enumerate(header):
            rec[col] = input_row[index]
        output.append(rec)

print(json.dumps(output))
)EOF';

Now we can use the external script to process two different queries:

EXTERNAL SCRIPT star_trek.to_json
FROM(SELECT title, main_series AS series, name
     FROM star_trek.character_list WHERE main_series = 'TNG');

returns

[{
        "Title": "Captain",
        "Series": "TNG",
        "Name": "Jean-Luc Picard"
}, {
        "Title": "Commander",
        "Series": "TNG",
        "Name": "Data"
}, {
        "Title": "Commander",
        "Series": "TNG",
        "Name": "Worf"
}, {
        "Title": "Doctor",
        "Series": "TNG",
        "Name": "Beverly Crusher"
}, {
        "Title": "Lieutenant",
        "Series": "TNG",
        "Name": "Wesley Crusher"
}]

and:

EXTERNAL SCRIPT star_trek.TO_JSON
FROM(SELECT species, COUNT(*) AS number
     FROM star_trek.character_list GROUP BY 1);

returns

  [{
          "Species": "Vulcan",
          "Number": "3"
  }, {
          "Species": "Human",
          "Number": "6"
  }, {
          "Species": "Android",
          "Number": "1"
  }, {
          "Species": "Klingon",
          "Number": "2"
  }]

Note: the output has been pretty printed, the actual output is more compact.

(Back to top)