Controlling external script invocations

When listing files using a Kognitio external script running in a bash script environment, there’s sometimes no need for multiple invocations of the script to be run, particularly when it is combined with the “run on node” capability.

Example: list the files in a Linux directory on every node

The external script for listing files on a single node can easily be extended to run on every node in the Kognitio system by using the LIMIT n THREADS PER NODE syntax:

CREATE SCHEMA test;
SET SCHEMA test;

CREATE EXTERNAL SCRIPT test.list_files_by_node ENVIRONMENT bash
RECEIVES(directory VARCHAR(32000))
SENDS(filename VARCHAR(100))
LIMIT 1 THREADS PER NODE
SCRIPT S'EOF(
  while read -r dir;
    do ls -a "$dir"
  done
)EOF';

When the SQL statement:

EXTERNAL SCRIPT test.list_files_by_node
FROM (SELECT '/')

is run, only one set of results is still produced. This is because only one row of data is provided, so only one script invocation receives the value ‘/’. The other invocations receive no data and therefore send no data back.
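
The behaviour can be reproduced outside Kognitio with a small bash sketch. It assumes the script body is wrapped in a hypothetical function, list_dirs, and simulates two invocations: one that receives the single input row on standard input and one that receives nothing.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the script body above; not part of Kognitio.
list_dirs() {
  while read -r dir; do
    ls -a "$dir"
  done
}

# Invocation that receives the single row '/': prints the listing of /.
with_row=$(printf '/\n' | list_dirs)

# Invocation that receives no rows: the loop body never runs, so nothing is printed.
without_row=$(list_dirs < /dev/null)

echo "lines from invocation with data:    $(printf '%s\n' "$with_row" | wc -l)"
echo "bytes from invocation without data: ${#without_row}"
```

The second invocation produces an empty result, which mirrors why the nodes that receive no row contribute nothing to the output.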

To overcome this, the best option is to pass the directory into the external script as an environment variable. This is the most straightforward way of passing simple information to every invocation of an external script.

The external script above is modified slightly to report the node and the directory it has processed. There is no longer any need for a RECEIVES clause, as the only input is the directory path, which is passed in the WX2_DIR environment variable:

DROP EXTERNAL SCRIPT test.list_files_by_node;
CREATE EXTERNAL SCRIPT test.list_files_by_node ENVIRONMENT bash
SENDS(hostname varchar(100), directory varchar(100), filename VARCHAR(100))
LIMIT 1 THREADS PER NODE
SCRIPT S'EOF(
    hs=$(hostname)
    ls -a "$WX2_DIR" | while read -r line;
        do echo "$hs,$WX2_DIR,$line";
    done;
)EOF';

The names of any environment variables passed into the script by Kognitio are automatically converted to upper case and prefixed with WX2_, so the script must refer to them by that name. In the example above, the environment variable WX2_DIR must contain a valid directory path. The script is executed using the SQL:

EXTERNAL SCRIPT list_files_by_node
PARAMETERS dir='/'
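
Because the script depends entirely on WX2_DIR, it may be worth guarding against a missing or invalid value. The following is a minimal bash sketch of such a check that can be run outside Kognitio. The function name list_files_checked is hypothetical, and writing the error to stderr and returning a non-zero status is an assumption about what is convenient, not documented Kognitio behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helper: same listing logic as the script above, plus a guard.
list_files_checked() {
  # Refuse to run when WX2_DIR is unset, empty, or not a directory.
  if [ -z "${WX2_DIR:-}" ] || [ ! -d "$WX2_DIR" ]; then
    echo "WX2_DIR is not a valid directory: '${WX2_DIR:-}'" >&2
    return 1
  fi
  local hs
  # Fall back to uname -n on systems without the hostname command.
  hs=$(hostname 2>/dev/null || uname -n)
  ls -a "$WX2_DIR" | while read -r line; do
    echo "$hs,$WX2_DIR,$line"
  done
}

# Simulate locally what the PARAMETERS dir='/' clause provides to the script.
WX2_DIR='/' list_files_checked
```

Setting WX2_DIR by hand here only stands in for Kognitio; inside the database the variable is populated from the PARAMETERS clause.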

Note that the WX2_ prefix is not used in the SQL. This script can be used to check that all nodes are in sync by counting the number of nodes that contain each file, using the SQL below. For every file, the count should match the number of nodes in the Kognitio system:

SELECT directory,
    filename,
    COUNT(hostname) num_nodes
FROM( EXTERNAL SCRIPT list_files_by_node
      PARAMETERS dir='/') dt1
GROUP BY directory, filename
ORDER BY 3;
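
The same counting logic can be tried locally on the comma-separated rows the script emits. The bash sketch below runs awk over made-up sample rows (the host names node1 and node2 and the file names are purely illustrative) and prints any file seen on fewer than the expected number of nodes. The variable expected_nodes is hypothetical and would be set to the size of your own system.

```shell
#!/usr/bin/env bash
# Illustrative sample rows in the script's output format: hostname,directory,filename.
sample_rows='node1,/,etc
node2,/,etc
node1,/,only_here'

expected_nodes=2   # hypothetical: number of nodes in the system

# Count rows per (directory, filename) and report files missing from some node.
missing=$(printf '%s\n' "$sample_rows" |
  awk -F',' -v n="$expected_nodes" '
    { count[$2 "," $3]++ }
    END { for (k in count) if (count[k] < n) print k "," count[k] }')

echo "$missing"
```

In SQL, the equivalent filter would be a HAVING clause on the count; the sketch simply makes the grouping step visible.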