Partitioning by column in external scripts

This advanced example shows how to organize the data sent to Kognitio external scripting with the default behaviour. Typically you’d want all the data of one kind to go to one invocation of an external script to be processed together, each of these defined by some column in your input data is called a partition.

Read about the different partition strategies and the advanced configuration for other ways to get the most from external scripts.

Example: python script using partitions to calculate averages

The script below shows a simple python example for calculating the average of the invalue field received by the script. The script is partitioned on the part_id field and this means an average (avg1) and count (count1) is calculated and sent back from the external script invocations for each part_id value:

CREATE EXTERNAL SCRIPT AvgOverPartID_Def ENVIRONMENT python27
RECEIVES( part_id, invalue FLOAT)
PARTITION BY part_id
SENDS( part_id, avg1 DECIMAL(8,2), sum1 DECIMAL(8,2),count1 INT)
REQUIRES 1 GB RAM
SCRIPT S'EOF(
import csv,sys
input = csv.reader(sys.stdin)
result = csv.writer(sys.stdout)
#
#Initialise process control vars
prevpartid=""   #Holds the previous partition id
sum1=0           #Initialise sum
count1=0         #Initialise count
#
# Process one row at a time.
for row in input:
    # Check row is not empty and set nstop
    if len(row)>0:
        # Has part id changed?
        # if not increment sum and count
        if row[0]==prevpartid:
            sum1=sum1+float(row[1])
            count1=count1+1
        # if it has write out sum and count and reset
        # do not write out if it is first row of input or if the sum and count are still zero
        else:
            if prevpartid!="" and count1>0 and sum1>0:
                avg1=sum1/count1
                result.writerow([prevpartid,avg1,sum1,count1])
            prevpartid=row[0]
            sum1=float(row[1])
            count1=1
#
#write out final result after loop finished
if count1!=0:
    avg1=sum1/count1
    result.writerow([prevpartid,avg1,sum1,count1])
)EOF';

Within the Python script the previous value of part_id is stored (in prevpartid) so that when the value of part_id changes the average and count is calculated and sent back to Kognitio then the script varaibles are re-initialised. Keeping track of changes in part_id value is necessary when the DEFAULT PARTITIONS strategy is used. See the partitioning strategies section to understand how other partitioning strategies can be applied to the same example.

Note

the data type of part_id is not defined, it is therefore taken from data type in the query results used as input into the external script. This data type is simply carried through to the results the script sends back to Kognitio because the values for part_id are just copied and not changed by the python code.

The script can be used to calculate averages of any field used as the partition field. For example, in the SQL query below, average product price is produced:

SELECT p.product_name
    ,dt1.avg1 avg_price
    ,CAST(dt1.count1 as INTEGER) numtrans
FROM ( EXTERNAL SCRIPT AvgOverPartID_Def
       FROM ( SELECT prodno, price/100.00
              FROM demo_ret.v_ret_sale
              WHERE storeno BETWEEN 1 AND 5)
     ) dt1
JOIN demo_ret.v_ret_product p
ON dt1.part_id = p.prodno
ORDER BY 3 DESC;

Note

this example is based on the Kognitio retail demo data set. For more information about loading this for your chosen deployment, refer to the getting started guide

This external script example is for illustrating partition strategies with a simple coding example - namely calculating averages. It is much more efficient to use SQL for simple calculations, such as averaging, if possible. In this case the SQL below gives the same result as the external script call above:

SELECT p.product_name
    ,AVG(price)/100.00 avg_spend
    ,COUNT(1) numtrans
FROM demo_ret.v_ret_sale s
JOIN demo_ret.v_ret_product p
ON s.part_id=p.prodno
WHERE storeno BETWEEN 1 AND 5
GROUP BY p.product_name
ORDER BY 3 DESC;