Ordering data without partitioning

In some circumstances it might be preferable to avoid partitioning in Kognitio external scripting. For example if a data set is large and data redistribution based on the partition hash is therefore costly. .. cannot easily be .. hashed on the required part-column(s) prior to being fed to the script .. invocations.

In this example the ORDER BY clause can be useful to stream into the script invocations in an order that replicates partitioning. The ordering is done in the ramstore prior to being received by the script invocation. The advantage is that this ordering is done in each ramstore where the data already resides so data transfer to satisfy the partition hashing is avoided.

There is an additional requirement to process the script output as multiple rows for any part-column value (one from each script invocation) must be handled appropriately when sent back into Kognitio.

Example python script using ordering only to obtain averages and first seen date

The external script below is identical to example introduced in the previous example except that the PARTITION BY clause is replaced by a corresponding ORDER BY clause:

CREATE EXTERNAL SCRIPT AvgOverPartID_Order ENVIRONMENT python27
RECEIVES( part_id , invalue FLOAT, datevalue DATE)
ORDER BY part_id, datevalue
SENDS( part_id , avg1 FLOAT,count1 int, firstdate date)
SCRIPT S'EOF(
import csv,sys
input = csv.reader(sys.stdin)
result = csv.writer(sys.stdout)
#
#Initialise process control vars
prevpartid="" #Holds the previous partition id
sum1=0 #Initialise sum
count1=0 #Initialise count
#
# Process one row at a time.
for row in input:
    # Check row is not empty and set nstop
    if len(row)>0:
        # Has part id changed?
        # if not increment sum and count
        if row[0]==prevpartid:
            sum1=sum1+float(row[1])
            count1=count1+1
        # if it has write out sum and count and reset
        # do not write out if it is first row of input
        else:
            if prevpartid!="" and count1>0 and sum1>0:
                avg1=sum1/count1
                result.writerow([prevpartid,avg1,count1,firstdate])
            prevpartid=row[0]
            sum1=float(row[1])
            count1=1
            # know first date is in the first row as ordered data
            firstdate=row[2]
#
#write out final result after loop finished
if count1!=0:
    avg1=sum1/count1
    result.writerow([prevpartid,avg1,count1,firstdate])
)EOF';

The SQL query to obtain results using the Kognitio retail data is similar to that in the example where PARTITION BY is used except it must include an additional GROUP BY statement and SUM and MIN to aggregate the results for each product that come from each script invocation:

SELECT product_name,
,part_id prod_no
,CAST(SUM(avg1*count1)/SUM(count1) AS DEC(6,2)) avg_price
,SUM(count1) numtrans
,MIN(firstdate) first_sold_date
FROM (
    EXTERNAL SCRIPT AvgOverPartID_Order
    FROM (
        SELECT prodno, price/100.00, saledate
        FROM demo_ret.v_ret_sale
        WHERE storeno BETWEEN 1 AND 5
    )
) dt1
JOIN demo_ret.v_ret_product p
ON dt1.part_id=p.prodno
GROUP BY product_name, part_id
ORDER BY part_id;