Mixed partitions

For Kognitio external scripts there are different ways to partition data to be processed. MIXED PARTITIONS ensures all the data of each partition is sent to the same script invocation but it’s not sorted meaning that the script will get some of partition A and then some partition B etc. This requires more thought in the coding process but saves in resources needing to sort the data before sending to the external script. If the data is very large then the inovcations may need more memory to hold multiple partitions’ data at a time.

There are three more partition strategies available in Kognitio and the corresponding script for the default, seperate and isolate partition strategies are also available.

Example: python script using mixed partitions to calculate averages

Consider the following external script using MIXED PARTITIONS:

CREATE EXTERNAL SCRIPT AvgOverPartID_Mix ENVIRONMENT python27
RECEIVES(part_id, invalue FLOAT)
PARTITION BY part_id
SENDS(part_id, avg1 FLOAT, count1 INTEGER)
MIX PARTITIONS
script S'EOF(
#
import csv,sys
input = csv.reader(sys.stdin)
result = csv.writer(sys.stdout)
#
#Initialise process control lists for all values
partid=[]     #Initialise partition list
sum1=[]       #Initialise sum list
count1=[]     #Initialise count list
avg1=[]       #Initialise avg list
#
#Process row by row
for row in input:
    # Check row is empty, if not get values
    if len(row)>0:
        id=row[0]
        value=float(row[1])
        # check count list to see if id has been seen before
        # (use id as the key to list)
        if partid.count(id)>0:
            # if so then increment relevant sum and count
            i=partid.index(id)
            sum1[i]=sum1[i]+value
            count1[i]=count1[i]+1
        else:
            # if not already seen add it to list
            partid.append(id)
            sum1.append(value)
            count1.append(1)
# end of input
#
# output all results at the end
if len(partid)>0:
    # calc all averages using sum and count lists
    avg1=[0]*len(partid)
    avg1= [x/y for x,y in zip(sum1,count1)]
    #loop through the results and write each row
    for id in partid:
        i=partid.index(id)
        result.writerow([id,avg1[i],count1[i]])
)EOF';

In the external script using the MIX PARTITIONS strategy the main change from the default partition strategy is the need to create lists in python that keep track of the all part_id values and their corresponding sum1 and count1 values. The id of the current row of input is used as the key on all the lists ensuring the appropriate sum1 and count1 element are incremented for each row. The average is created using the avg1 list at the end of the processing prior to the output of results.

Using MIX PARITIONS means the code is often more complicated to write but this is outweighed by the savings in RAM utilisation in Kognitio when there is high concurrency or input data is very large. This is because MIX PARTITIONS does not need to sort the input data (using Kognitio RAM) before it is sent to the script invocations. It just ensures that all data for a given partition value is sent to the same invocation.

The external script above can be invoked using SQL similar to the other partition strategies:

SELECT p.product_name
    ,dt1.avg1 avg_price
    ,CAST(dt1.count1 as INTEGER) numtrans
FROM ( EXTERNAL SCRIPT AvgOverPartID_Mix
       FROM ( SELECT prodno, price/100.00
              FROM demo_ret.v_ret_sale
              WHERE storeno BETWEEN 1 AND 5)
     ) dt1
JOIN demo_ret.v_ret_product p
ON dt1.part_id = p.prodno
ORDER BY 3 DESC;

Note

results produced using the mix partition strategy may appear to take significantly longer to produce than in the case of default partitions. This is because all results are output at the end of the script after all data is processed, not on a partition by partition basis. The latter can deliver the first N results quicker but overall could be slower, depending on number of partitions and amount of input data. If running comparison tests between strategies make sure that all the result rows the external script sends back to Kognitio are written to a ram only table or a view image of output is created for fair timing comparisons.