Forum

Using Python in external scripts
Contributor
Offline
User avatar
Posts: 21
Joined: Mon Oct 07, 2013 12:15 pm

Filtering out empty results

by ChakLeung » Thu Jun 05, 2014 11:00 am

Hi all,

This was a common situation when users were initially introduced to external scripting so a brief solution was produced and is documented below.

Problem
Due to the nature of the parallel data distribution of Kognitio, in Python (as well as other languages) some instances of the script may receive no data and produce empty results. The amount of empty results would be dependent upon the configuration of the system such as the number of CPUs and RAM stores.

Example and solution
For example if a 128 core system were used with 15 partitions needed by the external script. There would be 128-15 = 113 partitions which received no data which is substantial and leads to redundancy in data streaming and storing as well as confusion when interpreting the results window as, in the mentioned example, the desired output would be overshadowed by the large amount of empty results.

As a solution consider the following script written in Python

Code: Select all

create external script <SCRIPT_NAME> environment PYTHON
receives (<ID> int, <VALUE> int) partition by <ID>
sends (<ID> int, Sum1 int)
script S'endofr(

#Import and set stream 
import csv, sys
reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout)

id = 0
sum = 0
#Calculated a running total and record the ID
for row in reader:
    id = int(row[0])    
    sum += int(row[1])

flag=0
#Toggle the following 2 lines to filter results
#if id == 0 and sum == 0:
#    flag = 1
if flag == 0: 
    writer.writerow([id,sum])

)endofr';

select *
 from
 (external script <SCRIPT_NAME>
       from (select top 15 <ID>, <VALUE> from <SCHEMA>.<TABLE/VIEW>)) EXT1;  
NOTE: Tested on a 128 core system, adjust the number in the top statement for different systems to see the effect. Any number significantly less than the amount of cores would be ideal to show the difference in output.

The above code is simply reading in an ID and calculating a running total on the values that are sent in for each partition (in this case the partitioning is done on the ID column).
The place of interest in the code is the flag variables, toggle the 2 commented lines of code to see the difference. By setting a default flag of 0 and changing this flag if the variables to be written out are empty will prevent scripts from writing out results which had no data sent to them. This can also be done with a list or array ("if list =[]", essentially if the list is empty), instead of:

Code: Select all

if id == 0 and sum == 0:
This would be more practical in scripts which all the data from Kognitio has to be saved into a list before doing any calculations or outputting.
Reply with quote Top

Who is online

Users browsing this forum: No registered users and 1 guest

cron