Visiting my neighbour S3 and his son JSON

23 May 2017
Posted By : Chak Leung

They also live across the river and there’s no bridge…so we’ll just make our own!

Amazon’s S3 is a popular and convenient storage solution, especially for those with big data, and the challenge is often connecting to a store that has been building up over days, weeks, months or years. There are many ways to do this: you could curl/wget the data, but then it arrives in its raw form, and converting it for use with databases isn’t always simple.
With Kognitio external connectors and tables you can connect to and parse it on the fly.

Let’s see how we can do this with JSON data and external tables in Kognitio, which let us use externally stored data as if it were a local table. We’ll also build on this in a later blog post, where we’ll work with some larger data.

Why would you want to use this though? What are the benefits?

•  Access data stored elsewhere without having to physically move it - streamlining the ETL process
•  Use data as if it were stored locally - no need to rewrite proprietary scripts and processes
•  Easily update with the newest data and improve access times using memory images

Convinced? Great! Here’s what you’ll need to get started:

•  An S3 bucket with JSON data
•  Access and secret keys for the S3 bucket
•  Kognitio with support for external tables

You can read more about external tables in chapter 8 of the Kognitio guide here.

In a nutshell, the three typical components of this process are: stored data, a connector and an external table.

Let’s try it with our sample JSON file containing a customer’s form data.

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address":
    {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber":
    [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

Source: https://www.sitepoint.com/customer-form-json-file-example/

The first step is to put this file into S3; you can use Bash’s curl or something like the boto package in Python to do this. You can also use a tool such as S3 Browser to confirm that the file is in your bucket.

Now we can start building connections with external tables in Kognitio to access them. External tables don’t actually import the data from their source but bridge a connection to them. This enables us to access the data without having to go through arduous ETL processes to include the newest batch.

So firstly we need the bridge, an S3 based connector:

create connector my_test_connector source s3
target 'bucket <your_bucket> ,
accesskey <your_s3_access_key> ,
secretkey <your_s3_secret_key> ,
connectors_on_nodes "<node_name>",
proxy "<your_proxy_address>:<port>"';

The connectors_on_nodes option is only needed if your nodes have restricted external access and you have to tell the connector which ones can reach the outside world. The same goes for the proxy setting: use it only if required. Once you’ve created your connector, you can see it in the systems pane:

Connector in schema

Let’s see if our connector works. Run:

external table from my_test_connector target 'list sample.json';

Files in S3

And we can see the file we placed in S3 and other information about it. Note that you can use the asterisk “*” like in Bash to include all JSON files e.g. “*.json”. If you have JSON files holding different information, you can always tag them with a prefix e.g. “prefix1*.json”. This will be useful when you need to build several external tables with different JSON sources.
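
For example, to list every JSON file in the bucket rather than a single named file:

external table from my_test_connector target 'list *.json';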

Creating an external table uses the following syntax:

create external table <your_schema>.<your_table> (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,lastName
    ,age
"';

The first part of the query, enclosed in brackets, defines the columns much like a typical CREATE TABLE statement in SQL; these definitions need to match, in order and type, the entries further down in the format string.
The next few lines specify the connector you’re using, the target file(s), the error name for use with ipe_conv_error and how invalid records are handled.
The last part tells the external table that we’ll be reading the JSON format. This is done by setting fmt_json to 1 and then supplying a format string that tells it what to look for. The APPLY() function takes a list of functions and applies them to every column, and firstvalid() returns the first evaluated argument which does not cause an error. So “APPLY(firstvalid(?,null))” applies firstvalid() to every column, nulling an entry if it doesn’t exist instead of erroring. Lastly we define the columns we want it to look for. The inputbasename() and objectnumber() functions put the filename and object number into columns, which we’ve defined as sourcefile and object_id further above. Then come the JSON names themselves. JSON files can of course contain quite some depth, so deeper entries are reached by extending the path: if you had a JSON file containing student details, “Student[].year_group” would get the year group from the student array, and if year_group were itself an array inside “Student[]” you could extend it in the same way, “Student[].year_group[].name”.
Now we can access the data. External tables appear alongside regular tables in the schemas but have a red icon instead; hovering over one with the mouse cursor will identify it as an external table:

External table in schema

You might have noticed that this depends on knowing what’s inside your JSON file, i.e. the entry names, and with larger files searching through them can be quite daunting. A feature to aid this discovery is prettyprint(), which returns a readable view of whatever you supply to it, e.g. prettyprint(address). It is defined along with the rest of the column definitions after format. Let’s try it on the address in our sample JSON file using an inline external table (which returns results without saving them as an actual table):

external table (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
    ,addr_contents varchar(1000)
)
from my_test_connector
target 'file "sample.json"
,conv_error_name "TEST_S3", ignore_invalid_records
,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,lastName
    ,age
    ,prettyprint(address)
"';

Pretty printing with JSON files

Now we can see the contents of the address and can pull things from there into their own columns. Try it for the phone number; you will need to add “[]” as it’s an array.
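
For example, a sketch only, assuming the array path syntax described above applies to our sample file in the same way (expect one output row per phone number entry):

external table (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,phone_type varchar(10)
    ,phone_number varchar(20)
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,phoneNumber[].type
    ,phoneNumber[].number
"';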
A tip for using data from external tables: the connector still needs to connect to S3 and parse the JSON data on the fly, which can be slow given the number of variables involved (connection speed, proxies, amount of data and so on). That slow access won’t be pleasant when we come to use the data, so what we can do is create a view image of this table.

create view <your_schema>.<your_view> as
    select * from <your_schema>.<your_table>;
create view image <your_schema>.<your_view>;

The view image is essentially a snapshot of the external table data in memory, which can be renewed for newer data by simply recreating it. This can be done manually, but it’s recommended that you create a bash script to submit the query via wxsubmit and then schedule it to run hourly/daily/weekly via something like cron. It’s also a good idea to do any cleaning or transforming at this view creation stage, rather than a plain “select *”, so that the image is ready to use.
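
The refresh itself only needs a couple of statements, so the script submitted via wxsubmit can be as simple as the following (a minimal sketch; it assumes the image can be dropped and rebuilt in place with drop view image):

drop view image <your_schema>.<your_view>;
create view image <your_schema>.<your_view>;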

Next time we’ll use this with a much larger data set including visuals with Tableau and insights with external scripts.

The loneliest railway station in Britain

18 May 2017
Posted By : Graeme Cole

In my last blog post, I introduced Kognitio’s ability to flatten complex JSON objects for loading into a table. Today we’ll look at another example using real-world Ordnance Survey data. We’ll also look at what you can do if the JSON files you need to load are in HDFS. We’ll use these techniques to solve the following example problem…

Which railway station in mainland Britain is the furthest straight-line distance from its nearest neighbour? The fact that the answer is Berwick-upon-Tweed may surprise you!


External tables and scripts and user mapping

27 Apr 2017
Posted By : Ben Cohen
external scripts, tables, user mapping, sql query
Categories :Blog, Kognitio How To

Introduction

Kognitio has two mechanisms for external programs to provide or operate on data during an SQL query: external tables and external scripts. These can be written in the user’s preferred language, for example R, Python, shell or even C.

External tables are used like normal tables in SQL queries but the rows are supplied by an external “connector” program rather than being stored by WX2. The connector could generate the rows itself (TPC-H data, for example) or fetch them from another database, HDFS or an internet data source.

External scripts are table functions in SQL which take a subquery “argument” and return a table. They allow users to create complex analytics or achieve functionality that the database itself doesn’t provide. The script is run under the “script environment” interpreter with the subquery rows as input, and the output rows will be returned to the SQL query.

Here is an example of an external table and an external script. There are more examples in
wx2/current/examples/excon_* and wx2/current/examples/exscripts.sql under your WX2 install directory.

-- External table to return the current environment for each connector instance.
create connector EXCON_ENV
    -- adjust this path if required
    command '/opt/kognitio/wx2/current/examples/excon_env';
create external table ENV (CONNECTOR_NUM int, 
                           NAME varchar(10000),
                           VALUE varchar(10000))
    from EXCON_ENV;

-- Results: environment of each connector
--          e.g. (0, 'WX2_TNO', '629'), (0, 'WX2_METHOD', 'LOAD'), ...
select * from ENV;

-- External script to perform an arbitrary regular expression substitution
-- using the stream editor, sed.
create script environment SED command '/bin/sed -f';
create external script REGEXP environment SED
    receives (TEXT varchar) input 'format "%{quoted=0}s\\n"'
    sends (SUBST_TEXT varchar) output 'format "%0c\\n"'
    script 's/ko\(.*\)/i\1n/';

-- Results: ('ignition')
external script REGEXP from (select 'kognitio');

Kognitio on Hadoop

The following sections on user mapping apply only to Kognitio running in appliance mode.

For Kognitio on Hadoop, external connectors and scripts currently run as the same user that runs the Kognitio on Hadoop processes. This is expected to change in the future to a broadly similar scheme tailored to Hadoop.

Why map users?

The examples above are simple, so it doesn’t matter which Linux user runs the external script or connector processes, but in practice it can matter. You might want your script to run as a user that has already been set up, perhaps with SSH keys or a uid needed for NFS.

It is also preferable not to execute user-supplied code as a privileged user – or in this case one with access to database files.

Kognitio chooses the Linux user based on the Kognitio user and the query context. By default – unless it has been configured otherwise as described below – it will use the user wxextern, which is created when WX2 is installed. (But wxextern is not created automatically by upgrade from versions earlier than 8.1.)

Allowed Linux users

Kognitio doesn’t simply allow scripts or connectors to run as any Linux user. By default only wxextern is allowed. Any other user must be allowed by name or uid (user id) by editing the “[security]” section of the config file as follows.

A Linux user is allowed if:

  1. The user name is listed in the config file setting “ext_users_allowed”, a comma-separated list of user names; or
  2. The config file settings “min_ext_user_id” and “max_ext_user_id” are set to positive numbers with min_ext_user_id <= uid <= max_ext_user_id, and the user is not listed in “ext_users_denied”, a comma-separated list of user names.

(By default ext_users_allowed is “$extern” and ext_users_denied is “root“. Strictly, in (2) only one of min_ext_user_id and max_ext_user_id needs to be non-negative, with the inequality ignored for the negative value.)

You need to restart WX2 (“wxserver start“) after changing any of these config file settings for them to take effect.

User map strings

Kognitio attempts to map from a Kognitio user to a Linux user using a user map string. This is a comma-separated list of any of the following elements:

User map element   Linux user to match
name               User name
&                  Same as Kognitio user (e.g. SYS)
$extern            Kognitio extern user, by default wxextern
$admin             Kognitio admin user, by default wxadmin
$master            Kognitio master user, by default wxroot
$owner             Kognitio owner user, by default root

It will try each name in turn and will use the first one which, when converted to lower case, gives the name of a Linux user which is allowed by the security parameters. If no valid match is found then it will fail with the error “ET0111: User mapping for the external operation was invalid”.

For example, for the user map string “kognitio,&,$extern” for a Kognitio user called MEPHISTOPHELES it will try Linux users kognitio, mephistopheles and wxextern.

Note that this is evaluated separately on each node, so it is recommended that the Linux user names and uids are synchronised across the cluster. Otherwise it is possible for different user names to be matched, the same name being matched but the uids differing, or names being matched on only a subset of nodes (the ET0111 error will be raised in this case).

If an ET0111 error is raised the Kognitio user and the user map string are logged, as of version 8.2, in the server debug log file (look at `wxlogd smd`/serverdbg.*), which should be helpful to see why it didn’t work. For example:

T_2017-04-21_09:06:12_BST: ET: Invalid user mapping (ET0111) on 'hp-rack2-enc2-11' for 'SYS' -> 'kognitio,&,$extern'

Which user map string is used?

When Kognitio is about to launch an external script or connector it obtains a user map string. It first looks in IPE_EXTERNAL_USER_MAP for a user map string based on the execution context; if it doesn’t find one then it uses the config file setting “[boot options] default_usermap”, which defaults to “&,$extern”. It then tries to map the current Kognitio user to a Linux user as described above.
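
To see which mappings are currently defined you can query the table directly, for example with the sketch below (the SYS schema location and the USERMAP column name are assumptions, based on the other IPE_ system tables and the syntax shown further down):

select object_type, object_id, user_id, usermap
from sys.ipe_external_user_map;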

For external scripts the search is performed in this order matching against the columns (OBJECT_TYPE, OBJECT_ID, USER_ID) of IPE_EXTERNAL_USER_MAP.

OBJECT_TYPE                OBJECT_ID              USER_ID
PRIVDOMAIN_EXT_SCRIPT(13)  Script Id              User Id
PRIVDOMAIN_EXT_SCRIPT(13)  Script Id              -1
PRIVDOMAIN_SCRIPT_ENV(10)  Script Environment Id  User Id
PRIVDOMAIN_SCRIPT_ENV(10)  Script Environment Id  -1
PRIVDOMAIN_EXT_SCRIPT(13)  -1                     User Id
PRIVDOMAIN_EXT_SCRIPT(13)  -1                     -1
-1                         -1                     User Id
-1                         -1                     -1

For external tables the search is performed in this order matching against the columns (OBJECT_TYPE, OBJECT_ID, USER_ID) of IPE_EXTERNAL_USER_MAP.

OBJECT_TYPE                OBJECT_ID     USER_ID
PRIVDOMAIN_TABLE(3)        Table Id      User Id
PRIVDOMAIN_TABLE(3)        Table Id      -1
PRIVDOMAIN_CONNECTOR(9)    Connector Id  User Id
PRIVDOMAIN_CONNECTOR(9)    Connector Id  -1
PRIVDOMAIN_TABLE(3)        -1            User Id
PRIVDOMAIN_TABLE(3)        -1            -1
-1                         -1            User Id
-1                         -1            -1

The administrator can add and remove mappings in IPE_EXTERNAL_USER_MAP using this special syntax.

CREATE EXTERNAL USER MAP
  [OBJECT_TYPE=[EXTERNAL SCRIPT|SCRIPT ENVIRONMENT|EXTERNAL TABLE|CONNECTOR],]
  [OBJECT_ID = 1234,]
  [USER_ID = 1234,]
  USERMAP = '<user map>';

DROP EXTERNAL USER MAP
  [OBJECT_TYPE=[EXTERNAL SCRIPT|SCRIPT ENVIRONMENT|EXTERNAL TABLE|CONNECTOR],]
  [OBJECT_ID = 1234,]
  [USER_ID = 1234];
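
For example, a sketch of a mapping that makes every external table using a particular connector (with a made-up object id of 1234) use the user map string from the earlier example:

CREATE EXTERNAL USER MAP
  OBJECT_TYPE = CONNECTOR,
  OBJECT_ID = 1234,
  USERMAP = 'kognitio,&,$extern';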

Simple performance checks against your hardware cluster

24 Mar 2017
Posted By : Simon Darkin
performance hardware cluster, cpu, benchmarks

Kognitio have a lot of experience commissioning clusters of new hardware for our MPP software product. As part of that process, we’ve developed a number of steps for validating the performance of new clusters, and these are the topic of this blog entry.

 

There are many Linux-based benchmarking tools on the market; however, they are not usually installed by default, in which case some simple command line tools can be used to quickly establish whether there is a potential hardware issue that warrants further investigation. The following hardware components are covered by this topic:

  • CPU
  • Disk
  • Networking
  • RAM

 

CPU   

A slow CPU or core could have an adverse effect on query performance, so basic command line tools can be used to help identify laggards. A background ‘for’ loop can be employed to ensure all cores/threads are tested simultaneously.

 

Integer arithmetic test

Invoked 8 times to run simultaneously against 8 cores

 

for i in `seq 1 8`; do time $(i=0; while (( i < 999999 )); do (( i ++ )); done)& done; wait

 

This will return the time taken to increment an integer over the specified range. A comparison of the time taken by each core will help identify outliers.

 

real    0m8.908s
user    0m8.765s
sys     0m0.140s

real    0m8.943s
user    0m8.789s
sys     0m0.156s

real    0m8.997s
user    0m8.761s
sys     0m0.112s

real    0m9.000s
user    0m8.853s
sys     0m0.144s

real    0m9.023s
user    0m8.881s
sys     0m0.140s

real    0m9.028s
user    0m8.861s
sys     0m0.168s

real    0m9.034s
user    0m8.857s
sys     0m0.176s

real    0m9.073s
user    0m8.781s
sys     0m0.156s

 

Whilst the test is running you can check that each core is under load by running top and expanding the output to show all cores.  If you do encounter outliers in the arithmetic test then you can use the output from top to identify which core(s) remain busy when others have finished

 

Cpu0  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu1  : 99.0%us,  1.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu2  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu3  : 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu4  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu5  : 98.0%us,  2.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu6  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu7  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 

Compression test

As with the arithmetic test this example loops around 8 times so that 8 cores are tested simultaneously.  Data is written to /dev/null to avoid any overhead associated with disk IO.

 

for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait

 

This will return the rate at which each core is able to compress 1 GB of data.

 

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9545 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9799 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.9831 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.0085 seconds, 83.3 MB/s

1000000000 bytes (1.0 GB) copied, 12.0382 seconds, 83.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.2655 seconds, 81.5 MB/s

 

With Kognitio software installed you can use the wxtool command  to run the compression test simultaneously against all database nodes to aid comparison across the cluster as a whole.  You can download the software for free at http://kognitio.com/free-download/

 

wxtool -a '{can DB}' -S 'for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait'

 

For node kap1-1 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.9422 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9659 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 11.9876 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.0142 seconds, 83.2 MB/s

1000000000 bytes (1.0 GB) copied, 12.1293 seconds, 82.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3754 seconds, 80.8 MB/s

1000000000 bytes (1.0 GB) copied, 12.4132 seconds, 80.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.4386 seconds, 80.4 MB/s

For node kap1-3 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.8398 seconds, 84.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.8661 seconds, 84.3 MB/s

1000000000 bytes (1.0 GB) copied, 11.8893 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9165 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.946 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.953 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9637 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.2996 seconds, 81.3 MB/s

For node kap1-3 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.8757 seconds, 84.2 MB/s

1000000000 bytes (1.0 GB) copied, 11.8846 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9178 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9243 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9377 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9834 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3367 seconds, 81.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.3942 seconds, 80.7 MB/s

For node kap1-4 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.91 seconds, 84.0 MB/s

1000000000 bytes (1.0 GB) copied, 11.9291 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9448 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9498 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.1232 seconds, 82.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.3896 seconds, 80.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.4449 seconds, 80.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.4504 seconds, 80.3 MB/s

 

 

 

DISKS

Having just one underperforming disk in the system can significantly impact query performance against disk based tables. Here are some simple tests to help identify any anomalies.

 

Iterative write speed test with dd.

 

WARNING: THIS WRITE TEST WILL OVERWRITE DATA AT THE SPECIFIED DESTINATION AND SO MAKE SURE YOU AVOID AREAS THAT CONTAIN SYSTEM DATA OR USER DATA THAT YOU WISH TO KEEP

 

for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done

 

This will return the duration and the rate at which data can be written out to disk. In this example 1 GB of data is repeatedly written to a raw partition. Note that fsync is used to flush the writeback cache and ensure data is written to the physical media.

 

Loop 1
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 13.6466 seconds, 73.3 MB/s

Loop 2
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.8324 seconds, 77.9 MB/s

Loop 3
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.4271 seconds, 80.5 MB/s

 

With Kognitio software installed, the test can be expanded to run on all database nodes allowing for easy comparison of all disks in the system

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done'

 

Iterative read speed test with dd

 

for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done

 

This will return the rate at which data can be read from disk. In this example 1 GB of data is being read from a raw partition, adjusting the offset and flushing the buffer on each iteration to ensure data is being read from the physical media.

 

Loop 1 - skip = 5000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 14.4355 seconds, 69.3 MB/s

Loop 2 - skip = 10000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.9884 seconds, 77.0 MB/s

Loop 3 - skip = 15000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.6045 seconds, 79.3 MB/s

 

With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done'

 

Iterative read speed test with hdparm

 

for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done

 

This will return the rate at which data can be read sequentially from disk without any file system overhead.

 

Loop 1
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  236 MB in  3.01 seconds =  78.40 MB/sec

Loop 2
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  236 MB in  3.02 seconds =  78.09 MB/sec

Loop 3
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  230 MB in  3.01 seconds =  76.30 MB/sec

 

With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done'

 

Disk based table scan

 

If the cluster is running Kognitio database software you can initiate a scan of a large disk based table and review the output from wxtop in order to spot any disk store processes that remain busy for a significant period after others have finished.   For accurate results you should ensure there is no concurrent activity when performing this test.

 

select *
from <large disk based table>
where <condition unlikely to be true>;

 

Monitor the output from wxtop and look out for any disk store processes that remain busy when all or most others have finished.

 

PID       NODE        PROCESS                           SIZE      TIME
15784       kap1-1      WXDB(55): Diskstore             258036       100
22064       kap1-2      WXDB(18): Diskstore             257176        86
25179       kap1-3      WXDB(73): Diskstore top         258200        84
31237       kap1-4      WXDB(37): Diskstore             258068        77

 

If a disk store process does appear to lag behind, then you should eliminate the possibility of it being attributable to data skew by checking the row counts across all of the disks using the following query

 

select
mpid,
sum(nrows) nrows
from ipe_ftable
where table_id = <table_id being scanned>
group by 1
order by 2 desc;

 

 

NETWORKING

You can test the network links between nodes using some simple netcat commands.  This will allow you to spot links that are underperforming.

 

Link speed test using dd and netcat

 

The name and options associated with the netcat binary will depend on the Linux installation; however, with Kognitio software installed you can use wxnetread and wxnetwrite for the data transfer regardless.

 

Set up a listening process on the node performing the read:

 

netcat -l -p 2000 > /dev/null &

 

Use dd to generate some data and pipe it through netcat to the IP and port of the node performing the read:

 

dd if=/dev/zero bs=1000 count=1000000 | netcat 14.4.21.2 2000

 

This will return the rate at which data can be copied over the link.

 

1000000000 bytes (1.0 GB) copied, 8.54445 seconds, 117 MB/s

 

The same test as above, this time using wxnetread/wxnetwrite:

 

wxnetread -l -p 2000 > /dev/null &

 

dd if=/dev/zero bs=1000 count=1000000 | wxnetwrite 14.4.21.2 2000

 

1000000000 bytes (1.0 GB) copied, 8.5328 seconds, 117 MB/s

 

Shape tests

With Kognitio software installed you can run Shape tests to measure the speed at which RAM based data is re-distributed between nodes

 

wxtester -s <dsn> -u sys -p <password> -Ishape 5000 9000 1

 

Once the tests have been running for a few minutes you can navigate to the logs directory and check the data rate

 

cd `wxlogd wxtester`
grep TSTSHN results | gawk '{ if ($5==64) print ((($3*$5)/$6)/<number of database nodes>)/1048576 }'

 

With older generation hardware you can expect to see performance of 40MB/s/node given sufficient network bandwidth. With newer hardware, for example HP Gen9 servers with 2x 56Gb/s links per node, this increases to 90MB/s/core.

 

 

RAM

Benchmarking RAM performance is best left to a dedicated test suite; however, you can perform a very simple write/read speed test using dd in conjunction with a temporary file storage facility in RAM, which at the very least can show up a mismatch in performance between nodes.

 

Write and read speed test using dd

 

mkdir RAM
mount tmpfs -t tmpfs RAM
cd RAM
dd if=/dev/zero of=ram_data bs=1M count=1000

 

This will return the rate at which data is written to RAM.

 

1048576000 bytes (1.0 GB) copied, 1.06452 seconds, 985 MB/s

 

dd if=ram_data of=/dev/null bs=1M count=1000

 

This will return the rate at which data is read from RAM.

 

1048576000 bytes (1.0 GB) copied, 0.6346 seconds, 1.7 GB/s

 

Participate in the Kognitio Console beta test program

10 Mar 2017
Posted By : Michael Atkinson
Kognitio console, beta test program
Categories :Blog, Kognitio How To

Kognitio Console is Kognitio’s client side management program for the Kognitio Analytical Platform.

Some of its features are:

  • It allows inspection of the metadata tree for schemas, tables, views, external scripts, users, connectors, sessions, queues, etc.
  • It also gives an object view to each of these metadata objects, allowing their inspection and management.
  • There are also lots of tools, wizards and widgets to browse data in Hadoop, load and unload data, identify problem queries and many more.
  • There are also a set of reports and dashboards to monitor the state of Kognitio systems.
  • Macros may be written to extend Kognitio Console; the reports and dashboards are written in these XML macros.
  • Ad-Hoc queries may be executed.
  • KogScripts may be executed and debugged. KogScript is based on Lua but has Kognitio specific enhancements to run SQL natively.
  • It is an integrated development environment (IDE) for KogScripts, and external scripts running in Kognitio.

All this makes Kognitio Console a versatile tool, suitable for database admins, analysts and power users.

Kognitio Console is in constant development; beta and release candidate versions are provided from our update website. Betas and release candidates will be announced on the Kognitio Console forums.

There are two ways of obtaining these betas:

Note that this is beta software, so it has not gone through full QA. However, we endeavour to make sure it has no obvious bugs and it will have passed our Console smoke tests. If you experience crashes or other bugs or deficiencies in these beta and release candidates, please report them to us; this will help us make the next release more stable.

By installing it into a different location than the default, you can have both this and the last stable release installed at the same time.

The differences between KogScript running in Console and kogscript.exe

24 Feb 2017
Posted By : Michael Atkinson
KogScript, Lua, SQL, programming, how to, console window
Categories :Blog, Kognitio How To

KogScript is a general purpose programming language based on Lua. It may be run within Console, which acts as an integrated development environment for it, or run from the command line tool kogscript.exe.

The KogScript implementations running within Console and via kogscript.exe are compiled from the same source files and are identical, with the following exceptions.
 

sqldebug

In kogscript.exe the global variable sqldebug is set to 3. The sqldebug variable is used to control how much information is output on each SQL query within the KogScript as it is executed. A level of 3 shows the query status and table contents.

In Console sqldebug is not defined, so it acts as if it had a value of 0; this means no information is output to the terminal on SQL execution. Instead, Console integrates queries into the script view as info and table displays, and stores them in the script history.

withinConsole

When running within Console the global variable withinConsole is set to true
 

Console integration

When running within Console there are two extra global tables holding data and methods used to integrate scripts with Console (within the table kog) and access Qt objects and methods (within the table qt).

An example of how to use the qt methods may be found in the Examples tool as “Using Qt from Kog Scripts”. This shows how to open Qt dialogs, and use Qt grids and views.

The kog global table holds functions to access the main elements of the Console application user interface. From those it is possible to access the rest of the user interface. There are about 100 classes and thousands of objects available. To find out which methods, variables and enumerations are available for one of the kog functions use list(), as in:

mw = kog.mainWindow()
list(mw)

Which gives output like:

 

Loader and Unloader

To run either of these from KogScript, the script needs to be run within Console as they make use of Console integration to run the Loader or Unloader tools built into Console. The example below shows how to run Unloader from within Console.

-- Get a reference to the console main window.
mw = kog.mainWindow();

config =[[
FILENAME=C:/TEMP/simplevector_test.csv
TABLE="MIKE"."A_SIMPLE_VECTOR"
AUTOUNLOAD=1
SILENT=1]];

-- Now run the unloader using the configuration above. 
mw:runUnloader(config)

 

Conclusion

KogScript within Console and the kogscript.exe command line tool are identical as far as running SQL queries and the Lua language are concerned.

Where they differ is that within Console there are extra features to enable integration with Console itself.

How to generate an identity / auto_increment column in Kognitio

17 Feb 2017
Posted By : Michael Atkinson
Categories :Blog, Kognitio How To

Occasionally we would like the value of the primary key field to be created automatically every time a new record is inserted. That is, we would like an auto-incrementing field in a table.

What other databases do

Other databases use a variety of syntax to do this.

MySQL uses an AUTO_INCREMENT keyword, which is used within the table definition

CREATE TABLE fruit (
         ID int NOT NULL AUTO_INCREMENT,
         name varchar(255) NOT NULL,
         PRIMARY KEY (ID) );

 

The SQL Server syntax uses the IDENTITY keyword, again within the table definition, in this example starting at 1 and incrementing by 1.

CREATE TABLE city (
         ID int IDENTITY(1,1),
         name varchar(255) NOT NULL,
         PRIMARY KEY (ID) );

 

For Oracle you first need to create a sequence and then use its nextval attribute during inserts.

CREATE SEQUENCE seq_people
        MINVALUE 1
        START WITH 1
        INCREMENT BY 1
        CACHE 10;
INSERT INTO People (ID,FirstName,LastName)
VALUES (seq_people.nextval,'Mike','Atkinson');

 

Kognitio’s row_number syntax.

Kognitio does not have a way of creating a unique value for a field directly in the table definition; instead it has two ways of adding unique values to result sets, which may then be inserted into a table. The first uses the ROW_NUMBER function.

The ROW_NUMBER function assigns to each row in the partition or query a sequence number starting from one. This number is only meaningful in the context of the result set; if the result changes, the ROW_NUMBER will change. The ROW_NUMBER expression takes an OVER operator containing an ORDER BY on the column to be used for the row numbering.

In the example below we use ROW_NUMBER to replace the subject name with an id value which is a foreign key reference into the subjects table.

set schema yourschema;
create table facts_origin (subject varchar, predicate varchar, object varchar);
create table subjects (id int, name varchar not null, primary key(id) );
create table facts (subject int, predicate varchar, object varchar, 
               foreign key (subject) references subjects(id) );

insert into facts_origin values ('Peter', 'Brother', 'Sarah'), 
                                                           ('Peter', 'Brother', 'Anne'), 
                                                           ('John', 'Father', 'Peter');


insert into subjects
select with
    ids(name) as (select distinct subject from facts_origin)
select row_number() over (order by name) as "id", name  from ids;

insert into facts
    select s.id, f.predicate, f.object
        from facts_origin f, subjects s
        where f.subject=s.name;

select * from subjects;
select * from facts;
Which gives a subjects table of:

ID NAME
1  John
2  Peter

And a facts table of:

SUBJECT PREDICATE OBJECT
1       Father    Peter
2       Brother   Anne
2       Brother   Sarah

 

Kognitio’s generate_key syntax.

For an individual SELECT statement the GENERATE_KEY function generates a unique key for each row. These rows may then be inserted into a table.

The result is an INT8 data type with a value greater than or equal to zero. The results are not typically contiguous, but are guaranteed to be unique for an individual SELECT statement. The values generated are dependent upon the number of Kognitio nodes and the distribution of the data. Rerunning a query may not generate the same results. Multiple occurrences of GENERATE_KEY in a SELECT list will all produce the same result within a single row.

GENERATE_KEY can only be used in the SELECT list; it cannot be used in WHERE, HAVING, GROUP BY or ORDER BY clauses. Within the SELECT list you can perform arithmetic on the columns containing the GENERATE_KEY function.

insert into subjects
select with
    ids(name) as (select distinct subject from facts_origin)
select generate_key(), name  from ids;

The Kognitio SQL Guide has examples of good practice for ensuring that additional INSERT statements have unique keys and how to use them in ETL processes.
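
One such pattern, sketched here with a hypothetical staging table new_subjects, is to offset the generated keys by the current maximum id so that repeated INSERTs remain unique (the addPred function later in this post uses the same idea):

insert into subjects
select generate_key() + mk + 1, name
from new_subjects, (select max(id) from subjects) dt(mk);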

 

A more complete example

The following example shows many techniques: it uses both the row_number and generate_key functions and illustrates how to use them to add entries to existing tables.

It also illustrates the use of KogScript functions, KogScript variables and a more complex “Grandparent” query.

set schema yourschema;

-- drop tables, they may have been created by previous runs of the script
drop table facts_origin;
drop table facts;
drop table entities;
drop table preds;

-- create tables to use
create table facts_origin (subject varchar, predicate varchar, object varchar);
create table entities (id int, name varchar not null, primary key(id) );
create table preds (id int, name varchar not null, primary key(id) );
create table facts (subject int, predicate int, object int,
               foreign key (subject) references entities(id),
               foreign key (predicate) references preds(id),
               foreign key (object) references entities(id) );

-- add some facts, a small genealogy
insert into facts_origin
     values ('Peter', 'Brother', 'Sarah'), ('Peter', 'Brother', 'Anne'), ('John', 'Father', 'Peter');
insert into facts_origin
     values ('Peter', 'Father', 'James'), ('Jane', 'Mother', 'Peter'), ('Mary', 'Mother', 'John');

-- create the entities table
insert into entities
select with
    ents(name) as (select distinct subject from facts_origin union select object from facts_origin)
select row_number() over (order by name) as "id", name  from ents;


-- create the preds table
insert into preds
select with
    ids(name) as (select distinct predicate from facts_origin)
select generate_key(), name  from ids;

-- create the facts table, this uses references to the entities and preds tables
insert into facts
    select s.id, p.id, o.id
        from facts_origin f, entities s, preds p, entities o
        where f.subject=s.name and f.predicate=p.name and f.object=o.name;

-- check the tables are what we expect
select * from entities;
select * from preds;
select * from facts;

-- Query for all facts where 'Peter' is the subject
with
  -- select the subject id
  s(id) as (select id from entities where name='Peter'),
  -- Filter the facts table for that subject id
  f(subject, predicate, object) as (select subject, predicate, object from facts, s 
                        where subject in (s.id) )
  -- Then join the filter results with the subjects table to get the subject name
select s.name as "SUBJECT", p.name, o.name as "OBJECT"
    from f, entities s, preds p, entities o 
    where f.subject=s.id and f.predicate=p.id and f.object=o.id;


-- Use a function to add entities if they do not already occur in the entities table.
function addEntity(p)
   local t = select id from entities where name='$p';
   if #t.rows~=1  then
       insert into entities
           select mk + 1, '$p' from (select max(id) from entities) dt(mk);
                 t = select id from entities where name='$p';
       assert(#t.rows==1)
   end
   return t.rows[1][1];
end

-- Use a function to add predicates if they do not already occur in the preds table.
function addPred(p)
   local t = select id from preds where name='$p';
   if #t.rows~=1  then
       insert into preds
           select generate_key() + mk + 1, '$p' from (select max(id) from preds) dt(mk);
                 t = select id from preds where name='$p';
       assert(#t.rows==1)
   end
   return t.rows[1][1];
end


-- adding facts
function addFact(s,p,o)
    local sub = addEntity(s) -- use local to avoid polluting the global namespace
    local pred = addPred(p)
    local obj = addEntity(o)
    local t = select count(*) as "count" from facts
                               where subject=$sub and predicate=$pred and object=$obj;
    if #t.rows~=1 or t.rows[1].count==0 then
         insert into facts values ($sub,$pred,$obj);
    end

end

addFact('Garry', 'Father', 'Mary')


-- grandparent view.

parent = addPred('Parent')
grandparent = addPred('Grandparent')

-- Now form the query as a view, so that it may be reused easily
create view grandparent as
select with
    -- Define the father_of subquery
    father_of(s,p,o) as  (select subject, predicate, object from facts
                 where predicate in (select id from preds where name='Father') ),
    -- Define the mother_of subquery
    mother_of(s,p,o) as (select subject, predicate, object from facts
                 where predicate in (select id from preds where name='Mother') ),
    -- Define the parent_of subquery, this uses the Parent predicate id we have defined above
    parent_of(s,p,o) as  (select s,$parent,o from father_of union
                 select s,$parent,o from mother_of),
    -- Define the grandparent_of subquery, this uses the Grandparent predicate id we 
    -- have defined above
    grandparent_of(s,p,o) as (select p.s, $grandparent, c.o from parent_of p, parent_of c where p.o=c.s)
-- Now output the result in a human readable form
select s.name as "SUBJECT", p.name as "PREDICATE", o.name as "OBJECT"
    from grandparent_of f, entities s, preds p, entities o 
    where f.s=s.id and f.p=p.id and f.o=o.id;


-- Now we can perform various queries
-- Who are John's grandchildren?
select object as "Grandchild" from grandparent where subject='John';

-- Who are Peter's grandparents?
select subject as "Grandparent" from grandparent where object='Peter';

-- Is Jane the grandparent of James?
select 'Yes' from grandparent where subject='Jane' and object='James';

And here is the entity diagram for the facts table

 

Summary

Various databases have different ways of creating unique values in a field. The Kognitio Analytical Platform has two methods, GENERATE_KEY and ROW_NUMBER, both of which generate unique values in a subquery that may then be inserted into a table, giving a column in the table with unique values as a result.

Kits, cats, sacks, wives: unfolding JSON data into tables

14 Feb 2017
Posted By : Graeme Cole
unfolding json data into tables, csv
Categories :Blog, Kognitio How To

Each wife had seven sacks…

Converting simple, flat JSON data into CSV or another tabular format is a problem addressed by many freely-available JSON-to-CSV converters. However, these are usually only useful when the JSON is a series of flat objects without any nested arrays.

In practice, real-world JSON files often have nested arrays containing more objects, each of which may contain arrays of their own. They tend to look like the old riddle As I was going to St Ives:

As I was going to St. Ives,
I met a man with seven wives.
Each wife had seven sacks,
Each sack had seven cats,
Each cat had seven kits.
Kits, cats, sacks, wives,
How many were there going to St. Ives?


What’s new in Kognitio Console version 80199s170123

07 Feb 2017
Posted By : Michael Atkinson
What’s new in Kognitio Console
Categories :Blog, Kognitio How To

SQL in expression contexts

Up until this release, SQL could only occur in Lua statements or assignments. For example, in Kognitio Scripts we can write SQL within the script as a statement, or assign the results table to a variable.

drop table hr.employee;
t = select * from sys.ipe_user; 

This release of Kognitio Console adds the ability to place SQL in Lua expression contexts. So now within Kognitio Scripts we can place SQL in most places where a Lua expression may occur.

In the example below the data for the Kognitio users are read from the table sys.ipe_user as a Lua table. The rows element of that table is then passed to the apply function which passes each row in turn to the printRow function. The printRow function prints the NAME element of the row to the console, which is attached to the Kognitio Console Script Variables pane.

function apply(t, fn)
    for i, row in pairs(t) do fn(row) end
end
 
function printRow(row) print(row.NAME) end
 
apply( (select * from sys.ipe_user).rows, printRow) 

This feature currently has the limitation that it is not easy for the parser to determine where the SQL query ends. The code below fails because the ‘,’ character does not end an SQL query, so the parser thinks that “, printRow” is part of the query. The failure occurs at run time, when the SQL is parsed by the Kognitio Analytical Platform and the query select * from sys.ipe_user, printRow errors.

function apply(t, fn)
    for i, row in pairs(t.rows) do fn(row) end
end
 
function printRow(row) print(row.NAME) end
 
apply(select * from sys.ipe_user, printRow) -- ERRORS: , does not end the SQL query

Instead we use ‘;’ to end the SQL expression:

apply(select * from sys.ipe_user; printRow) -- Note use of ; to end SQL

Of particular use while debugging scripts is the form:

-- any expression of the form print(<sql query>), e.g.
print(select * from sys.ipe_user) 

which outputs the formatted query results to Kognitio Console “Script Variables” pane.

Lua lfs module

The Lua File System module adds common file system manipulation commands to Lua. The lfs module is adapted from the Kepler project  and offers a portable way to access the underlying directory structure and file attributes.

lfs.attributes (filepath [, aname | atable])

Returns a table with the file attributes corresponding to filepath (or nil followed by an error message in case of error). If the second optional argument is given and is a string, then only the value of the named attribute is returned (this use is equivalent to lfs.attributes(filepath)[aname], but the table is not created and only one attribute is retrieved from the O.S.). If a table is passed as the second argument, it is filled with attributes and returned instead of a new table. The attributes are described as follows; attribute mode is a string, all the others are numbers, and the time related attributes use the same time reference as os.time:

dev

on Unix systems, this represents the device that the inode resides on. On Windows systems, represents the drive number of the disk containing the file

ino

on Unix systems, this represents the inode number. On Windows systems this has no meaning

mode

string representing the associated protection mode (the values could be file, directory, link, socket, named pipe, char device, block device or other)

nlink

number of hard links to the file

uid

user-id of owner (Unix only, always 0 on Windows)

gid

group-id of owner (Unix only, always 0 on Windows)

rdev

on Unix systems, represents the device type, for special file inodes. On Windows systems represents the same as dev

access

time of last access

modification

time of last data modification

change

time of last file status change

size

file size, in bytes

permissions

file permissions string

blocks

block allocated for file; (Unix only)

blksize

optimal file system I/O blocksize; (Unix only)

This function uses stat internally; thus, if the given filepath is a symbolic link, it is followed (if it points to another link the chain is followed recursively) and the information is about the file it refers to. To obtain information about the link itself, see function lfs.symlinkattributes.

lfs.chdir (path)

Changes the current working directory to the given path.

Returns true in case of success or nil plus an error string.

lfs.currentdir ()

Returns a string with the current working directory or nil plus an error string.

iter, dir_obj = lfs.dir (path)

Lua iterator over the entries of a given directory. Each time the iterator is called with dir_obj it returns a directory entry’s name as a string, or nil if there are no more entries. You can also iterate by calling dir_obj:next(), and explicitly close the directory before the iteration finished with dir_obj:close(). Raises an error if path is not a directory.

lfs.link (old, new[, symlink])

Creates a link. The first argument is the object to link to and the second is the name of the link. If the optional third argument is true, the link will be a symbolic link (by default, a hard link is created).

lfs.lock (filehandle, mode[, start[, length]])

Locks a file or a part of it. This function works on open files; the file handle should be specified as the first argument. The string mode could be either r (for a read/shared lock) or w (for a write/exclusive lock). The optional arguments start and length can be used to specify a starting point and its length; both should be numbers.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.lock_dir (path, [seconds_stale])

Creates a lockfile (called lockfile.lfs) in path if it does not exist and returns the lock. If the lock already exists, it checks whether the lock is stale, using the second parameter (the default for the second parameter is INT_MAX, which in practice means the lock will never be stale). To free the lock call lock:free().

In case of any errors it returns nil and the error message. In particular, if the lock exists and is not stale it returns the “File exists” message.

lfs.mkdir (dirname)

Creates a new directory. The argument is the name of the new directory.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.rmdir (dirname)

Removes an existing directory. The argument is the name of the directory.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.setmode (file, mode)

Sets the writing mode for a file. The mode string can be either “binary” or “text”.

Returns true followed the previous mode string for the file, or nil followed by an error string in case of errors. On non-Windows platforms, where the two modes are identical, setting the mode has no effect, and the mode is always returned as binary.

lfs.symlinkattributes (filepath [, aname])

Identical to lfs.attributes except that it obtains information about the link itself (not the
file it refers to). On Windows this function does not yet support links, and is identical to
 lfs.attributes.

lfs.touch (filepath [, atime [, mtime]])

Set access and modification times of a file. This function is a bind to utime function.

The first argument is the filename, the second argument (atime) is the access time, and the third argument (mtime) is the modification time. Both times are provided in seconds (which should be
generated with Lua standard function 
os.time). If the modification time is omitted, the access time provided is used; if both times are omitted, the current time is used.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.unlock (filehandle[, start[, length]])

Unlocks a file or a part of it. This function works on open files; the file handle should be specified as the first argument. The optional arguments start and length can be used to specify a starting point and its length; both should be numbers.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

Improvements to Kognitio Script execution and debugging

Metatables with a __tostring method have been added to the table returned by an SQL query. This allows formatted printing of those tables.

t = select * from sys.ipe_user;
print(t) --[[ prints the text below:
|NAME         |ID|DFLT_SCNO|SEC_CLASS|SCH_CLASS|STATUS|PTIME              |
|SYS          |0 |0        |0        |0        |0     |2016-12-21 11:19:57|
|GRP_LOG      |3 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_MONITOR  |1 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_DISKUSAGE|4 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_DBG      |2 |0        |0        |0        |8     |2016-12-21 11:20:15|
]]

This may also be used in the kogscript.exe command line tool.

The SQL error string is added to the table returned from the SQL query, and this is also printed out by the metatable __tostring method

t = select * from nonexistent;
if t.status ~= 0 then 
    print("Query failed: " .. t.error) --[[ prints the text below:
    Query failed: ERROR: [Kognitio][WX2 Driver][latest] CI3013: Table SYS.NONEXISTENT does not exist, offset 15
    ]]
end
 
print(t) --[[ prints the text below:
ERROR: [Kognitio][WX2 Driver][latest] CI3013: Table SYS.NONEXISTENT does not exist, offset 15
]]

Breakpoints may now also be placed on Lua statements within the Kognitio Script, even if the “Trace Kog Script Lines” option is not set. After the breakpoint is hit, single stepping then steps through each Lua line.

Timestamps are now converted to a table with year, month, day, hour, min and sec fields:

t = select * from sys.ipe_allview_img;
list(t.rows[1].CREATE_TIME) --[[ prints the text below:
lua::number day = 16
lua::number min = 45
lua::number month = 1
lua::number year = 2017
lua::number hour = 10
lua::number sec = 34
]]

Playing Around With Kognitio On Hadoop: Hashes Cannot Always Anonymise Data

30 Jan 2017
Posted By : Chris Barratt
kognitio data scripts how to

In this brave new world of big data one has to be careful of the loss of privacy with the public release of large data sets. Recently I reread “On Taxis and Rainbows”, which presents an account of how the author came to de-anonymise a data set made available by New York City’s Taxi and Limousine Commission. This data set included logs of each journey made: locations and times of pickups and drop-offs along with supposedly anonymised medallion numbers and other data. The author came to identify the medallion number data as the MD5 check-sums of the medallion numbers. They were then able to compute a table mapping 22 million possible medallion numbers to their MD5 check-sums. They reported that this took less than 2 minutes. This table allowed them to start ‘de-anonymising’ the data.

The article identified that this was possible because of the nature of the data: the set of all possible medallion numbers was known and small enough for a table mapping each value to its check-sum to be generated in a feasible amount of time.

Computing a large number of check-sums should be easy for a Kognitio system to handle. As several check-sum functions are implemented as SQL functions in the Kognitio system, I wondered how quickly Kognitio could generate a table mapping some potentially personally identifying information to check-sums using the functions.

As every UK address has a postcode, I thought postcodes would be a good candidate for a piece of personally identifying information. They are roughly equivalent to US ZIP codes. It seems plausible to me that someone might attempt to anonymise postcodes using check-sums, so I thought I would try using Kognitio to compute tables of postcodes and their check-sums to see how quickly such an anonymisation strategy might be broken.

I found an almost complete list of valid postcodes in the “Code-Point Open” data set. Handily, this is freely available from the Ordnance Survey under the Open Government Licence. The data set contains about 1.7 million records each with a postcode field and other fields for related geographical data. The field titles are:

Postcode,Positional_quality_indicator,Eastings,Northings,Country_code,NHS_regional_HA_code,NHS_HA_code,Admin_county_code,Admin_district_code,Admin_ward_code

The data set is provided as a zip archive containing 120 CSV files with a total uncompressed size of 154MiB.

Timing The Kognitio System Check-sum Functions

For the check-summing task I created a new Kognitio on Hadoop system (the 'Kognitio system' from here on) to run on a small development Hadoop cluster that I have access to. I copied the CSV files into a new HDFS directory on the cluster running the Kognitio system.

For all the queries/SQL commands I used the 'sys' superuser of the Kognitio system. I was the only person using the system, so there was no need to protect the data from anyone but myself, and I could avoid setting up unneeded users and privileges.

I created an external table to give the Kognitio system access to the CSV data on HDFS. This kind of table can't be written to but can be queried like any other 'regular' table on the Kognitio system. Then I created a view on the external table to project only the postcodes, and followed this by creating a view image on the view to pin the postcode-only rows in memory. It is faster to access rows in memory than anywhere else.

Then I created several ram-only tables to store the postcodes with their check-sums. Ram-only tables, as the name suggests, do not store row data on disk, so using them avoids any slow disk-write overheads.

Finally I ran and timed a series of insert-select queries that computed the check-sums for our postcodes and inserted them into the ram-only tables.

Some Background On External Tables, Connectors and Plugin Modules

Kognitio’s external tables present data from external sources as a set of records of fields using a connector. The list of fields and the connector are specified by the create external table statement.

The connector in turn specifies the data provider to use, along with any connector-specific configuration parameters required. The data provider implements the interface to the external data source for the internals of the Kognitio system. Data providers are also known as connectors, but I use the term 'data provider' here to avoid confusion between the two.

The data provider that I used to access the HDFS data is implemented as a plugin connector provided by a plugin module. Plugin modules are how the Kognitio system implements some features that are not in the core code. They must be loaded and activated to make their features available to the Kognitio system, and they can also expose parameters for any module-specific configuration required.

Creating The External Table: Making the Code-Point Open Data Accessible to the Kognitio System

The external table that I needed to access the data on the HDFS system required a connector that used the HDFS data provider. This is implemented as a plugin connector from the Hadoop plugin module. So I loaded the Hadoop plugin module, made it active and set the libhdfs and hadoop_client plugin module parameters.

create module hadoop;
alter module hadoop set mode active;
alter module hadoop set parameter libhdfs to '/opt/cloudera/parcels/CDH/lib64/libhdfs.so';
alter module hadoop set parameter hadoop_client to '/opt/cloudera/parcels/CDH/bin/hadoop';

The parameters indicate to the HDFS data provider where the libhdfs library and hadoop executable are on my Hadoop cluster.

Then I could create the connector that was needed:

create connector hdfs_con source hdfs target ' namenode hdfs://172.30.251.105:8020, bitness 64 ';

The create connector statement specifies where the namenode is and that the connector should use 64-bit mode, which is the best option.

Then I could create the external table:

create external table OSCodePoints
(
PostCode char(8),
Positional_quality_indicator varchar(32000),
Eastings varchar(32000),
Northings varchar(32000),
Country_code varchar(32000),
NHS_regional_HA_code varchar(32000),
NHS_HA_code varchar(32000),
Admin_county_code varchar(32000),
Admin_district_code varchar(32000),
Admin_ward_code varchar(32000)
)
from hdfs_con target
'file "/user/kodoopdev1/OSCodePoint2016Nov/*.csv"'
;

The ‘file’ parameter specifies which HDFS files contain the data I want to access.
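
As a quick sanity check that the external table can actually read the files (an illustrative query rather than part of the original walk-through), a simple count should return roughly 1.7 million rows:

select count(*) from OSCodePoints;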

I created a view with a view image:

create view PostCodes as select PostCode from OSCodePoints;
create view image PostCodes;

The view projects just the postcode column out. Creating the view image pins the results from the view into RAM. All queries on the view read data from the view image and ran faster as a result.
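
One way to confirm that the image has been built (an illustrative check; the exact columns returned will depend on the Kognitio version) is to query the system table that tracks view images:

select * from sys.ipe_allview_img;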

I generated check-sums using the md5_checksum, sha1_checksum, sha224_checksum, sha256_checksum, sha384_checksum and sha512_checksum functions. These are all implemented as plugin functions provided by the hashes plugin module. This was loaded and activated:

create module hashes;
alter module hashes set mode active;
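
As a quick smoke test that the plugin functions are available (illustrative only; the postcode prefix is arbitrary and this query played no part in the timings below), the functions can be called directly against a few rows:

select PostCode, md5_checksum(PostCode), sha256_checksum(PostCode)
from PostCodes
where PostCode like 'AB1%';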

I ran the check-sum functions with seven different insert-select queries to populate a set of lookup tables that mapped postcodes to their check-sums. Six of the queries each generated the check-sums for one of the check-sum functions. The 7th query used all 6 check-sum functions to generate one lookup table of all 6 check-sums simultaneously; this was done to get an indication of the overhead associated with retrieving and writing rows. The 7 lookup tables were created as ram-only tables, which, as the name implies, hold the row data in RAM and not on disk, removing the overhead of writing rows to disk.

The ram-only table create statements were:

create ram only table PostCodeMD5Sums
(
PostCode char(8),
md5_checksum binary(16)
);


create ram only table PostCodeSHA1Sums
(
PostCode char(8),
sha1_checksum binary(20)
);


create ram only table PostCodeSHA224Sums
(
PostCode char(8),
sha224_checksum binary(28)
);


create ram only table PostCodeSHA256Sums
(
PostCode char(8),
sha256_checksum binary(32)
);


create ram only table PostCodeSHA384Sums
(
PostCode char(8),
sha384_checksum binary(48)
);


create ram only table PostCodeSHA512Sums
(
PostCode char(8),
sha512_checksum binary(64)
);


create ram only table PostCodeCheckSums
(
PostCode char(8) CHARACTER SET LATIN1,
md5_checksum binary(16),
sha1_checksum binary(20),
sha224_checksum binary(28),
sha256_checksum binary(32),
sha384_checksum binary(48),
sha512_checksum binary(64)
);

The insert-select queries were:

insert into PostCodeMD5Sums
select
PostCode,
md5_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA1Sums
select
PostCode,
sha1_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA224Sums
select
PostCode,
sha224_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA256Sums
select
PostCode,
sha256_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA384Sums
select
PostCode,
sha384_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA512Sums
select
PostCode,
sha512_checksum(PostCode)
from
PostCodes
;


insert into PostCodeCheckSums
select
PostCode,
md5_checksum(PostCode),
sha1_checksum(PostCode),
sha224_checksum(PostCode),
sha256_checksum(PostCode),
sha384_checksum(PostCode),
sha512_checksum(PostCode)
from
PostCodes
;

Results And Conclusions

For the 1.7 million rows (1,691,721 to be exact), the times for the insert-select queries were:

md5_checksum: 0.62s
sha1_checksum: 0.53s
sha224_checksum: 0.63s
sha256_checksum: 0.67s
sha384_checksum: 0.83s
sha512_checksum: 0.86s
all 6 checksums: 2.87s

The results show that each postcode's check-sum is being calculated in around 500ns or less: the slowest query processed 1,691,721 rows in 0.86s, roughly 2 million check-sums per second, so this system could calculate about 170 billion check-sums a day. The Kognitio system was running on a node with two E5506 processors, giving 8 cores at 2.13GHz, and the node is several years old. It seems obvious to me that check-sums and/or hashes alone are not going to hide data from a system like this, even more so with faster, up-to-date hardware.

Comparing the "all 6 check-sums" insert-select query's run time with the others did show an appreciable overhead in retrieving and writing the row data: the six single-check-sum queries took 4.14s in total, 1.44 times longer than the 2.87s taken by the one query that populated a single lookup table with all six check-sums.
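
To make the de-anonymisation point concrete, here is a sketch of how such a lookup table could be used against a hypothetical table, AnonymisedRecords, in which someone has stored only the MD5 values of postcodes (that table and its md5_checksum column are made up for illustration):

select a.md5_checksum, p.PostCode
from AnonymisedRecords a
join PostCodeMD5Sums p on p.md5_checksum = a.md5_checksum;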

Because I used the Code-Point Open data, the following attributions apply:
Contains OS data © Crown copyright and database right (2016)

Contains Royal Mail data © Royal Mail copyright and Database right (2016)

Contains National Statistics data © Crown copyright and database right (2016)
