External tables and scripts and user mapping

27

Apr
2017
Posted By : Ben Cohen 0 Comment
external scripts, tables, user mapping, sql query
Categories :Blog, Kognitio How To

Introduction

Kognitio has two mechanisms for external programs to provide or operate on data during an SQL query: external tables and external scripts. These can be written in the user’s preferred language, for example R, Python, shell or even C.

External tables are used like normal tables in SQL queries but the rows are supplied by an external “connector” program rather than being stored by WX2. The connector could generate the rows itself
(TPC-H data, for example) or fetch them from another database, HDFS or an internet data source.

External scripts are table functions in SQL which take a subquery “argument” and return a table. They allow users to create complex analytics or achieve functionality that the database itself doesn’t provide. The script is run under the “script environment” interpreter with the subquery rows as input, and the output rows will be returned to the SQL query.

Here is an example of an external table and an external script. There are more examples in
wx2/current/examples/excon_* and wx2/current/examples/exscripts.sql under your WX2 install directory.

-- External table to return the current environment for each connector instance.
create connector EXCON_ENV
    -- adjust this path if required
    command '/opt/kognitio/wx2/current/examples/excon_env';
create external table ENV (CONNECTOR_NUM int, 
                           NAME varchar(10000),
                           VALUE varchar(10000))
    from EXCON_ENV;

-- Results: environment of each connector
--          e.g. (0, 'WX2_TNO', '629'), (0, 'WX2_METHOD', 'LOAD'), ...
select * from ENV;

-- External script to perform an arbitrary regular expression substitution
-- using the stream editor, sed.
create script environment SED command '/bin/sed -f';
create external script REGEXP environment SED
    receives (TEXT varchar) input 'format "%{quoted=0}s\\n"'
    sends (SUBST_TEXT varchar) output 'format "%0c\\n"'
    script 's/ko\(.*\)/i\1n/';

-- Results: ('ignition')
external script REGEXP from (select 'kognitio');

Kodoop

The following sections on user mapping apply only to Kognitio running in appliance mode.

For Kognitio on Hadoop, external connectors and scripts currently run as the same user that runs the Kodoop processes. This is expected to change in the future to a broadly similar scheme tailored to Hadoop.

Why map users?

The examples above are simple so it doesn’t matter which Linux user runs the external script or connector processes, but it can do in practice. You might want your script to run as a user that has already been set up, perhaps with SSH keys or a uid needed for NFS.

It is also preferable not to execute user-supplied code as a privileged user – or in this case one with access to database files.

Kognitio chooses the Linux user based on the Kognitio user and the query context. By default – unless it has been configured otherwise as described below – it will use the user wxextern, which is created when WX2 is installed. (But wxextern is not created automatically by upgrade from versions earlier than 8.1.)

Allowed Linux users

Kognitio doesn’t simply allow scripts or connectors to run as any Linux user. By default only wxextern is allowed. Any other user must be allowed by name or uid (user id) by editing the “[security]” section of the config file as follows.

A Linux user is allowed if:

  1. The user name is listed in the config file setting “ext_users_allowed” comma-separated list of user names; or
  2. The config file settings “min_ext_user_id” and “max_ext_user_id” are set to positive numbers with min_ext_user_id <= uid <= max_ext_user_id, and the user must not be listed in “ext_users_denied” comma-separated list of user names.

(By default ext_users_allowed is “$extern” and ext_users_denied is “root“. Strictly, in (2) only one of min_ext_user_id and max_ext_user_id needs to be non-negative, with the inequality ignored for the negative value.)

You need to restart WX2 (“wxserver start“) after changing any of these config file settings for them to take effect.

User map strings

Kognitio attempts to map from a Kognitio user to a Linux user using a user map string. This is a comma-separated list of any of the following elements:

User map element Linux user to match
name User name
& Same as Kognitio user (e.g. SYS)
$extern Kognitio extern user, by default wxextern
$admin Kognitio admin user, by default wxadmin
$master Kognitio master user, by default wxroot
$owner Kognitio owner user, by default root

It will try each name in turn and will use the first one which, when converted to lower case, gives the name of a Linux user which is allowed by the security parameters. If no valid match is found then it will fail with the error “ET0111: User mapping for the external operation was invalid”.

For example, for the user map string “kognitio,&,$extern” for a Kognitio user called MEPHISTOPHELES it will try Linux users kognitio, mephistopheles and wxextern.

Note that this is evaluated separately on each node, so it is recommended that the Linux user names and uids are synchronised across the cluster. Otherwise it is possible for different user names to be matched, the same name being matched but the uids differing, or names being matched on only a subset of nodes (the ET0111 error will be raised in this case).

If an ET0111 error is raised the Kognitio user and the user map string are logged, as of version 8.2, in the server debug log file (look at `wxlogd smd`/serverdbg.*), which should be helpful to see why it didn’t work. For example:

T_2017-04-21_09:06:12_BST: ET: Invalid user mapping (ET0111) on 'hp-rack2-enc2-11' for 'SYS' -> 'kognitio,&,$extern'

Which user map string is used?

When Kognitio is about to launch an external script or connector it obtains a user map string. It first looks in IPE_EXTERNAL_USER_MAP for a user map string based on the execution context; if it doesn’t find one then it uses at the config file setting “[boot options] default_usermap“, which defaults to “&,$extern“. It then tries to map the current Kognitio user to a Linux user as described above.

For external scripts the search is performed in this order matching against the columns (OBJECT_TYPE, OBJECT_ID, USER_ID) of IPE_EXTERNAL_USER_MAP.

OBJECT_TYPE OBJECT_ID USER_ID
PRIVDOMAIN_EXT_SCRIPT(13) Script Id UserId
PRIVDOMAIN_EXT_SCRIPT(13) Script Id -1
PRIVDOMAIN_SCRIPT_ENV(10) Script Environment Id UserId
PRIVDOMAIN_SCRIPT_ENV(10) Script Environment Id -1
PRIVDOMAIN_EXT_SCRIPT(13) -1 User Id
PRIVDOMAIN_EXT_SCRIPT(13) -1 -1
-1 -1 User Id
-1 -1 -1

For external tables the search is performed in this order matching against the columns (OBJECT_TYPE, OBJECT_ID, USER_ID) of IPE_EXTERNAL_USER_MAP.

OBJECT_TYPE OBJECT_ID USER_ID
PRIVDOMAIN_TABLE(3) Table Id User Id
PRIVDOMAIN_TABLE(3) Table Id -1
PRIVDOMAIN_CONNECTOR(9) Connector Id User Id
PRIVDOMAIN_CONNECTOR(9) Connector Id -1
PRIVDOMAIN_TABLE(3) -1 User Id
PRIVDOMAIN_TABLE(3) -1 -1
-1 -1 User Id
-1 -1 -1

The administrator can add and remove mappings in IPE_EXTERNAL_USER_MAP using this special syntax.

CREATE EXTERNAL USER MAP
  [OBJECT_TYPE=[EXTERNAL SCRIPT|SCRIPT ENVIRONMENT|EXTERNAL TABLE|CONNECTOR],]
  [OBJECT_ID = 1234,]
  [USER_ID = 1234,]
  USERMAP = '<user map>';

DROP EXTERNAL USER MAP
  [OBJECT_TYPE=[EXTERNAL SCRIPT|SCRIPT ENVIRONMENT|EXTERNAL TABLE|CONNECTOR],]
  [OBJECT_ID = 1234,]
  [USER_ID = 1234];

Simple performance checks against your hardware cluster

24

Mar
2017
Posted By : Simon Darkin 1 Comment
performance hardware cluster, cpu. benchmarks

Kognitio have a lot of experience commissioning clusters of new hardware for our MPP software product. As part of that process, we’ve developed a number of steps for validating the performance of new clusters, and these are the topic of this blog entry.

 

There are many Linux based benchmarking tools on the market however they are not usually installed by default, in which case some simple command line tools can be used to quickly establish if there is a potential hardware issue that warrants further investigation.    The following hardware components are covered by this topic:

  • CPU
  • Disk
  • Networking
  • RAM

 

CPU   

A slow CPU or core could have an adverse effect on query performance, and so with the use of basic command line tools you can help identify laggards.  A background ‘for’ loop can be employed to ensure all cores/threads are tested simultaneously.

 

Integer arithmetic test

Invoked 8 times to run simultaneously against 8 cores

 

for i in `seq 1 8`; do time $(i=0; while (( i < 999999 )); do (( i ++ )); done)& done; wait

 

this will return the time taken to increment an integer over the specified range.  A comparison of the time taken by each core will help identify outliers

 

real    0m8.908s
user    0m8.765s
sys     0m0.140s

real    0m8.943s
user    0m8.789s
sys     0m0.156s

real    0m8.997s
user    0m8.761s
sys     0m0.112s

real    0m9.000s
user    0m8.853s
sys     0m0.144s

real    0m9.023s
user    0m8.881s
sys     0m0.140s

real    0m9.028s
user    0m8.861s
sys     0m0.168s

real    0m9.034s
user    0m8.857s
sys     0m0.176s

real    0m9.073s
user    0m8.781s
sys     0m0.156s

 

Whilst the test is running you can check that each core is under load by running top and expanding the output to show all cores.  If you do encounter outliers in the arithmetic test then you can use the output from top to identify which core(s) remain busy when others have finished

 

Cpu0  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu1  : 99.0%us,  1.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu2  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu3  : 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu4  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu5  : 98.0%us,  2.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu6  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu7  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 

Compression test

As with the arithmetic test this example loops around 8 times so that 8 cores are tested simultaneously.  Data is written to /dev/null to avoid any overhead associated with disk IO.

 

for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait

 

this will return the rate at which each core is able to compress 1 GB of data

 

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9545 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9799 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.9831 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.0085 seconds, 83.3 MB/s

1000000000 bytes (1.0 GB) copied, 12.0382 seconds, 83.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.2655 seconds, 81.5 MB/s

 

With Kognitio software installed you can use the wxtool command  to run the compression test simultaneously against all database nodes to aid comparison across the cluster as a whole.  You can download the software for free at http://kognitio.com/free-download/

 

wxtool -a '{can DB}' -S 'for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait'

 

For node kap1-1 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.9422 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9659 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 11.9876 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.0142 seconds, 83.2 MB/s

1000000000 bytes (1.0 GB) copied, 12.1293 seconds, 82.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3754 seconds, 80.8 MB/s

1000000000 bytes (1.0 GB) copied, 12.4132 seconds, 80.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.4386 seconds, 80.4 MB/s

For node kap1-3 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.8398 seconds, 84.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.8661 seconds, 84.3 MB/s

1000000000 bytes (1.0 GB) copied, 11.8893 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9165 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.946 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.953 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9637 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.2996 seconds, 81.3 MB/s

For node kap1-3 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.8757 seconds, 84.2 MB/s

1000000000 bytes (1.0 GB) copied, 11.8846 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9178 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9243 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9377 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9834 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3367 seconds, 81.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.3942 seconds, 80.7 MB/s

For node kap1-4 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.91 seconds, 84.0 MB/s

1000000000 bytes (1.0 GB) copied, 11.9291 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9448 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9498 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.1232 seconds, 82.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.3896 seconds, 80.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.4449 seconds, 80.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.4504 seconds, 80.3 MB/s

 

 

 

DISKS

Having just one underperforming disk in the system can significantly impact query performance against disk based tables. Here are some simple tests to help identify any anomalies.

 

Iterative write speed test with dd.

 

WARNING: THIS WRITE TEST WILL OVERWRITE DATA AT THE SPECIFIED DESTINATION AND SO MAKE SURE YOU AVOID AREAS THAT CONTAIN SYSTEM DATA OR USER DATA THAT YOU WISH TO KEEP

 

for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done

 

this will return the duration and rate at data can be written out to disk.  In this example 1 GB of data is repeatedly written to a raw partition.  Note that fsync is used to flush the writeback cache and ensures data is written to the physical media.

 

Loop 1
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 13.6466 seconds, 73.3 MB/s

Loop 2
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.8324 seconds, 77.9 MB/s

Loop 3
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.4271 seconds, 80.5 MB/s

 

With Kognitio software installed, the test can be expanded to run on all database nodes allowing for easy comparison of all disks in the system

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done'

 

Iterative read speed test with dd

 

for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done

 

this will return the rate at which data can be read from disk.  In this example 1 GB of data is being read from a raw partition, adjusting the offset and flushing the buffer on each iteration to ensure data is being read from the physical media.

 

Loop 1 - skip = 5000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 14.4355 seconds, 69.3 MB/s

Loop 2 - skip = 10000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.9884 seconds, 77.0 MB/s

Loop 3 - skip = 15000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.6045 seconds, 79.3 MB/s

 

With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done'

 

Iterative read speed test with hdparm

 

for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done

 

this will return the rate at which data can be read sequentially from disk without any file system overhead.

 

Loop 1
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  236 MB in  3.01 seconds =  78.40 MB/sec

Loop 2
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  236 MB in  3.02 seconds =  78.09 MB/sec

Loop 3
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  230 MB in  3.01 seconds =  76.30 MB/sec

 

With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done'

 

Disk based table scan

 

If the cluster is running Kognitio database software you can initiate a scan of a large disk based table and review the output from wxtop in order to spot any disk store processes that remain busy for a significant period after others have finished.   For accurate results you should ensure there is no concurrent activity when performing this test.

 

select *
from <large disk based table>
where <condition unlikely to be true>;

 

Monitor the output from wxtop and look out for any disk store processes that remain busy when all or most others have finished.

 

PID       NODE        PROCESS                           SIZE      TIME
15784       kap1-1      WXDB(55): Diskstore             258036       100
22064       kap1-2      WXDB(18): Diskstore             257176        86
25179       kap1-3      WXDB(73): Diskstore top         258200        84
31237       kap1-4      WXDB(37): Diskstore             258068        77

 

If a disk store process does appear to lag behind, then you should eliminate the possibility of it being attributable to data skew by checking the row counts across all of the disks using the following query

 

select
mpid,
sum(nrows) nrows
from ipe_ftable
where table_id = <table_id being scanned>
group by 1
order by 2 desc;

 

 

NETWORKING

You can test the network links between nodes using some simple netcat commands.  This will allow you to spot links that are underperforming.

 

Link speed test using dd and netcat

 

The name and options associated with the netcat binary will depend on the Linux installation, however with Kognitio software installed you can use wxnetread and wxnetwrite for the data transfer regardless.

 

Setup a listening process on the node performing the read

 

netcat -l -p 2000 > /dev/null &

 

use dd to generate some data and pipe through netcat to the IP and port of the node performing the read

 

dd if=/dev/zero bs=1000 count=1000000 | netcat 14.4.21.2 2000

 

this will return the rate at which data can be copied over the link

 

1000000000 bytes (1.0 GB) copied, 8.54445 seconds, 117 MB/s

 

The same test as above this time using wxnetread/wxnetwrite

 

wxnetread -l -p 2000 > /dev/null &

 

dd if=/dev/zero bs=1000 count=1000000 | wxnetwrite 14.4.21.2 2000

 

1000000000 bytes (1.0 GB) copied, 8.5328 seconds, 117 MB/s

 

Shape tests

With Kognitio software installed you can run Shape tests to measure the speed at which RAM based data is re-distributed between nodes

 

wxtester -s <dsn> -u sys -p <password> -Ishape 5000 9000 1

 

Once the tests have been running for a few minutes you can navigate to the logs directory and check the data rate

 

cd `wxlogd wxtester`
grep TSTSHN results | gawk '{ if ($5==64) print ((($3*$5)/$6)/<number of database nodes>)/1048576 }'

 

With older generation hardware you can expect to see performance of  40MB/s/node given sufficient network bandwidth.       With newer hardware, for example HP Gen9 servers with 2x 56Gb/s links per node this increases to 90MB/s/core.

 

 

RAM

Benchmarking RAM performance is best left to a dedicated test suite, however you can perform a very simple write/read speed test using dd in conjunction with a temporary file storage facility in RAM, which at the very least can show up a mismatch in performance between nodes.

 

Write and read speed test using dd

 

mkdir RAM
mount tmpfs -t tmpfs RAM
cd RAM
dd if=/dev/zero of=ram_data bs=1M count=1000

 

this will return the rate at which data is written to RAM

 

1048576000 bytes (1.0 GB) copied, 1.06452 seconds, 985 MB/s

 

dd if=ram_data of=/dev/null bs=1M count=1000

 

this will return the rate at which data is read from RAM

 

1048576000 bytes (1.0 GB) copied, 0.6346 seconds, 1.7 GB/s

 

Participate in the Kognitio Console beta test program

10

Mar
2017
Posted By : Michael Atkinson Comments are off
Kognitio console, beta test program
Categories :Blog, Kognitio How To

Kognitio Console is Kognitio’s client side management program for the Kognitio Analytical Platform.

Some of its features are:

  • It allows inspection of the metadata tree for schemas, tables, views, external scripts, users, connectors, sessions, queues, etc.
  • It also gives an object view to each of these metadata objects, allowing their inspection and management.
  • There are also lots of tools, wizards and widgets to browse data in Hadoop, load and unload data, identify problem queries and many more.
  • There are also a set of reports and dashboards to monitor the state of Kognitio systems.
  • Macros may be written to extend Kognitio Console, the reports and dashboards are written in these XML macros.
  • Ad-Hoc queries may be executed.
  • KogScripts may be executed and debugged. KogScript is based on Lua but has Kognitio specific enhancements to run SQL natively.
  • It is an integrated development environment (IDE) for KogScripts, and external scripts running in Kognitio.

All this makes Kognitio Console a versatile tool, suitable for database admins, analysts and power users.

Kognitio Console is in constant development, beta and release candidate versions are provided from our update website. Beta and release candidates will be announced on the Kognitio Console forums

There are two ways of obtaining these betas:

Note that this is beta software, so has not gone through a full QA. However, we endeavour to make sure it has no obvious bugs and it will have passed our Console smoke tests. If you experience crashes or other bugs or deficiencies in these beta and release candidates, please report them to us, this will help us make the next release more stable.

By installing it into a different location than the default, you can have both this and the last stable releases installed at the same time.

The differences between KogScript running in Console and kogscript.exe

24

Feb
2017
Posted By : Michael Atkinson Comments are off
KogScript, Lua, SQL, programming, how to, console window
Categories :Blog, Kognitio How To

KogScript is a general purpose programming language based on Lua. It may be run within Console, which acts as an integrated development environment for it, or run from the command line tool kogscript.exe.

The KogScript running within Console and via kogscript.exe are compiled from the same source files and are identical with the following exceptions.
 

sqldebug

In kogscript.exe the global variable sqldebug is set to 3. The sqldebug variable is used to control how much information is output on each sql query within the KogScript as it is executed. A level of 3 shows the query status, and table contents.

In Console sqldebug is not defined, so it acts as if it had a value of 0, this means no information is output to the terminal on sql execution, instead Console has integration of queries within the script view as info and table and stored in the script history.

withinConsole

When running within Console the global variable withinConsole is set to true
 

Console integration

When running within Console there are two extra global tables holding data and methods used to integrate scripts with Console (within the table kog) and access Qt objects and methods (within the table qt).

An example of how to use the qt methods may be found in the Examples tool as “Using Qt from Kog Scripts”. This shows how to open Qt dialogs, and use Qt grids and views.

The kog global table holds functions to access the main elements of the Console application user interface. From those it is possible to access the rest of the user interface. There are about 100 classes and thousands of objects available. To find out which methods, variables and enumerations are available for one of the kog functions use list(), as in:

mw = kog.mainWindow()
list(mw)

Which gives outout like:

 

Loader and Unloader

To run either of these from KogScript, the script needs to be run within Console as they make use of Console integration to run the Loader or Unloader tools built into Console. The example below shows how to run Unloader from within Console.

-- Get a reference to the console main window.
mw = kog.mainWindow();

config =[[
FILENAME=C:/TEMP/simplevector_test.csv
TABLE="MIKE"."A_SIMPLE_VECTOR"
AUTOUNLOAD=1
SILENT=1]];

-- Now run the unloader using the configuration above. 
mw:runUnloader(config)

 

Conclusion

KogScript within Console and the kogscript.exe command line tools are identical as far as running SQL queries and the Lua language.

Where they differ is that within Console there are extra features to enable integration with Console itself.

How to generate an identity / auto_increment column in Kognitio

17

Feb
2017
Posted By : Michael Atkinson Comments are off
Categories :Blog, Kognitio How To

Occasionally we would like the value of the primary key field to be created automatically every time a new record is inserted. That is we would like an auto-incrementable field in a table.

What other databases do

Other databases use a variety of syntax to do this.

MySQL uses an AUTO_INCREMENT keyword, which is used within the table definition

CREATE TABLE fruit (
         ID int NOT NULL AUTO_INCREMENT,
         name varchar(255) NOT NULL,
         PRIMARY KEY (ID) );

 

The SQL Server syntax uses the IDENTITY keyword, again within the table definition, in this example starting at 1 and incrementing by 1.

CREATE TABLE city (
         ID int IDENTITY(1,1) PRIMARY KEY,
         name varchar(255) NOT NULL,
         PRIMARY KEY (ID) );

 

For Oracle you first need to create a sequence and then use its nextval attribute during inserts.

CREATE SEQUENCE seq_people
        MINVALUE 1
        START WITH 1
        INCREMENT BY 1
        CACHE 10;
INSERT INTO People (ID,FirstName,LastName)
VALUES (seq_people.nextval,'Mike','Atkinson');

 

Kognitio’s row_number syntax.

Kognitio does not have a way of creating a unique value for a field directly in the table definition, instead it has two ways of adding unique values to resultsets; and then these resultsets may be inserted into a table. The first uses the ROW_NUMBER function.

The ROW_NUMBER function assigns to each row in the partition or query a sequence number starting from one. This is only a number used in the context of the result set, if the result changes, the ROW_NUMBER will change. The ROW_NUMBER expression takes an ORDER BY statement with the column to be used for the row count with an OVER operator.

In the example below we use ROW_NUMBER to replace the subject name with an id value which is a foreign key reference into the subjects table.

set schema yorschema;
create table facts_origin (subject varchar, predicate varchar, object varchar);
create table subjects (id int, name varchar not null, primary key(id) );
create table facts (subject int, predicate varchar, object varchar, 
               foreign key (subject) references subjects(id) );

insert into facts_origin values ('Peter', 'Brother', 'Sarah'), 
                                                           ('Peter', 'Brother', 'Anne'), 
                                                           ('John', 'Father', 'Peter');


insert into subjects
select with
    ids(name) as (select distinct subject from facts_origin)
select row_number() over (order by name) as "id", name  from ids;

insert into facts
    select s.id, f.predicate, f.object
        from facts_origin f, subjects s
        where f.subject=s.name;

select * from subjects;
select * from facts;
Which gives a subjects table of: And a facts table of:
ID NAME
1 John
2 Peter
SUBJECT PREDICATE OBJECT
1 Father Peter
2 Brother Anne
2 Brother Sarah

 

Kognitio’s generate_key syntax.

For an individual SELECT statement the GENERATE_KEY function generates a unique key for each row. These rows may then be inserted into a table.

The result is an INT8 data type with a value greater than or equal to zero. The results are not typically contiguous, but are guaranteed to be unique for an individual SELECT statement. The values generated are dependent upon the number of Kognitio nodes and the distribution of the data. Rerunning a query may not generate the same results. Multiple occurrences of GENERATE_KEY in a SELECT list will all produce the same result within a single row.

GENERATE_KEY can only be used in the SELECT list, it cannot be used in WHERE, HAVING, GROUP BY or ORDER BY clauses. Within the SELECT list you can perform arithmetic on the columns containing the GENERATE_KEY function.

insert into subjects
select with
    ids(name) as (select distinct subject from facts_origin)
select generate_key(), name  from ids;

The Kognitio SQL Guide has examples of good practice for ensuring that addition INSERT statements have unique keys and how to use them in ETL processes.

 

A more complete example

The following example shows many techniques, it uses both row_number and generate_key functions and illustrates how to use them to add entries to existing tables.

It also illustrates the use of KogScript functions, KogScript variables and a more complex “Grandparent” query.

set schema yourschema;

-- drop tables, they may have been created by previous runs of the script
drop table facts_origin;
drop table facts;
drop table entities;
drop table preds;

-- create tables to use
create table facts_origin (subject varchar, predicate varchar, object varchar);
create table entities (id int, name varchar not null, primary key(id) );
create table preds (id int, name varchar not null, primary key(id) );
create table facts (subject int, predicate int, object int,
               foreign key (subject) references entities(id),
               foreign key (predicate) references preds(id),
               foreign key (object) references entities(id) );

-- add some facts, a small genealogy
insert into facts_origin
     values ('Peter', 'Brother', 'Sarah'), ('Peter', 'Brother', 'Anne'), ('John', 'Father', 'Peter');
insert into facts_origin
     values ('Peter', 'Father', 'James'), ('Jane', 'Mother', 'Peter'), ('Mary', 'Mother', 'John');

-- create the entities table
insert into entities
select with
    ents(name) as (select distinct subject from facts_origin union select object from facts_origin)
select row_number() over (order by name) as "id", name  from ents;


-- create the preds table
insert into preds
select with
    ids(name) as (select distinct predicate from facts_origin)
select generate_key(), name  from ids;

-- create the facts table, this uses references to the entities and preds tables
insert into facts
    select s.id, p.id, o.id
        from facts_origin f, entities s, preds p, entities o
        where f.subject=s.name and f.predicate=p.name and f.object=o.name;

-- check the tables are what we expect
select * from entities;
select * from preds;
select * from facts;

-- Query for all facts where 'Peter' is the subject
with
  -- select the subject id
  s(id) as (select id from entities where name='Peter'),
  -- Filter the facts table for that subject id
  f(subject, predicate, object) as (select subject, predicate, object from facts, s 
                        where subject in (s.id) )
  -- Then join the filter results with the subjects table to get the subject name
select s.name as "SUBJECT", p.name, o.name as "OBJECT"
    from f, entities s, preds p, entities o 
    where f.subject=s.id and f.predicate=p.id and f.object=o.id;


-- Use a function to add entities if they do not already occur in the entities table.
function addEntity(p)
   local t = select id from entities where name='$p';
   if #t.rows~=1  then
       insert into entities
           select mk + 1, '$p' from (select max(id) from entities) dt(mk);
                 t = select id from entities where name='$p';
       assert(#t.rows==1)
   end
   return t.rows[1][1];
end

-- Use a function to add predicates if they do not already occur in the preds table.
function addPred(p)
   local t = select id from preds where name='$p';
   if #t.rows~=1  then
       insert into preds
           select generate_key() + mk + 1, '$p' from (select max(id) from preds) dt(mk);
                 t = select id from preds where name='$p';
       assert(#t.rows==1)
   end
   return t.rows[1][1];
end


-- adding facts
function addFact(s,p,o)
    local sub = addEntity(s) – use local to avoid polluting the global namespace
    local pred = addPred(p)
    local obj = addEntity(o)
    local t = select count(*) as "count" from facts
                               where subject=$sub and predicate=$pred and object=$obj;
    if #t.rows~=1 or t.rows[1].count==0 then
         insert into facts values ($sub,$pred,$obj);
    end

end

addFact('Garry', 'Father', 'Mary')


-- grandparent view.

parent = addPred('Parent')
grandparent = addPred('Grandparent')

-- Now form the query as a view, so that it may be reused easily
create view grandparent
select with
    -- Define the father_of subquery
    father_of(s,p,o) as  (select subject, predicate, object from facts
                 where predicate in (select id from preds where name='Father') ),
    -- Define the mother_of subquery
    mother_of(s,p,o) as (select subject, predicate, object from facts
                 where predicate in (select id from preds where name='Mother') ),
    -- Define the parent_of subquery, this uses the Parent predicate id we have defined above
    parent_of(s,p,o) as  (select s,$parent,o from father_of union
                 select s,$parent,o from mother_of),
    -- Define the grandparent_of subquery, this uses the Grandparent predicate id we 
    -- have defined above
    grandparent_of(s,p,o) as (select p.s, $grandparent, c.o from parent_of p, parent_of c where p.o=c.s)
-- Now output the result in a human readable form
select s.name as "SUBJECT", p.name as "PREDICATE", o.name as "OBJECT"
    from grandparent_of f, entities s, preds p, entities o 
    where f.s=s.id and f.p=p.id and f.o=o.id;


-- Now we can perform various queries
-- Who are John's grandchildren?
select object as "Grandchild" from grandparent where subject='John';

-- Who are Peter's grandparents?
select subject as "Grandparent" from grandparent where object='Peter';

-- Is Jane the grandparent of James?
select 'Yes' from grandparent where subject='Jane' and object='James';

And here is the entity diagram for the facts table

 

Summary

Various databases have different ways of creating unique values in a field. The Kognitio Analytical Platform has two methods GENERATE_KEY and ROW_NUMBER, both of which generate unique values in a subquery that may then be inserted into a table, giving a column in the table with a unique value as a result.

Kits, cats, sacks, wives: unfolding JSON data into tables

14

Feb
2017
Posted By : Graeme Cole Comments are off
unfolding json data into tables, csv
Categories :Blog, Kognitio How To
Tags :  ,,

Each wife had seven sacks…

Converting simple, flat JSON data into CSV or another tabular format is a problem addressed by many freely-available JSON-to-CSV converters. However, these are usually only useful when the JSON is a series of flat objects without any nested arrays.

In practice, real-world JSON files often have nested arrays containing more objects, each of which may contain arrays of their own. They tend to look like the old riddle As I was going to St Ives:

As I was going to St. Ives,
I met a man with seven wives.
Each wife had seven sacks,
Each sack had seven cats,
Each cat had seven kits.
Kits, cats, sacks, wives,
How many were there going to St. Ives?

(more…)

What’s new in Kognitio Console version 80199s170123

07

Feb
2017
Posted By : Michael Atkinson Comments are off
Whats new in Kognitio Console
Categories :Blog, Kognitio How To

SQL in expression contexts

 Up until this release SQL could only occur in Lua statements or assignments. For example in Kognitio Scripts we can write SQL within the script as a statement or assigning the results table to a variable.

drop table hr.employee;
t = select * from sys.ipe_user; 

This release of Kognitio Console adds the ability to place SQL in Lua expression contexts. So now within Kognitio Scripts we can place SQL in most places where a Lua expression may occur.

In the example below the data for the Kognitio users are read from the table sys.ipe_user as a Lua table. The rows element of that table is then passed to the apply function which passes each row in turn to the printRow function. The printRow function prints the NAME element of the row to the console, which is attached to the Kognitio Console Script Variables pane.

function apply(t, fn)
    for i, row in pairs(t) do fn(row) end
end
 
function printRow(row) print(row.NAME) end
 
apply( (select * from sys.ipe_user).rows, printRow) 

This feature currently has the limitation that it is not easy for the parser to determine where the SQL query ends. The code below fails because the ‘,’ character does not end an SQL query so it thinks that the , printRow is part of the query. The failure is at run time when the SQL is parsed in the Kognitio Analytical Platform and the SQL select * from sys.ipe_user, printRow errors.

function apply(t, fn)
    for i, row in pairs(t.rows) do fn(row) end
end
 
function printRow(row) print(row.NAME) end
 
apply(select * from sys.ipe_user, printRow) -- ERRORS: , does not end the SQL query

Instead we use ‘;’ to end the SQL expression:

apply(select * from sys.ipe_user; printRow) – Note use of ; to end SQL

Of particular use while debugging scripts is the form:

– any expression of the form print(<sql query>), e.g.
print(select * from sys.ipe_user) 

which outputs the formatted query results to Kognitio Console “Script Variables” pane.

Lua lfs module

The Lua File System module adds common file system manipulation commands to Lua. The lfs module is adapted from the Kepler project  and offers a portable way to access the underlying directory structure and file attributes.

lfs.attributes (filepath [, aname | atable])

Returns a table with the file attributes corresponding to filepath (or nil followed by an error message in case of error). If the second optional argument is given and is a string, then only the value of the named
attribute is returned (this use is equivalent to
 lfs.attributes(filepath)[aname], but the table is not created and only one attribute is retrieved from the O.S.). if a table is passed as the second argument, it is filled with attributes and returned instead of a new table. The attributes are described as follows; attribute mode is a string, all the others are numbers, and the time related attributes use the same time reference of os.time:

dev

on Unix systems, this represents the device that the inode resides on. On Windows systems, represents the drive number of the disk containing the file

ino

on Unix systems, this represents the inode number. On Windows systems this has no meaning

mode

string representing the associated protection mode (the values could be file, directory, link, socket, named pipe, char device, block device or other)

nlink

number of hard links to the file

uid

user-id of owner (Unix only, always 0 on Windows)

gid

group-id of owner (Unix only, always 0 on Windows)

rdev

on Unix systems, represents the device type, for special file inodes. On Windows systems represents the same as dev

access

time of last access

modification

time of last data modification

change

time of last file status change

size

file size, in bytes

permissions

file permissions string

blocks

block allocated for file; (Unix only)

blksize

optimal file system I/O blocksize; (Unix only)

This function uses stat internally thus if the given filepath is a symbolic link, it is followed (if it points to another link the chain is followed recursively) and the information is about the file it refers to. To obtain information about the link itself, see function lfs.symlinkattributes.

lfs.chdir (path)

Changes the current working directory to the given path.

Returns true in case of success or nil plus an error string.

lfs.currentdir ()

Returns a string with the current working directory or nil plus an error string.

iter, dir_obj = lfs.dir (path)

Lua iterator over the entries of a given directory. Each time the iterator is called with dir_obj it returns a directory entry’s name as a string, or nil if there are no more entries. You can also iterate by calling dir_obj:next(), and explicitly close the directory before the iteration finished with dir_obj:close(). Raises an error if path is not a directory.

lfs.link (old, new[, symlink])

Creates a link. The first argument is the object to link to and the second is the name of the link. If the optional third argument is true, the link will by a symbolic link (by default, a hard link is created).

lfs.lock (filehandle, mode[, start[, length]])

Locks a file or a part of it. This function works on open files; the file handle should be specified as the first argument. The string mode could be either r (for a read/shared lock) or w (for a write/exclusive lock). The optional arguments start and length can be used to specify a starting point and its length; both should be numbers.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

fs.lock_dir (path, [seconds_stale])

Creates a lockfile (called lockfile.lfs) in path if it does not exist and returns the lock. If the lock already exists checks if it’s stale, using the second parameter (default for the second parameter is INT_MAX, which in practice means the lock will never be stale. To free the the lock call lock:free()

In case of any errors it returns nil and the error message. In particular, if the lock exists and is not stale it returns the “File exists” message.

lfs.mkdir (dirname)

Creates a new directory. The argument is the name of the new directory.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.rmdir (dirname)

Removes an existing directory. The argument is the name of the directory.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.setmode (file, mode)

Sets the writing mode for a file. The mode string can be either “binary” or “text”.

Returns true followed the previous mode string for the file, or nil followed by an error string in case of errors. On non-Windows platforms, where the two modes are identical, setting the mode has no effect, and the mode is always returned as binary.

lfs.symlinkattributes (filepath [, aname])

Identical to lfs.attributes except that it obtains information about the link itself (not the
file it refers to). On Windows this function does not yet support links, and is identical to
 lfs.attributes.

lfs.touch (filepath [, atime [, mtime]])

Set access and modification times of a file. This function is a bind to utime function.

The first argument is the filename, the second argument (atime) is the access time, and the third argument (mtime) is the modification time. Both times are provided in seconds (which should be
generated with Lua standard function 
os.time). If the modification time is omitted, the access time provided is used; if both times are omitted, the current time is used.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.unlock (filehandle[, start[, length]])

Unlocks a file or a part of it. This function works on open files; the file handle should be specified as the first argument. The optional arguments start andlength can be used to specify a starting point and its length; both should be numbers.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

Improvements to Kognitio Script execution and debugging

Metatables with a __tostring method have been added to the table returned by an SQL query. This allows formatted printing of those tables.

t = select * from sys.ipe_user;
print(t) –[[ prints the text below:
|NAME         |ID|DFLT_SCNO|SEC_CLASS|SCH_CLASS|STATUS|PTIME              |
|SYS          |0 |0        |0        |0        |0     |2016-12-21 11:19:57|
|GRP_LOG      |3 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_MONITOR  |1 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_DISKUSAGE|4 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_DBG      |2 |0        |0        |0        |8     |2016-12-21 11:20:15|
]]

This may also be used in the kogscript.exe command line tool.

The SQL error string is added to the table returned from the SQL query, and this is also printed out by the metatable __tostring method

t = select * from nonexistent;
if t.status ~= 0 then 
    print("Query failed: " .. t.error) --[[ prints the text below:
    Query failed: ERROR: [Kognitio][WX2 Driver][latest] CI3013: Table SYS.NONEXISTENT does not exist, offset 15
    ]]
end
 
print(t) –[[ prints the text below:
ERROR: [Kognitio][WX2 Driver][latest] CI3013: Table SYS.NONEXISTENT does not exist, offset 15
]]

Breakpoints may now also be placed on Lua statements within the Kognitio Script, even if the “Trace Kog Script Lines” option is not set. After the breakpoint is hit, single stepping then steps through each Lua line.

Timestamps are now converted to a table with year, month, day, hour, min and sec fields:

t = select * from sys.ipe_allview_img;
list(t.rows[1].CREATE_TIME) –[[ prints the text below:
lua::number day = 16
lua::number min = 45
lua::number month = 1
lua::number year = 2017
lua::number hour = 10
lua::number sec = 34
]]

Playing Around With Kognitio On Hadoop: Hashes Cannot Always Anonymise Data

30

Jan
2017
Posted By : Chris Barratt Comments are off
kognitio data scripts how to

In this brave new world of big data one has to be careful of the loss of privacy with the public release of large data sets. Recently I reread “On Taxis And and Rainbows” that presents an account of how the author came to de-anonymise a data set made available by New York City’s Taxi and Limousine Commission. This data set included logs of each journey made: locations and times of pickups and drop-offs along with supposedly anonymised medallion numbers and other data. The author came to identify the medallion number data as the MD5 check-sums of the medallion numbers. They were then able to compute a table mapping 22 million possible medallion numbers to their MD5 check-sums. They reported that this took less than 2 minutes. This table allowed them to start ‘de-anonymising’ the data.

The article identified that this was possible because of the nature of the data. The set of all possible medallion numbers was known and small enough for a table mapping from the data for check-sums to be generated for each value in a feasible amount time.

Computing a large number of check-sums should be easy for a Kognitio system to handle. As several check-sum functions are implemented as SQL functions in the Kognitio system, I wondered how quickly Kognitio could generate a table mapping some potentially personally identifying information to check-sums using the functions.

As every UK address has a postcode I thought that postcodes would form a good candidate for a piece of personally identifying information. They are roughly equivalent to US ZIP codes. It seems plausible to me that someone might attempt anonymise postcodes using check-sums. So I thought I would try using Kognitio to compute tables of postcodes and their check-sums to see how quickly such an anonymisation strategy might be broken.

I found an almost complete list of valid postcodes in the “Code-Point Open” data set. Handily, this is freely available from the Ordnance Survey under the Open Government Licence. The data set contains about 1.7 million records each with a postcode field and other fields for related geographical data. The field titles are:

Postcode,Positional_quality_indicator,Eastings,Northings,Country_code,NHS_regional_HA_code,NHS_HA_code,Admin_county_code,Admin_district_code,Admin_ward_code

The data set is provided as a zip archive containing 120 CSV files with a total uncompressed size of 154MiB.

Getting The Kognitio System Check-sum Functions Timed

For the check-summing task I created a new Kognitio on Hadoop system (the ‘Kognitio system’ from here on) to run on a small developmental Hadoop cluster that I have access to. I copied the CSV files into a new HDFS directory on the cluster running the Kognitio system.

For all the queries/SQL commands I used the ‘sys’ superuser of the Kognitio system. I was the only one who was going to use the Kognitio system so there was no need to protect data from anyone but myself and I could avoid setting up unneeded users and privileges for anything in the system.

I created an external table to give the Kognitio system access to the CSV data on HDFS. This kind of table can’t be written to but can be queried like any other ‘regular’ table on the Kognitio system. Then I created a view on the external table to project only the postcodes. I followed this by creating a view image on the view to pin the postcode only rows of the view in memory. It is faster to access rows in memory than anywhere else.

Then I created several ram-only tables to store postcodes with their check-sums. Ram-only tables as the name suggests do not store row data on disk and so using them avoids any slow disk operation overheads.

Finally I ran and timed a series of insert-select queries that computed the check-sums for our postcodes and inserted them into the ram-only tables.

Some Background On External Tables, Connectors and Plugin Modules

Kognitio’s external tables present data from external sources as a set of records of fields using a connector. The list of fields and the connector are specified by the create external table statement.

The connector in turn specifies the data provider to use along with any connector specific configuration parameter settings required. The data provider implements the interface to the external data source for the internals of the Kognitio system. Data providers are also known as connectors but I wanted to avoid potential confusion from shared names.

The data provider that I used to access the HDFS data is implemented as a plugin connector provided by a plugin module. Plugin Modules are how Kognitio system implements some features that are not in the core code. Loading and activating the plugin modules is required to make their features available to the Kognitio system. Plugin modules can also present parameters used for any module specific configuration required.

Creating The External Table: Making the Code-Point Open Data Accessible to the Kognitio System

The external table that I needed to access the data on the HDFS system required a connector that used the HDFS data provider. This is implemented as a plugin connector from the Hadoop plugin module. So I loaded the Hadoop plugin module, made it active and set the libhdfs and hadoop_client plugin module parameters.

create module hadoop;
alter module hadoop set mode active;
alter module hadoop set parameter libhdfs to '/opt/cloudera/parcels/CDH/lib64/libhdfs.so';
alter module hadoop set parameter hadoop_client to '/opt/cloudera/parcels/CDH/bin/hadoop';

The parameters indicate to the HDFS data provider where the libhdfs library and hadoop executable are on my Hadoop cluster.

Then I could create the connector that was needed:

create connector hdfs_con source hdfs target ' namenode hdfs://172.30.251.105:8020, bitness 64 ';

The create connector statement specifies where the namenode is and to use 64 bit mode, which is best.

Then I could create the external table:

create external table OSCodePoints
(
PostCode char(8),
Positional_quality_indicator varchar(32000),
Eastings varchar(32000),
Northings varchar(32000),
Country_code varchar(32000),
NHS_regional_HA_code varchar(32000),
NHS_HA_code varchar(32000),
Admin_county_code varchar(32000),
Admin_district_code varchar(32000),
Admin_ward_code varchar(32000)
)
from hdfs_con target
'file "/user/kodoopdev1/OSCodePoint2016Nov/*.csv"'
;

The ‘file’ parameter specifies which HDFS files contain the data I want to access.

I created a view with a view image:

create view PostCodes as select PostCode from OSCodePoints;
create view image PostCodes;

The view projects just the postcode column out. Creating the view image pins the results from the view into RAM. All queries on the view read data from the view image and ran faster as a result.

I generated check-sums using the md5_checksum, sha1_checksum, sha224_checksum, sha256_checksum, sha384_checksum and sha512_checksum functions. These are all implemented as plugin functions provided by the hashes plugin module. This was loaded and activated:

create module hashes;
alter module hashes set mode active;

I ran the check-sum functions with seven different insert-select queries to populate a set of lookup tables that mapped postcodes to their check-sums. Six of the queries generated a check-sum for one of each of the check-sum functions. The 7th query used all 6 check-sum functions to generate a lookup of table for all 6 check-sums simultaneously. This was done to get some data indicating what the overhead associated with retrieving and creating rows was. The 7 lookup tables were created as ram-only tables. As the name implies these hold the row data in RAM and not on disk and so removed the overhead of writing the rows to disk.

The ram-only table create statements were:

create ram only table PostCodeMD5Sums
(
PostCode char(8),
md5_checksum binary(16)
);


create ram only table PostCodeSHA1Sums
(
PostCode char(8),
sha1_checksum binary(20)
);


create ram only table PostCodeSHA224Sums
(
PostCode char(8),
sha224_checksum binary(28)
);


create ram only table PostCodeSHA256Sums
(
PostCode char(8),
sha256_checksum binary(32)
);


create ram only table PostCodeSHA384Sums
(
PostCode char(8),
sha384_checksum binary(48)
);


create ram only table PostCodeSHA512Sums
(
PostCode char(8),
sha512_checksum binary(64)
);


create ram only table PostCodeCheckSums
(
PostCode char(8) CHARACTER SET LATIN1,
md5_checksum binary(16),
sha1_checksum binary(20),
sha224_checksum binary(28),
sha256_checksum binary(32),
sha384_checksum binary(48),
sha512_checksum binary(64)
);

The insert-select queries were:

insert into PostCodeMD5Sums
select
PostCode,
md5_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA1Sums
select
PostCode,
sha1_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA224Sums
select
PostCode,
sha224_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA256Sums
select
PostCode,
sha256_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA384Sums
select
PostCode,
sha384_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA512Sums
select
PostCode,
sha512_checksum(PostCode)
from
PostCodes
;


insert into PostCodeCheckSums
select
PostCode,
md5_checksum(PostCode),
sha1_checksum(PostCode),
sha224_checksum(PostCode),
sha256_checksum(PostCode),
sha384_checksum(PostCode),
sha512_checksum(PostCode)
from
PostCodes
;

Results And Conclusions

For the 1.7 million rows (1691721 to be exact) the times for the insert-select queries were:

md5_checksum: 0.62s
sha1_checksum: 0.53s
sha224_checksum: 0.63s
sha256_checksum: 0.67s
sha384_checksum: 0.83s
sha512_checksum: 0.86s
all 6 checksums: 2.87s

The results show that the check-sum for each postcode is being calculated in something around 500ns or less. This system could calculate about 170 billion check-sums a day. The Kognitio system was running on a node with two E5506s, giving 8 cores running at 2.13GHz, the node is several years old. It seems obvious to me the just using check-sums and/hashes alone isn’t going to hide data with this system, even more so with faster up-to-date hardware.

Comparing the “all 6 check-sums” insert-select query run time with those of the others did show that there was an appreciable overhead in retrieving and writing the row data. The total time taken to populate 6 lookup tables of just one check-sum value was 1.44 times more than the time taken to populate the one lookup table of all six check-sum functions.

Because I used the Code-Point Open data:
Contains OS data © Crown copyright and database right (2016)

Contains Royal Mail data © Royal Mail copyright and Database right (2016)

Contains National Statistics data © Crown copyright and database right (2016)

Beginner tips for Kognitio scripting

24

Jan
2017
Posted By : Chak Leung Comments are off
Kognitio scripts, lua
Categories :Blog, Kognitio How To

Standard ANSI SQL, whilst a great language for database querying, doesn’t allow for commonly used features like looping or extensive variable usage. There are variants which support some of this e.g. Transact-SQL but not on Kognitio until the introduction of Kognitio scripts.

But 2 languages in one script? How does this work?

Kognitio scripts (also referred to as ‘Kog’ scripts) are based on the Lua language where SQL statements can be written together with Lua code for extra functionality. They allow typical procedural language operations without the need of extensive workarounds in a non-procedural language like SQL.

It can be a bit overwhelming for SQL only users so we’ll go over some useful tips for getting started, for more in depth reading refer to the console scripting guide on the forum.

To work with Kog scripts in Kognitio Console make sure that your scripting pane options tab has settings similar to those below. This will allow you to work effectively in debug mode.

 

Variable and function integration (for management)

Set variables via:

use_schema = ''
use_table = 'example'

To call these arguments in SQL just add a $ before the variable name:

create table $use_schema.$use_table(mycolumn varchar);

This is very useful if users share a generic script to be used on different data sets. Rather than changing the schema/table for every query in the script, managing all the variables at the top of the script makes changing the source data/script parameters set much more organised.

Note that if you were to use variables in a Lua only code then you don’t need to reference them with “$”. Here the variable “a” is used in a loop:

a = 0
for i=1,5 do
    a = a +1;
end

Functions work in a similar way:

function square_num(a)
return a*a;
end
select $square_num(value) from values between 1 and 5;

Remember that Lua is case sensitive where it’s typically written in lower case (“and” is a reserved word but “And” or “AND” could be used as variable names).

The variable name isn’t restricted to alphanumeric characters, however if it doesn’t begin with one then you’ll need to enclose with brackets e.g. “$(_myvar)”. You can also reference a Lua table this way “$(myluatable[1])” which refers to the first column.

When stepping through a Kog script in Kognitio Console, you can always view the currently defined variables via the script variables pane (view > panes > script variables).

Try: Managing variables can be extended further via local variables, try creating two functions which take a global variable as an argument and alter it to suit the function need. Further this by using the global table _G..

Saving results to Lua tables e.g. column names/table names

You may have used the feature “$Col1” in Kognitio’s SQL scripting to retrieve a result from a previously run query. However this limits you to only the first row of results. A solution to this is to save them to a Lua table.

Declaring a table and insert via:

mytable= {};
table.insert(mytable,'element_1');
print(mytable[1]);

To set predefined elements:

mytable {'element_1','element_2'}

To save results from an SQL query:

average_values = select avg(value), avg(value*value) from values between 1 and 10;

Example: Save column names for an SQL table/view into a Lua table

mycolumns = select seq,name 
from SYS.IPE_COLUMN
where table_id in (select table_id 
                   from SYS.IPE_VIEW
                   where schema_id in (select id from SYS.IPE_ALLSCHEMA
                                       where name = '')
                   and name = ''
                   )
order by 1;
print(mycolumns.rows[1].NAME);

This saves the column names from a specified SQL table into a Lua table and prints the name of the first column. Some return values for Lua tables taken from our console scripting guide:

Table Variable Value
numRows Integer, Number of rows in the resultset.
numCols Integer, Number of columns in the resultset.
Status Integer, 0 = OK
colNames Table
Rows Table

Try: You’ll notice that you can also access a Lua tables column names via the colNames table variable so you can save an SQL table into a Lua table and call the colNames variable to get them rather than fetching it from the SYS tables.

Note: Returning results of a SQL query to lua tables should be restricted to small result sets as these are stored in the memeory of your local system where Kognitio console is running.

Looping with Lua and SQL

A basic version is possible in ANSI SQL but becomes a bit limited as you’re only able to work with an incrementing defined counter. In Kog scripting you can perform bread and butter loops like the classic “for” and “while” loops. These let you set index limitations as well as have an index paired alongside a value and iterate over Lua defined constructs like arrays and tables.

for i=1,5 do
    ...    
end
while i <5 do
   ...    
end

The “pairs” notation to include an index:

for a,b in pairs() do
   ...    
end

To integrate this with SQL queries just add them in to the loop, remember to end each query with a semicolon.

for i=1,5 do
    select avg(value) from values between 1 and $i ;
end

Example: Writing SQL automatically via concatenation (when there’s lots to write)

In the previous tip we saw the we can saved the results of an SQL query to a Lua table, we can now iterate through them by calling the table name and adding a suffix for the rows. Let’s take the column names of a previous table for example.

mycolumns = select seq,name 
from SYS.IPE_COLUMN
where table_id in (
                   select table_id from SYS.IPE_VIEW
                   where schema_id in (select id from SYS.IPE_ALLSCHEMA
                                       where name = '')
                   and name = ''
                   )
order by 1
;

And apply it to our concatenation. The aim is to write the SQL syntax for every column to calculate the average and maximums whilst excluding the unique ID column from this.

myschema ='CHAK'
mytable = 'CONCAT_STRINGS_EG '

--Table and view columns are stored in different system tables, 
--toggle with flag below depending on data source
--1= table, 2= view
type_flag = 1
if type_flag==1 then
    table_type = "IPE_TABLE"
    selection = 'id'
elseif type_flag==2 then
    table_type = "IPE_VIEW"
    selection = 'table_id'
end

drop table $myschema.$mytable;
create table $myschema.$mytable 
(id int, sale_price decimal(5,2), amount_paid decimal(5,2), no_items int);

insert into $myschema.$mytable
select 
row_number() over(),value*(value mod 5), ceiling(value*(value mod 5)),generate_key() 
from values between 1 and 5;

mycolumns = select seq,name from "SYS"."IPE_COLUMN"
where table_id in (
    select $selection from "SYS".$table_type
    where schema_id in (select id from "SYS"."IPE_ALLSCHEMA"
                       where name = '$myschema')
    and name = '$mytable'
    )
order by 1
;

col_average = ''
col_maximum = ''
for x,y in pairs(mycolumns.rows) do
   --Skip x=1 because that is the ID and we don't want to calculate metrics for it
    if x==2 then
        col_average = 'avg('..y.NAME..') avg_'..y.NAME
        col_maximum = 'max('..y.NAME..') max_'..y.NAME
    elseif x>2 then
        col_average = col_average..',avg('..y.NAME..') avg_'..y.NAME
        col_maximum = col_maximum..',max('..y.NAME..') max_'..y.NAME
    end
end

select id, $col_average from $myschema.$mytable group by 1;
select id, $col_maximum from $myschema.$mytable group by 1;

This will concatenate each of the 3 column names so that you can insert into another SQL statement later via “$col_average” and “$col_maximum” saving the need to type out each column name especially if this code is required in different queries. Extremely useful when producing a generic version for different datasets as the column names don’t need to be redefined. You could even wrap the whole code in a “for” loop and repeat this for different tables or views.

Try: Note that x is the index and y is the value in the above. Since SQL columns are separated by a comma in the coding, we use the index x to determine the first column and not add a comma in front of it. Conversely you could change it so that it only adds a comma when the index is not the first one.

Kognitio modules (Section off code)

As you build up your code you may find that you have functions which can be useful elsewhere i.e. other scripts where you can just copy them across but this can make scripts unnecessarily verbose. Instead putting them in Kog modules allows you to put them into another file to be called up later when needed.
Firstly you’ll need to set the path to your Kog modules. The example below is for a local machine folder but this will also work for network drives allowing multiple users to contribute and access them.

_G.package.path= _G.package.path.. ";?.kog";

Then you can load any files in there them via “require” similar to the module/library/package loading in other languages:

 scr1 = require "Kog_Module_Script1"
 scr1.run()

The first line designates that module as scr1 and the second runs it. In our example we’re going to run two modules depending on the time of day: the average during the day and the sum at the end of the day.

Getting the times:

select time '09:00:00';
stime = Col1;
print(stime)
select time '17:00:00';
etime =  Col1;
print(etime)
ctime=os.date("%X");
print(ctime)

if ctime > stime and ctime < etime then
    mod1 = require ""
    mod1.run()
else if ctime > etime then
    mod2 = require ""
    mod2.run()
else print("Not ready")
end

Which opens the module:

local x = {}
x.name = "Average Script"
select avg(value) from values between 1 and 5;

or:

local x = {}
x.name = "Sum Script"
select sum(value) from values between 1 and 5;

depending on the time of day. The idea being that you can also define functions in there and build up libraries to call in future development rather than inserting copious amounts of code from past scripts and making it very verbose.

Try: Replace the data source in the SQL scripts with your own data set and then insert a function into each script, starting off by printing the time of day and building up the complexity.

Using Kognitio on Amazon Elastic Map/Reduce

12

Jan
2017
Posted By : Andy MacLean Comments are off
Kognitio on Amazon EMR

Using Kognitio on Amazon Elastic Map Reduce

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.

Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes – the ‘master node’, which runs Hadoop specific cluster master programs like the HDFS namenode and Yarn resource manager, the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it, it gets transfered to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node so that’s how this example will work. There are other possible choices here – you can just use one of the cluster nodes, you can spin up a specific task group node to run it or you can just have an arbitrary EC2 node with the right security settings and client software installed. However the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Setup authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop_extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:

#!/bin/bash

#change the s3 bucket for your site
S3BUCKET=s3://kognitio-development

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

hn=`hostname`
grep -v '<\/configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>\$hn:2181</value>
  </property>
</configuration>
XXX
cp  kodoop/slider/conf/slider-client.xml  kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml  kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression
XXX
EOF

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (You’ll want to change that to be your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This setting enables message compression, which causes messages to get compressed (with the LZ4 compression algorithm) before being sent over a network. This setting is not normally used but we recommend it for Amazon because the network:cpu speed ratio is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have this run automatically as part of the cluster startup. You can do this by transfering the script to S3 and putting it together in a directory with the tarballs (and editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor, select the release you want (we’ve tested it with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper to be selected from the list of packages, although adding others which you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:

[{"classification":"core-site","properties":{"yarn.nodemanager.delete.debug-delay-sec":"3600"}}]

This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can do s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to s3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will now run automatically on the masternode after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node so this can be any instance type you like, the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use the largest of whatever node type instead of a larger number of smaller instances (so 3x r3.8xlarge instead of 6x r3.4xlarge for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want to use more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is to have the total RAM of the nodes which will be used for the Kognitio cluster be between 1.5x and 2x the size of the raw data you want to load as memory images.

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your node. This is required because the AMI used by EMR needs to have a small amount of configuration done and some extra Linux packages installed in order for it to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define this as a ‘custom action’ boostrap action. The following script does everything you need:

#!/bin/bash

sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide us with a simple way to do this. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses this. Amazon runs the bootstrap action before the relevant repositories have been configured on the node so the RPM can’t be installed via yum. By the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.

Once this is working you can create a Kognitio cluster. You will create a number of Yarn containers with a size you specify. You will need to choose a container size, container vcore count and a number of containers that you want to use for the cluster. Normally you’ll want to use a single container per node which uses nearly all of the memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
ip-172-40-0-91.eu-west-1.compute.internal:8041          RUNNING ip-172-40-0-91.eu-west-1.compute.internal:8042                             1
ip-172-40-0-126.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-126.eu-west-1.compute.internal:8042                            2
ip-172-40-0-216.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-216.eu-west-1.compute.internal:8042                            1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status ip-172-40-0-91.eu-west-1.compute.internal:8041
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Node Report : 
        Node-Id : ip-172-40-0-91.eu-west-1.compute.internal:8041
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address : ip-172-40-0-91.eu-west-1.compute.internal:8042
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.

You will also need to choose a name for the cluster which must be 12 characters or less and can only contain lower case letters, numbers and an underscore. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
=================================================================
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
=================================================================
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from www.kognitio.com, install them somewhere, run Kognitio console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue sql locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).

There are now lots of different ways you can set up your server and get data into it but the most common thing to do is to build memory images (typically view images) to run SQL against. This is typically a two step process involving the creation of external tables which pull external data directly into the cluster followed by the creation of view images on top of these to pull data directly from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognito is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example we can connect to a 6 column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If instead it is a directory full of csv files we can use ‘/path/to/file/test/*.csv’ instead to use them all as a single table in Kognitio.

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
'
accesskey YOUR_ACCESS_KEY
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name
';

grant connect on connector aws to grp_aws;
;

This sql loads the Kognitio Amazon plugin, creates a security group to allow access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials then create multiple connectors and grant permission on different ones to appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it. In this case you can just add them to the group grp_aws which has this.

max_connectors_per_node is needed here because the amazon connector gives out of memory errors if you try to run too many instances of it in parallel on each node.

Now an external table can be created in exactly the same way as in the HDFS example. If my amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used images are view images. With a view image the user defines a view in the normal SQL way and then they image it, which makes an in-memory snapshot of the query. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every ram processing task
  • Random — Just put the rows anywhere. This what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100M in size) work best replicated
  • For everything else hash on the primary key except
  • For the biggest images which join to non-replicated tables hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.

Facebook

Twitter

LinkedId