Participate in the Kognitio Console beta test program

10

Mar
2017
Posted By : Michael Atkinson 0 Comment
Kognitio console, beta test program
Categories :Blog, Kognitio How To

Kognitio Console is Kognitio’s client side management program for the Kognitio Analytical Platform.

Some of its features are:

  • It allows inspection of the metadata tree for schemas, tables, views, external scripts, users, connectors, sessions, queues, etc.
  • It also gives an object view to each of these metadata objects, allowing their inspection and management.
  • There are also lots of tools, wizards and widgets to browse data in Hadoop, load and unload data, identify problem queries and many more.
  • There are also a set of reports and dashboards to monitor the state of Kognitio systems.
  • Macros may be written to extend Kognitio Console, the reports and dashboards are written in these XML macros.
  • Ad-Hoc queries may be executed.
  • KogScripts may be executed and debugged. KogScript is based on Lua but has Kognitio specific enhancements to run SQL natively.
  • It is an integrated development environment (IDE) for KogScripts, and external scripts running in Kognitio.

All this makes Kognitio Console a versatile tool, suitable for database admins, analysts and power users.

Kognitio Console is in constant development, beta and release candidate versions are provided from our update website. Beta and release candidates will be announced on the Kognitio Console forums

There are two ways of obtaining these betas:

Note that this is beta software, so has not gone through a full QA. However, we endeavour to make sure it has no obvious bugs and it will have passed our Console smoke tests. If you experience crashes or other bugs or deficiencies in these beta and release candidates, please report them to us, this will help us make the next release more stable.

By installing it into a different location than the default, you can have both this and the last stable releases installed at the same time.

The differences between KogScript running in Console and kogscript.exe

24

Feb
2017
Posted By : Michael Atkinson 0 Comment
KogScript, Lua, SQL, programming, how to, console window
Categories :Blog, Kognitio How To

KogScript is a general purpose programming language based on Lua. It may be run within Console, which acts as an integrated development environment for it, or run from the command line tool kogscript.exe.

The KogScript running within Console and via kogscript.exe are compiled from the same source files and are identical with the following exceptions.
 

sqldebug

In kogscript.exe the global variable sqldebug is set to 3. The sqldebug variable is used to control how much information is output on each sql query within the KogScript as it is executed. A level of 3 shows the query status, and table contents.

In Console sqldebug is not defined, so it acts as if it had a value of 0, this means no information is output to the terminal on sql execution, instead Console has integration of queries within the script view as info and table and stored in the script history.

withinConsole

When running within Console the global variable withinConsole is set to true
 

Console integration

When running within Console there are two extra global tables holding data and methods used to integrate scripts with Console (within the table kog) and access Qt objects and methods (within the table qt).

An example of how to use the qt methods may be found in the Examples tool as “Using Qt from Kog Scripts”. This shows how to open Qt dialogs, and use Qt grids and views.

The kog global table holds functions to access the main elements of the Console application user interface. From those it is possible to access the rest of the user interface. There are about 100 classes and thousands of objects available. To find out which methods, variables and enumerations are available for one of the kog functions use list(), as in:

mw = kog.mainWindow()
list(mw)

Which gives outout like:

 

Loader and Unloader

To run either of these from KogScript, the script needs to be run within Console as they make use of Console integration to run the Loader or Unloader tools built into Console. The example below shows how to run Unloader from within Console.

-- Get a reference to the console main window.
mw = kog.mainWindow();

config =[[
FILENAME=C:/TEMP/simplevector_test.csv
TABLE="MIKE"."A_SIMPLE_VECTOR"
AUTOUNLOAD=1
SILENT=1]];

-- Now run the unloader using the configuration above. 
mw:runUnloader(config)

 

Conclusion

KogScript within Console and the kogscript.exe command line tools are identical as far as running SQL queries and the Lua language.

Where they differ is that within Console there are extra features to enable integration with Console itself.

How to generate an identity / auto_increment column in Kognitio

17

Feb
2017
Posted By : Michael Atkinson Comments are off
Categories :Blog, Kognitio How To

Occasionally we would like the value of the primary key field to be created automatically every time a new record is inserted. That is we would like an auto-incrementable field in a table.

What other databases do

Other databases use a variety of syntax to do this.

MySQL uses an AUTO_INCREMENT keyword, which is used within the table definition

CREATE TABLE fruit (
         ID int NOT NULL AUTO_INCREMENT,
         name varchar(255) NOT NULL,
         PRIMARY KEY (ID) );

 

The SQL Server syntax uses the IDENTITY keyword, again within the table definition, in this example starting at 1 and incrementing by 1.

CREATE TABLE city (
         ID int IDENTITY(1,1) PRIMARY KEY,
         name varchar(255) NOT NULL,
         PRIMARY KEY (ID) );

 

For Oracle you first need to create a sequence and then use its nextval attribute during inserts.

CREATE SEQUENCE seq_people
        MINVALUE 1
        START WITH 1
        INCREMENT BY 1
        CACHE 10;
INSERT INTO People (ID,FirstName,LastName)
VALUES (seq_people.nextval,'Mike','Atkinson');

 

Kognitio’s row_number syntax.

Kognitio does not have a way of creating a unique value for a field directly in the table definition, instead it has two ways of adding unique values to resultsets; and then these resultsets may be inserted into a table. The first uses the ROW_NUMBER function.

The ROW_NUMBER function assigns to each row in the partition or query a sequence number starting from one. This is only a number used in the context of the result set, if the result changes, the ROW_NUMBER will change. The ROW_NUMBER expression takes an ORDER BY statement with the column to be used for the row count with an OVER operator.

In the example below we use ROW_NUMBER to replace the subject name with an id value which is a foreign key reference into the subjects table.

set schema yorschema;
create table facts_origin (subject varchar, predicate varchar, object varchar);
create table subjects (id int, name varchar not null, primary key(id) );
create table facts (subject int, predicate varchar, object varchar, 
               foreign key (subject) references subjects(id) );

insert into facts_origin values ('Peter', 'Brother', 'Sarah'), 
                                                           ('Peter', 'Brother', 'Anne'), 
                                                           ('John', 'Father', 'Peter');


insert into subjects
select with
    ids(name) as (select distinct subject from facts_origin)
select row_number() over (order by name) as "id", name  from ids;

insert into facts
    select s.id, f.predicate, f.object
        from facts_origin f, subjects s
        where f.subject=s.name;

select * from subjects;
select * from facts;
Which gives a subjects table of: And a facts table of:
ID NAME
1 John
2 Peter
SUBJECT PREDICATE OBJECT
1 Father Peter
2 Brother Anne
2 Brother Sarah

 

Kognitio’s generate_key syntax.

For an individual SELECT statement the GENERATE_KEY function generates a unique key for each row. These rows may then be inserted into a table.

The result is an INT8 data type with a value greater than or equal to zero. The results are not typically contiguous, but are guaranteed to be unique for an individual SELECT statement. The values generated are dependent upon the number of Kognitio nodes and the distribution of the data. Rerunning a query may not generate the same results. Multiple occurrences of GENERATE_KEY in a SELECT list will all produce the same result within a single row.

GENERATE_KEY can only be used in the SELECT list, it cannot be used in WHERE, HAVING, GROUP BY or ORDER BY clauses. Within the SELECT list you can perform arithmetic on the columns containing the GENERATE_KEY function.

insert into subjects
select with
    ids(name) as (select distinct subject from facts_origin)
select generate_key(), name  from ids;

The Kognitio SQL Guide has examples of good practice for ensuring that addition INSERT statements have unique keys and how to use them in ETL processes.

 

A more complete example

The following example shows many techniques, it uses both row_number and generate_key functions and illustrates how to use them to add entries to existing tables.

It also illustrates the use of KogScript functions, KogScript variables and a more complex “Grandparent” query.

set schema yourschema;

-- drop tables, they may have been created by previous runs of the script
drop table facts_origin;
drop table facts;
drop table entities;
drop table preds;

-- create tables to use
create table facts_origin (subject varchar, predicate varchar, object varchar);
create table entities (id int, name varchar not null, primary key(id) );
create table preds (id int, name varchar not null, primary key(id) );
create table facts (subject int, predicate int, object int,
               foreign key (subject) references entities(id),
               foreign key (predicate) references preds(id),
               foreign key (object) references entities(id) );

-- add some facts, a small genealogy
insert into facts_origin
     values ('Peter', 'Brother', 'Sarah'), ('Peter', 'Brother', 'Anne'), ('John', 'Father', 'Peter');
insert into facts_origin
     values ('Peter', 'Father', 'James'), ('Jane', 'Mother', 'Peter'), ('Mary', 'Mother', 'John');

-- create the entities table
insert into entities
select with
    ents(name) as (select distinct subject from facts_origin union select object from facts_origin)
select row_number() over (order by name) as "id", name  from ents;


-- create the preds table
insert into preds
select with
    ids(name) as (select distinct predicate from facts_origin)
select generate_key(), name  from ids;

-- create the facts table, this uses references to the entities and preds tables
insert into facts
    select s.id, p.id, o.id
        from facts_origin f, entities s, preds p, entities o
        where f.subject=s.name and f.predicate=p.name and f.object=o.name;

-- check the tables are what we expect
select * from entities;
select * from preds;
select * from facts;

-- Query for all facts where 'Peter' is the subject
with
  -- select the subject id
  s(id) as (select id from entities where name='Peter'),
  -- Filter the facts table for that subject id
  f(subject, predicate, object) as (select subject, predicate, object from facts, s 
                        where subject in (s.id) )
  -- Then join the filter results with the subjects table to get the subject name
select s.name as "SUBJECT", p.name, o.name as "OBJECT"
    from f, entities s, preds p, entities o 
    where f.subject=s.id and f.predicate=p.id and f.object=o.id;


-- Use a function to add entities if they do not already occur in the entities table.
function addEntity(p)
   local t = select id from entities where name='$p';
   if #t.rows~=1  then
       insert into entities
           select mk + 1, '$p' from (select max(id) from entities) dt(mk);
                 t = select id from entities where name='$p';
       assert(#t.rows==1)
   end
   return t.rows[1][1];
end

-- Use a function to add predicates if they do not already occur in the preds table.
function addPred(p)
   local t = select id from preds where name='$p';
   if #t.rows~=1  then
       insert into preds
           select generate_key() + mk + 1, '$p' from (select max(id) from preds) dt(mk);
                 t = select id from preds where name='$p';
       assert(#t.rows==1)
   end
   return t.rows[1][1];
end


-- adding facts
function addFact(s,p,o)
    local sub = addEntity(s) – use local to avoid polluting the global namespace
    local pred = addPred(p)
    local obj = addEntity(o)
    local t = select count(*) as "count" from facts
                               where subject=$sub and predicate=$pred and object=$obj;
    if #t.rows~=1 or t.rows[1].count==0 then
         insert into facts values ($sub,$pred,$obj);
    end

end

addFact('Garry', 'Father', 'Mary')


-- grandparent view.

parent = addPred('Parent')
grandparent = addPred('Grandparent')

-- Now form the query as a view, so that it may be reused easily
create view grandparent
select with
    -- Define the father_of subquery
    father_of(s,p,o) as  (select subject, predicate, object from facts
                 where predicate in (select id from preds where name='Father') ),
    -- Define the mother_of subquery
    mother_of(s,p,o) as (select subject, predicate, object from facts
                 where predicate in (select id from preds where name='Mother') ),
    -- Define the parent_of subquery, this uses the Parent predicate id we have defined above
    parent_of(s,p,o) as  (select s,$parent,o from father_of union
                 select s,$parent,o from mother_of),
    -- Define the grandparent_of subquery, this uses the Grandparent predicate id we 
    -- have defined above
    grandparent_of(s,p,o) as (select p.s, $grandparent, c.o from parent_of p, parent_of c where p.o=c.s)
-- Now output the result in a human readable form
select s.name as "SUBJECT", p.name as "PREDICATE", o.name as "OBJECT"
    from grandparent_of f, entities s, preds p, entities o 
    where f.s=s.id and f.p=p.id and f.o=o.id;


-- Now we can perform various queries
-- Who are John's grandchildren?
select object as "Grandchild" from grandparent where subject='John';

-- Who are Peter's grandparents?
select subject as "Grandparent" from grandparent where object='Peter';

-- Is Jane the grandparent of James?
select 'Yes' from grandparent where subject='Jane' and object='James';

And here is the entity diagram for the facts table

 

Summary

Various databases have different ways of creating unique values in a field. The Kognitio Analytical Platform has two methods GENERATE_KEY and ROW_NUMBER, both of which generate unique values in a subquery that may then be inserted into a table, giving a column in the table with a unique value as a result.

Kits, cats, sacks, wives: unfolding JSON data into tables

14

Feb
2017
Posted By : Graeme Cole Comments are off
unfolding json data into tables, csv
Categories :Blog, Kognitio How To
Tags :  ,,

Each wife had seven sacks…

Converting simple, flat JSON data into CSV or another tabular format is a problem addressed by many freely-available JSON-to-CSV converters. However, these are usually only useful when the JSON is a series of flat objects without any nested arrays.

In practice, real-world JSON files often have nested arrays containing more objects, each of which may contain arrays of their own. They tend to look like the old riddle As I was going to St Ives:

As I was going to St. Ives,
I met a man with seven wives.
Each wife had seven sacks,
Each sack had seven cats,
Each cat had seven kits.
Kits, cats, sacks, wives,
How many were there going to St. Ives?

(more…)

What’s new in Kognitio Console version 80199s170123

07

Feb
2017
Posted By : Michael Atkinson Comments are off
Whats new in Kognitio Console
Categories :Blog, Kognitio How To

SQL in expression contexts

 Up until this release SQL could only occur in Lua statements or assignments. For example in Kognitio Scripts we can write SQL within the script as a statement or assigning the results table to a variable.

drop table hr.employee;
t = select * from sys.ipe_user; 

This release of Kognitio Console adds the ability to place SQL in Lua expression contexts. So now within Kognitio Scripts we can place SQL in most places where a Lua expression may occur.

In the example below the data for the Kognitio users are read from the table sys.ipe_user as a Lua table. The rows element of that table is then passed to the apply function which passes each row in turn to the printRow function. The printRow function prints the NAME element of the row to the console, which is attached to the Kognitio Console Script Variables pane.

function apply(t, fn)
    for i, row in pairs(t) do fn(row) end
end
 
function printRow(row) print(row.NAME) end
 
apply( (select * from sys.ipe_user).rows, printRow) 

This feature currently has the limitation that it is not easy for the parser to determine where the SQL query ends. The code below fails because the ‘,’ character does not end an SQL query so it thinks that the , printRow is part of the query. The failure is at run time when the SQL is parsed in the Kognitio Analytical Platform and the SQL select * from sys.ipe_user, printRow errors.

function apply(t, fn)
    for i, row in pairs(t.rows) do fn(row) end
end
 
function printRow(row) print(row.NAME) end
 
apply(select * from sys.ipe_user, printRow) -- ERRORS: , does not end the SQL query

Instead we use ‘;’ to end the SQL expression:

apply(select * from sys.ipe_user; printRow) – Note use of ; to end SQL

Of particular use while debugging scripts is the form:

– any expression of the form print(<sql query>), e.g.
print(select * from sys.ipe_user) 

which outputs the formatted query results to Kognitio Console “Script Variables” pane.

Lua lfs module

The Lua File System module adds common file system manipulation commands to Lua. The lfs module is adapted from the Kepler project  and offers a portable way to access the underlying directory structure and file attributes.

lfs.attributes (filepath [, aname | atable])

Returns a table with the file attributes corresponding to filepath (or nil followed by an error message in case of error). If the second optional argument is given and is a string, then only the value of the named
attribute is returned (this use is equivalent to
 lfs.attributes(filepath)[aname], but the table is not created and only one attribute is retrieved from the O.S.). if a table is passed as the second argument, it is filled with attributes and returned instead of a new table. The attributes are described as follows; attribute mode is a string, all the others are numbers, and the time related attributes use the same time reference of os.time:

dev

on Unix systems, this represents the device that the inode resides on. On Windows systems, represents the drive number of the disk containing the file

ino

on Unix systems, this represents the inode number. On Windows systems this has no meaning

mode

string representing the associated protection mode (the values could be file, directory, link, socket, named pipe, char device, block device or other)

nlink

number of hard links to the file

uid

user-id of owner (Unix only, always 0 on Windows)

gid

group-id of owner (Unix only, always 0 on Windows)

rdev

on Unix systems, represents the device type, for special file inodes. On Windows systems represents the same as dev

access

time of last access

modification

time of last data modification

change

time of last file status change

size

file size, in bytes

permissions

file permissions string

blocks

block allocated for file; (Unix only)

blksize

optimal file system I/O blocksize; (Unix only)

This function uses stat internally thus if the given filepath is a symbolic link, it is followed (if it points to another link the chain is followed recursively) and the information is about the file it refers to. To obtain information about the link itself, see function lfs.symlinkattributes.

lfs.chdir (path)

Changes the current working directory to the given path.

Returns true in case of success or nil plus an error string.

lfs.currentdir ()

Returns a string with the current working directory or nil plus an error string.

iter, dir_obj = lfs.dir (path)

Lua iterator over the entries of a given directory. Each time the iterator is called with dir_obj it returns a directory entry’s name as a string, or nil if there are no more entries. You can also iterate by calling dir_obj:next(), and explicitly close the directory before the iteration finished with dir_obj:close(). Raises an error if path is not a directory.

lfs.link (old, new[, symlink])

Creates a link. The first argument is the object to link to and the second is the name of the link. If the optional third argument is true, the link will by a symbolic link (by default, a hard link is created).

lfs.lock (filehandle, mode[, start[, length]])

Locks a file or a part of it. This function works on open files; the file handle should be specified as the first argument. The string mode could be either r (for a read/shared lock) or w (for a write/exclusive lock). The optional arguments start and length can be used to specify a starting point and its length; both should be numbers.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

fs.lock_dir (path, [seconds_stale])

Creates a lockfile (called lockfile.lfs) in path if it does not exist and returns the lock. If the lock already exists checks if it’s stale, using the second parameter (default for the second parameter is INT_MAX, which in practice means the lock will never be stale. To free the the lock call lock:free()

In case of any errors it returns nil and the error message. In particular, if the lock exists and is not stale it returns the “File exists” message.

lfs.mkdir (dirname)

Creates a new directory. The argument is the name of the new directory.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.rmdir (dirname)

Removes an existing directory. The argument is the name of the directory.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.setmode (file, mode)

Sets the writing mode for a file. The mode string can be either “binary” or “text”.

Returns true followed the previous mode string for the file, or nil followed by an error string in case of errors. On non-Windows platforms, where the two modes are identical, setting the mode has no effect, and the mode is always returned as binary.

lfs.symlinkattributes (filepath [, aname])

Identical to lfs.attributes except that it obtains information about the link itself (not the
file it refers to). On Windows this function does not yet support links, and is identical to
 lfs.attributes.

lfs.touch (filepath [, atime [, mtime]])

Set access and modification times of a file. This function is a bind to utime function.

The first argument is the filename, the second argument (atime) is the access time, and the third argument (mtime) is the modification time. Both times are provided in seconds (which should be
generated with Lua standard function 
os.time). If the modification time is omitted, the access time provided is used; if both times are omitted, the current time is used.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

lfs.unlock (filehandle[, start[, length]])

Unlocks a file or a part of it. This function works on open files; the file handle should be specified as the first argument. The optional arguments start andlength can be used to specify a starting point and its length; both should be numbers.

Returns true if the operation was successful; in case of error, it returns nil plus an error string.

Improvements to Kognitio Script execution and debugging

Metatables with a __tostring method have been added to the table returned by an SQL query. This allows formatted printing of those tables.

t = select * from sys.ipe_user;
print(t) –[[ prints the text below:
|NAME         |ID|DFLT_SCNO|SEC_CLASS|SCH_CLASS|STATUS|PTIME              |
|SYS          |0 |0        |0        |0        |0     |2016-12-21 11:19:57|
|GRP_LOG      |3 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_MONITOR  |1 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_DISKUSAGE|4 |0        |0        |0        |8     |2016-12-21 11:20:15|
|GRP_DBG      |2 |0        |0        |0        |8     |2016-12-21 11:20:15|
]]

This may also be used in the kogscript.exe command line tool.

The SQL error string is added to the table returned from the SQL query, and this is also printed out by the metatable __tostring method

t = select * from nonexistent;
if t.status ~= 0 then 
    print("Query failed: " .. t.error) --[[ prints the text below:
    Query failed: ERROR: [Kognitio][WX2 Driver][latest] CI3013: Table SYS.NONEXISTENT does not exist, offset 15
    ]]
end
 
print(t) –[[ prints the text below:
ERROR: [Kognitio][WX2 Driver][latest] CI3013: Table SYS.NONEXISTENT does not exist, offset 15
]]

Breakpoints may now also be placed on Lua statements within the Kognitio Script, even if the “Trace Kog Script Lines” option is not set. After the breakpoint is hit, single stepping then steps through each Lua line.

Timestamps are now converted to a table with year, month, day, hour, min and sec fields:

t = select * from sys.ipe_allview_img;
list(t.rows[1].CREATE_TIME) –[[ prints the text below:
lua::number day = 16
lua::number min = 45
lua::number month = 1
lua::number year = 2017
lua::number hour = 10
lua::number sec = 34
]]

Playing Around With Kognitio On Hadoop: Hashes Cannot Always Anonymise Data

30

Jan
2017
Posted By : Chris Barratt Comments are off
kognitio data scripts how to

In this brave new world of big data one has to be careful of the loss of privacy with the public release of large data sets. Recently I reread “On Taxis And and Rainbows” that presents an account of how the author came to de-anonymise a data set made available by New York City’s Taxi and Limousine Commission. This data set included logs of each journey made: locations and times of pickups and drop-offs along with supposedly anonymised medallion numbers and other data. The author came to identify the medallion number data as the MD5 check-sums of the medallion numbers. They were then able to compute a table mapping 22 million possible medallion numbers to their MD5 check-sums. They reported that this took less than 2 minutes. This table allowed them to start ‘de-anonymising’ the data.

The article identified that this was possible because of the nature of the data. The set of all possible medallion numbers was known and small enough for a table mapping from the data for check-sums to be generated for each value in a feasible amount time.

Computing a large number of check-sums should be easy for a Kognitio system to handle. As several check-sum functions are implemented as SQL functions in the Kognitio system, I wondered how quickly Kognitio could generate a table mapping some potentially personally identifying information to check-sums using the functions.

As every UK address has a postcode I thought that postcodes would form a good candidate for a piece of personally identifying information. They are roughly equivalent to US ZIP codes. It seems plausible to me that someone might attempt anonymise postcodes using check-sums. So I thought I would try using Kognitio to compute tables of postcodes and their check-sums to see how quickly such an anonymisation strategy might be broken.

I found an almost complete list of valid postcodes in the “Code-Point Open” data set. Handily, this is freely available from the Ordnance Survey under the Open Government Licence. The data set contains about 1.7 million records each with a postcode field and other fields for related geographical data. The field titles are:

Postcode,Positional_quality_indicator,Eastings,Northings,Country_code,NHS_regional_HA_code,NHS_HA_code,Admin_county_code,Admin_district_code,Admin_ward_code

The data set is provided as a zip archive containing 120 CSV files with a total uncompressed size of 154MiB.

Getting The Kognitio System Check-sum Functions Timed

For the check-summing task I created a new Kognitio on Hadoop system (the ‘Kognitio system’ from here on) to run on a small developmental Hadoop cluster that I have access to. I copied the CSV files into a new HDFS directory on the cluster running the Kognitio system.

For all the queries/SQL commands I used the ‘sys’ superuser of the Kognitio system. I was the only one who was going to use the Kognitio system so there was no need to protect data from anyone but myself and I could avoid setting up unneeded users and privileges for anything in the system.

I created an external table to give the Kognitio system access to the CSV data on HDFS. This kind of table can’t be written to but can be queried like any other ‘regular’ table on the Kognitio system. Then I created a view on the external table to project only the postcodes. I followed this by creating a view image on the view to pin the postcode only rows of the view in memory. It is faster to access rows in memory than anywhere else.

Then I created several ram-only tables to store postcodes with their check-sums. Ram-only tables as the name suggests do not store row data on disk and so using them avoids any slow disk operation overheads.

Finally I ran and timed a series of insert-select queries that computed the check-sums for our postcodes and inserted them into the ram-only tables.

Some Background On External Tables, Connectors and Plugin Modules

Kognitio’s external tables present data from external sources as a set of records of fields using a connector. The list of fields and the connector are specified by the create external table statement.

The connector in turn specifies the data provider to use along with any connector specific configuration parameter settings required. The data provider implements the interface to the external data source for the internals of the Kognitio system. Data providers are also known as connectors but I wanted to avoid potential confusion from shared names.

The data provider that I used to access the HDFS data is implemented as a plugin connector provided by a plugin module. Plugin Modules are how Kognitio system implements some features that are not in the core code. Loading and activating the plugin modules is required to make their features available to the Kognitio system. Plugin modules can also present parameters used for any module specific configuration required.

Creating The External Table: Making the Code-Point Open Data Accessible to the Kognitio System

The external table that I needed to access the data on the HDFS system required a connector that used the HDFS data provider. This is implemented as a plugin connector from the Hadoop plugin module. So I loaded the Hadoop plugin module, made it active and set the libhdfs and hadoop_client plugin module parameters.

create module hadoop;
alter module hadoop set mode active;
alter module hadoop set parameter libhdfs to '/opt/cloudera/parcels/CDH/lib64/libhdfs.so';
alter module hadoop set parameter hadoop_client to '/opt/cloudera/parcels/CDH/bin/hadoop';

The parameters indicate to the HDFS data provider where the libhdfs library and hadoop executable are on my Hadoop cluster.

Then I could create the connector that was needed:

create connector hdfs_con source hdfs target ' namenode hdfs://172.30.251.105:8020, bitness 64 ';

The create connector statement specifies where the namenode is and to use 64 bit mode, which is best.

Then I could create the external table:

create external table OSCodePoints
(
PostCode char(8),
Positional_quality_indicator varchar(32000),
Eastings varchar(32000),
Northings varchar(32000),
Country_code varchar(32000),
NHS_regional_HA_code varchar(32000),
NHS_HA_code varchar(32000),
Admin_county_code varchar(32000),
Admin_district_code varchar(32000),
Admin_ward_code varchar(32000)
)
from hdfs_con target
'file "/user/kodoopdev1/OSCodePoint2016Nov/*.csv"'
;

The ‘file’ parameter specifies which HDFS files contain the data I want to access.

I created a view with a view image:

create view PostCodes as select PostCode from OSCodePoints;
create view image PostCodes;

The view projects just the postcode column out. Creating the view image pins the results from the view into RAM. All queries on the view read data from the view image and ran faster as a result.

I generated check-sums using the md5_checksum, sha1_checksum, sha224_checksum, sha256_checksum, sha384_checksum and sha512_checksum functions. These are all implemented as plugin functions provided by the hashes plugin module. This was loaded and activated:

create module hashes;
alter module hashes set mode active;

I ran the check-sum functions with seven different insert-select queries to populate a set of lookup tables that mapped postcodes to their check-sums. Six of the queries generated a check-sum for one of each of the check-sum functions. The 7th query used all 6 check-sum functions to generate a lookup of table for all 6 check-sums simultaneously. This was done to get some data indicating what the overhead associated with retrieving and creating rows was. The 7 lookup tables were created as ram-only tables. As the name implies these hold the row data in RAM and not on disk and so removed the overhead of writing the rows to disk.

The ram-only table create statements were:

create ram only table PostCodeMD5Sums
(
PostCode char(8),
md5_checksum binary(16)
);


create ram only table PostCodeSHA1Sums
(
PostCode char(8),
sha1_checksum binary(20)
);


create ram only table PostCodeSHA224Sums
(
PostCode char(8),
sha224_checksum binary(28)
);


create ram only table PostCodeSHA256Sums
(
PostCode char(8),
sha256_checksum binary(32)
);


create ram only table PostCodeSHA384Sums
(
PostCode char(8),
sha384_checksum binary(48)
);


create ram only table PostCodeSHA512Sums
(
PostCode char(8),
sha512_checksum binary(64)
);


create ram only table PostCodeCheckSums
(
PostCode char(8) CHARACTER SET LATIN1,
md5_checksum binary(16),
sha1_checksum binary(20),
sha224_checksum binary(28),
sha256_checksum binary(32),
sha384_checksum binary(48),
sha512_checksum binary(64)
);

The insert-select queries were:

insert into PostCodeMD5Sums
select
PostCode,
md5_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA1Sums
select
PostCode,
sha1_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA224Sums
select
PostCode,
sha224_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA256Sums
select
PostCode,
sha256_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA384Sums
select
PostCode,
sha384_checksum(PostCode)
from
PostCodes
;


insert into PostCodeSHA512Sums
select
PostCode,
sha512_checksum(PostCode)
from
PostCodes
;


insert into PostCodeCheckSums
select
PostCode,
md5_checksum(PostCode),
sha1_checksum(PostCode),
sha224_checksum(PostCode),
sha256_checksum(PostCode),
sha384_checksum(PostCode),
sha512_checksum(PostCode)
from
PostCodes
;

Results And Conclusions

For the 1.7 million rows (1691721 to be exact) the times for the insert-select queries were:

md5_checksum: 0.62s
sha1_checksum: 0.53s
sha224_checksum: 0.63s
sha256_checksum: 0.67s
sha384_checksum: 0.83s
sha512_checksum: 0.86s
all 6 checksums: 2.87s

The results show that the check-sum for each postcode is being calculated in something around 500ns or less. This system could calculate about 170 billion check-sums a day. The Kognitio system was running on a node with two E5506s, giving 8 cores running at 2.13GHz, the node is several years old. It seems obvious to me the just using check-sums and/hashes alone isn’t going to hide data with this system, even more so with faster up-to-date hardware.

Comparing the “all 6 check-sums” insert-select query run time with those of the others did show that there was an appreciable overhead in retrieving and writing the row data. The total time taken to populate 6 lookup tables of just one check-sum value was 1.44 times more than the time taken to populate the one lookup table of all six check-sum functions.

Because I used the Code-Point Open data:
Contains OS data © Crown copyright and database right (2016)

Contains Royal Mail data © Royal Mail copyright and Database right (2016)

Contains National Statistics data © Crown copyright and database right (2016)

Beginner tips for Kognitio scripting

24

Jan
2017
Posted By : Chak Leung Comments are off
Kognitio scripts, lua
Categories :Blog, Kognitio How To

Standard ANSI SQL, whilst a great language for database querying, doesn’t allow for commonly used features like looping or extensive variable usage. There are variants which support some of this e.g. Transact-SQL but not on Kognitio until the introduction of Kognitio scripts.

But 2 languages in one script? How does this work?

Kognitio scripts (also referred to as ‘Kog’ scripts) are based on the Lua language where SQL statements can be written together with Lua code for extra functionality. They allow typical procedural language operations without the need of extensive workarounds in a non-procedural language like SQL.

It can be a bit overwhelming for SQL only users so we’ll go over some useful tips for getting started, for more in depth reading refer to the console scripting guide on the forum.

To work with Kog scripts in Kognitio Console make sure that your scripting pane options tab has settings similar to those below. This will allow you to work effectively in debug mode.

 

Variable and function integration (for management)

Set variables via:

use_schema = ''
use_table = 'example'

To call these arguments in SQL just add a $ before the variable name:

create table $use_schema.$use_table(mycolumn varchar);

This is very useful if users share a generic script to be used on different data sets. Rather than changing the schema/table for every query in the script, managing all the variables at the top of the script makes changing the source data/script parameters set much more organised.

Note that if you were to use variables in a Lua only code then you don’t need to reference them with “$”. Here the variable “a” is used in a loop:

a = 0
for i=1,5 do
    a = a +1;
end

Functions work in a similar way:

function square_num(a)
return a*a;
end
select $square_num(value) from values between 1 and 5;

Remember that Lua is case sensitive where it’s typically written in lower case (“and” is a reserved word but “And” or “AND” could be used as variable names).

The variable name isn’t restricted to alphanumeric characters, however if it doesn’t begin with one then you’ll need to enclose with brackets e.g. “$(_myvar)”. You can also reference a Lua table this way “$(myluatable[1])” which refers to the first column.

When stepping through a Kog script in Kognitio Console, you can always view the currently defined variables via the script variables pane (view > panes > script variables).

Try: Managing variables can be extended further via local variables, try creating two functions which take a global variable as an argument and alter it to suit the function need. Further this by using the global table _G..

Saving results to Lua tables e.g. column names/table names

You may have used the feature “$Col1” in Kognitio’s SQL scripting to retrieve a result from a previously run query. However this limits you to only the first row of results. A solution to this is to save them to a Lua table.

Declaring a table and insert via:

mytable= {};
table.insert(mytable,'element_1');
print(mytable[1]);

To set predefined elements:

mytable {'element_1','element_2'}

To save results from an SQL query:

average_values = select avg(value), avg(value*value) from values between 1 and 10;

Example: Save column names for an SQL table/view into a Lua table

mycolumns = select seq,name 
from SYS.IPE_COLUMN
where table_id in (select table_id 
                   from SYS.IPE_VIEW
                   where schema_id in (select id from SYS.IPE_ALLSCHEMA
                                       where name = '')
                   and name = ''
                   )
order by 1;
print(mycolumns.rows[1].NAME);

This saves the column names from a specified SQL table into a Lua table and prints the name of the first column. Some return values for Lua tables taken from our console scripting guide:

Table Variable Value
numRows Integer, Number of rows in the resultset.
numCols Integer, Number of columns in the resultset.
Status Integer, 0 = OK
colNames Table
Rows Table

Try: You’ll notice that you can also access a Lua tables column names via the colNames table variable so you can save an SQL table into a Lua table and call the colNames variable to get them rather than fetching it from the SYS tables.

Note: Returning results of a SQL query to lua tables should be restricted to small result sets as these are stored in the memeory of your local system where Kognitio console is running.

Looping with Lua and SQL

A basic version is possible in ANSI SQL but becomes a bit limited as you’re only able to work with an incrementing defined counter. In Kog scripting you can perform bread and butter loops like the classic “for” and “while” loops. These let you set index limitations as well as have an index paired alongside a value and iterate over Lua defined constructs like arrays and tables.

for i=1,5 do
    ...    
end
while i <5 do
   ...    
end

The “pairs” notation to include an index:

for a,b in pairs() do
   ...    
end

To integrate this with SQL queries just add them in to the loop, remember to end each query with a semicolon.

for i=1,5 do
    select avg(value) from values between 1 and $i ;
end

Example: Writing SQL automatically via concatenation (when there’s lots to write)

In the previous tip we saw the we can saved the results of an SQL query to a Lua table, we can now iterate through them by calling the table name and adding a suffix for the rows. Let’s take the column names of a previous table for example.

mycolumns = select seq,name 
from SYS.IPE_COLUMN
where table_id in (
                   select table_id from SYS.IPE_VIEW
                   where schema_id in (select id from SYS.IPE_ALLSCHEMA
                                       where name = '')
                   and name = ''
                   )
order by 1
;

And apply it to our concatenation. The aim is to write the SQL syntax for every column to calculate the average and maximums whilst excluding the unique ID column from this.

myschema ='CHAK'
mytable = 'CONCAT_STRINGS_EG '

--Table and view columns are stored in different system tables, 
--toggle with flag below depending on data source
--1= table, 2= view
type_flag = 1
if type_flag==1 then
    table_type = "IPE_TABLE"
    selection = 'id'
elseif type_flag==2 then
    table_type = "IPE_VIEW"
    selection = 'table_id'
end

drop table $myschema.$mytable;
create table $myschema.$mytable 
(id int, sale_price decimal(5,2), amount_paid decimal(5,2), no_items int);

insert into $myschema.$mytable
select 
row_number() over(),value*(value mod 5), ceiling(value*(value mod 5)),generate_key() 
from values between 1 and 5;

mycolumns = select seq,name from "SYS"."IPE_COLUMN"
where table_id in (
    select $selection from "SYS".$table_type
    where schema_id in (select id from "SYS"."IPE_ALLSCHEMA"
                       where name = '$myschema')
    and name = '$mytable'
    )
order by 1
;

col_average = ''
col_maximum = ''
for x,y in pairs(mycolumns.rows) do
   --Skip x=1 because that is the ID and we don't want to calculate metrics for it
    if x==2 then
        col_average = 'avg('..y.NAME..') avg_'..y.NAME
        col_maximum = 'max('..y.NAME..') max_'..y.NAME
    elseif x>2 then
        col_average = col_average..',avg('..y.NAME..') avg_'..y.NAME
        col_maximum = col_maximum..',max('..y.NAME..') max_'..y.NAME
    end
end

select id, $col_average from $myschema.$mytable group by 1;
select id, $col_maximum from $myschema.$mytable group by 1;

This will concatenate each of the 3 column names so that you can insert into another SQL statement later via “$col_average” and “$col_maximum” saving the need to type out each column name especially if this code is required in different queries. Extremely useful when producing a generic version for different datasets as the column names don’t need to be redefined. You could even wrap the whole code in a “for” loop and repeat this for different tables or views.

Try: Note that x is the index and y is the value in the above. Since SQL columns are separated by a comma in the coding, we use the index x to determine the first column and not add a comma in front of it. Conversely you could change it so that it only adds a comma when the index is not the first one.

Kognitio modules (Section off code)

As you build up your code you may find that you have functions which can be useful elsewhere i.e. other scripts where you can just copy them across but this can make scripts unnecessarily verbose. Instead putting them in Kog modules allows you to put them into another file to be called up later when needed.
Firstly you’ll need to set the path to your Kog modules. The example below is for a local machine folder but this will also work for network drives allowing multiple users to contribute and access them.

_G.package.path= _G.package.path.. ";?.kog";

Then you can load any files in there them via “require” similar to the module/library/package loading in other languages:

 scr1 = require "Kog_Module_Script1"
 scr1.run()

The first line designates that module as scr1 and the second runs it. In our example we’re going to run two modules depending on the time of day: the average during the day and the sum at the end of the day.

Getting the times:

select time '09:00:00';
stime = Col1;
print(stime)
select time '17:00:00';
etime =  Col1;
print(etime)
ctime=os.date("%X");
print(ctime)

if ctime > stime and ctime < etime then
    mod1 = require ""
    mod1.run()
else if ctime > etime then
    mod2 = require ""
    mod2.run()
else print("Not ready")
end

Which opens the module:

local x = {}
x.name = "Average Script"
select avg(value) from values between 1 and 5;

or:

local x = {}
x.name = "Sum Script"
select sum(value) from values between 1 and 5;

depending on the time of day. The idea being that you can also define functions in there and build up libraries to call in future development rather than inserting copious amounts of code from past scripts and making it very verbose.

Try: Replace the data source in the SQL scripts with your own data set and then insert a function into each script, starting off by printing the time of day and building up the complexity.

Using Kognitio on Amazon Elastic Map/Reduce

12

Jan
2017
Posted By : Andy MacLean Comments are off
Kognitio on Amazon EMR

Using Kognitio on Amazon Elastic Map Reduce

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.

Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes – the ‘master node’, which runs Hadoop specific cluster master programs like the HDFS namenode and Yarn resource manager, the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it, it gets transfered to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node so that’s how this example will work. There are other possible choices here – you can just use one of the cluster nodes, you can spin up a specific task group node to run it or you can just have an arbitrary EC2 node with the right security settings and client software installed. However the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Setup authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop_extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:

#!/bin/bash

#change the s3 bucket for your site
S3BUCKET=s3://kognitio-development

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

hn=`hostname`
grep -v '<\/configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>\$hn:2181</value>
  </property>
</configuration>
XXX
cp  kodoop/slider/conf/slider-client.xml  kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml  kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression
XXX
EOF

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (You’ll want to change that to be your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This setting enables message compression, which causes messages to get compressed (with the LZ4 compression algorithm) before being sent over a network. This setting is not normally used but we recommend it for Amazon because the network:cpu speed ratio is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have this run automatically as part of the cluster startup. You can do this by transfering the script to S3 and putting it together in a directory with the tarballs (and editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor, select the release you want (we’ve tested it with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper to be selected from the list of packages, although adding others which you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:

[{"classification":"core-site","properties":{"yarn.nodemanager.delete.debug-delay-sec":"3600"}}]

This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can do s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to s3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will now run automatically on the masternode after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node so this can be any instance type you like, the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use the largest of whatever node type instead of a larger number of smaller instances (so 3x r3.8xlarge instead of 6x r3.4xlarge for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want to use more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is to have the total RAM of the nodes which will be used for the Kognitio cluster be between 1.5x and 2x the size of the raw data you want to load as memory images.

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your node. This is required because the AMI used by EMR needs to have a small amount of configuration done and some extra Linux packages installed in order for it to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define this as a ‘custom action’ boostrap action. The following script does everything you need:

#!/bin/bash

sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide us with a simple way to do this. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses this. Amazon runs the bootstrap action before the relevant repositories have been configured on the node so the RPM can’t be installed via yum. By the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.

Once this is working you can create a Kognitio cluster. You will create a number of Yarn containers with a size you specify. You will need to choose a container size, container vcore count and a number of containers that you want to use for the cluster. Normally you’ll want to use a single container per node which uses nearly all of the memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
ip-172-40-0-91.eu-west-1.compute.internal:8041          RUNNING ip-172-40-0-91.eu-west-1.compute.internal:8042                             1
ip-172-40-0-126.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-126.eu-west-1.compute.internal:8042                            2
ip-172-40-0-216.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-216.eu-west-1.compute.internal:8042                            1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status ip-172-40-0-91.eu-west-1.compute.internal:8041
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Node Report : 
        Node-Id : ip-172-40-0-91.eu-west-1.compute.internal:8041
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address : ip-172-40-0-91.eu-west-1.compute.internal:8042
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.

You will also need to choose a name for the cluster which must be 12 characters or less and can only contain lower case letters, numbers and an underscore. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
=================================================================
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
=================================================================
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from www.kognitio.com, install them somewhere, run Kognitio console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue sql locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).

There are now lots of different ways you can set up your server and get data into it but the most common thing to do is to build memory images (typically view images) to run SQL against. This is typically a two step process involving the creation of external tables which pull external data directly into the cluster followed by the creation of view images on top of these to pull data directly from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognito is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example we can connect to a 6 column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If instead it is a directory full of csv files we can use ‘/path/to/file/test/*.csv’ instead to use them all as a single table in Kognitio.

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
'
accesskey YOUR_ACCESS_KEY
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name
';

grant connect on connector aws to grp_aws;
;

This sql loads the Kognitio Amazon plugin, creates a security group to allow access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials then create multiple connectors and grant permission on different ones to appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it. In this case you can just add them to the group grp_aws which has this.

max_connectors_per_node is needed here because the amazon connector gives out of memory errors if you try to run too many instances of it in parallel on each node.

Now an external table can be created in exactly the same way as in the HDFS example. If my amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used images are view images. With a view image the user defines a view in the normal SQL way and then they image it, which makes an in-memory snapshot of the query. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every ram processing task
  • Random — Just put the rows anywhere. This what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100M in size) work best replicated
  • For everything else hash on the primary key except
  • For the biggest images which join to non-replicated tables hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.

Using external scripts to create a table with random test data

02

Dec
2016
Posted By : admin Comments are off
external scripts creating tables random test data
Categories :Blog, Kognitio How To

As a database tester, sometimes there’s a need to create tables with randomised data for testing. This could be because there are security and privacy issues preventing the use of production data, or it could help in testing for robustness against unusual characters, integer ranges and so on.  There are various ways to achieve this. One flexible method is to use Kognitio’s external scripting feature.

External scripts are Kognitio’s way of passing data to and from other Linux environments such as Python, Bash and R, described in section 10 of the Kognitio Guide (you can download the documentation from http://www.kognitio.com/forums/viewtopic.php?f=2&t=3/).

If external scripting for Python is already enabled in a Kognitio system, then the following SQL will create a table with randomised data. The SQL and embedded Python code can be modified to specify the number and type of columns, the number of rows, and any desired ranges or characteristics for the data.


create table t4 as select * from
(external script using environment python
sends (a int, b varchar, c date, d timestamp) -- these fields must be consistent with the fields in each row returned by Python
limit 1 threads -- remove this setting to produce larger amounts of data, but be aware of how many threads will run and how the data size will multiply!
script '
import csv, sys, random, string, time
csvwriter = csv.writer(sys.stdout)
for i in range(100): # we want 100 rows of random data
          row = []
          # first field should be an integer between 100 and 1000 (or bigint, tinyint):
          row.append( random.randrange( 100, 1000 ) )
          # then a string field of length 20(varchar, char, etc):
          row.append( "".join(random.SystemRandom().choice(string.ascii_uppercase + string.digits) for _ in range(20)))
          # a date field:
row.append(time.strftime("%Y-%m-%d", time.localtime(random.randrange(1000000000, 1150000000))))
          # and a timestamp field:
row.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(random.randrange(1000000000, 1150000000))))
          # send the finished row back to Kognitio:
csvwriter.writerow(row)
'
) fd

 

 

The above SQL produces a table with rows like:

468, 6O6SZVX2NQVWT0VYV69O, 2004-03-09, 2004-10-29, 17:28:35.000000

In the event of errors, it may be helpful to review section 10.4 of the Kognitio Guide, or contact Kognitio support.

Providing good problem reports

21

Nov
2016
Posted By : admin Comments are off
210716 - Image credit: iStockphoto/cifotart
Categories :Blog, Kognitio How To

Why bother creating a good problem report?

The better the problem report is, the faster your problem is likely to be solved.

In the absence of a good report, whoever is trying to resolve your problem will probably come back to you and ask for more information. The more iterations there are of this nature, the more time will elapse before the problem is understood, and can be resolved.

 

Where to create problem reports

All users of Kognitio software can report problems by posting on the community forums at http://kognitio.com/forums – remember these are visible to everyone, so don’t post confidential information.

Alternatively, customers paying for support can:

  • create a case on the support portal
  • email the appropriate helpdesk alias at Kognitio
  • phone the Kognitio helpdesk for critical problems – this is the best method outside normal working hours, as other methods are not monitored in these times. Note that customers need to have a 24×7 support package to call outside normal working hours

 

How to create good problem reports

There are many articles on the internet about how to create good bug reports, such as:

The same general rules apply for reporting issues to Kognitio.

 

So what are the basic rules for a problem report?

  1. Specify details of the Kognitio system concerned:
    • the system_id if you are a paying customer
    • the specs of the system otherwise
  2. Clarify when the problem started
    • this allows us to review logs for changes that have occurred leading up to that point in time
  3. Explain which queries are affected
    • is it all queries, or just some queries
    • if just some queries, what do that have in common – are they all by a particular user, or do they all try to put images into RAM, or do they all reference a common object?
  4. If possible, provide evidence of when things were last OK, and information on any changes between then and when the problem started. For example:
    • Kognitio software was upgraded
    • Some/all nodes were rebooted
    • OS patches were applied
    • Data feed changes occurred
    • Scheduling of jobs was changed
  5. Provide any other useful information available.
    • For performance issues, this could be timings before and after the problem occurred.
    • Query plans for problem queries, if available (precede the query with the keyword DIAGNOSE to capture this)
    • Any extra information you have captured during your own investigation

Once the above rules have been followed, generic advice from the earlier references on bug reporting comes into play:

  1. provide a good title/subject
  2. if you have precise steps to reproduce the problem, please include them
  3. provide any relevant additional information, such as DIAGNOSE output, log files, screen shots, …
  4. if it isn’t clear from the above, provide details of the expected behaviour, and how that differs from the behaviour you are seeing.
  5. file a separate report for each issue – do not be tempted to put a big list of issues in one ticket, things work better with one issue per ticket

Note that users of the support portal can use the “FAQ: Kognitio troubleshooting” solution which appears on the portal home page, and click through the decision tree to get assistance with a variety of problems.

Finally, let’s look at a bad problem report, followed by a good version of that same report:

 

BAD PROBLEM REPORT

Our system is giving errors. Please help. This is critical.

 

GOOD PROBLEM REPORT

Our production system, bigkog01, is giving “PM0007: PMA out of memory” errors when we run our imaging script.

We are running 8.1.0rel160921 software, and have been for over a month.

The errors started when we ran our usual imaging process on Friday night – it ran successfully on the previous Friday, and the script is unchanged.

We’ve attached the script, which basically drops existing images for our key objects, then recreates them – it works like this so user’s see a consistent view of the data throughout the week, so they can rerun analysis without being confused by updates to the underlying data during the week.

We have noticed RAM usage creeping up throughout the week as users instantiate more objects, and we have a separate case open at the moment about how we should tidy up those user objects rather than just let them keep consuming more and more of our system RAM.

We’ve run the $f7 query from submit to show RAM usage by object, and attached the output, along with the output from $f2 which shows RAM usage is very high!

We’ve searched for “kognitio pm0007” (without the quotes) and found the following two community forum articles, but neither of them seem to address our situation:

 

Facebook

Twitter

LinkedId