A Review of DataWorks Summit, San Jose 2017


Posted By : Mark Chopping

The DataWorks Summit in San Jose was held on June 13-15, and this blog post summarises interesting talks at the event.

Keynote section

Sumeet Singh (Yahoo)

Sumeet talked about Yahoo’s migration from MapReduce jobs to those running on Tez on the 39K+ nodes that they use for Hadoop processing with over 1M jobs per day. In the last year, MapReduce jobs have dropped from about 70% of the workload to around 10%, with Tez moving in the opposite direction (Spark job level remaining consistent). This has also allowed improved utilisation of the compute and storage infrastructure.

In addition, updating the version of Hive in use has led to a significant lowering of latency for analytics jobs. The following slide shows how the most common query runtime is now in the 5-30 second range (for about 1.2 million queries per month, out of a total of 2 million per month), although you can see how this increases with the number of records involved – on the right-hand side of the chart are the jobs which take over 5 minutes as the average number of records involved rises to around 15 billion.

Girish Mundada (HPE)

Girish had previously been a Postgres developer, and highlighted a number of lessons learned from this career:

  • databases never finish (in terms of development, not individual query runtime)
  • there are multiple ways to solve the same problem
  • use the right set of tools for the job (which seems relevant given the number of SQL on Hadoop alternatives that exist, and the possibility of deploying multiple of these solutions on your Hadoop cluster)

The real crux of his talk was explaining the complexity of Hadoop for many companies, and hence the benefit of using HPE or some other vendor (or indeed cloud infrastructure company) to simplify the process for these companies.

Hadoop Query Performance Smackdown (Michael Fagan, Comcast)

Michael and his co-speaker talked about their benchmarking of SQL on Hadoop solutions, using 66 of the TPC-DS queries and a variety of file formats.

The platform used was 11 worker nodes with 128GB RAM and 32 cores each, plus 5 master nodes with 90GB RAM and 32 cores each. The system had HDP 2.6 installed.

They chose to impose a 10 minute penalty for any failing queries, and all the engines used in their test had failures (from 1 for Hive LLAP, to 8 for Hive on MapReduce). They had issues with the Spark Thrift Server which led to very inconsistent timings for SparkSQL and long garbage-collection pauses – their feedback on this was to wait for improvements rather than rule this out permanently based on current performance.

From their timings, LLAP came out best, just ahead of Presto for the SQL engines (the latter having issues with date-related casting which was a big part of its 5 failing queries). Of the 66 queries, LLAP was fastest for 44, Presto for 16, and Tez for 6. They viewed LLAP and Presto as solid, with no major issues in their 3 months of testing.

On file formats, Bzip compressed text and sequence files performed the worst, which was a caveat against just dumping existing data into Hadoop using these formats. ORC Zlib was the benchmark winner, just ahead of Parquet.

In response to questions from the audience, the lack of concurrency in the tests was mentioned, as was the subset of TPC-DS queries run. Kognitio’s own benchmarking using TPC-DS queries did use concurrency, and did use all the queries – more information can be found at http://go.kognitio.com/hubfs/Whitepapers/sql-on-hadoop-benchmarks-wp.pdf

Tez Shuffle Handler: Shuffling At Scale with Apache Hadoop (John Eagles, Yahoo)

John used a Yahoo quote early in his talk, “When you run at scale, there are no corner cases”. He then discussed a number of rare performance issues seen as Yahoo increased adoption of Tez internally, where the sophistication in Tez had outgrown the MapReduce shuffle handler.

For example, the slide on the right shows a problem with auto-reduce when an operation was expected to use 999 reducers but ended up with just 1. This ended up having to retrieve data from 999 partitions on each of 4300 mappers, for a meagre total of 450MB. Due to the number of operations, this shuffle took 20 minutes. So Yahoo introduced a composite or ranged fetch to allow multiple partitions to be retrieved in one operation, reducing the shuffle time to 60 seconds.
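The scale of the problem is clear from the numbers quoted. A back-of-envelope check (figures taken from the talk):

```python
# Figures quoted in the talk: auto-reduce collapsed 999 reducers to 1,
# forcing fetches from 999 partitions on each of 4300 mappers.
mappers = 4300
partitions = 999
total_bytes = 450 * 1024 * 1024   # the "meagre" 450MB total

naive_fetches = mappers * partitions   # one fetch per (mapper, partition) pair
composite_fetches = mappers            # one ranged fetch per mapper

print(naive_fetches)                   # over 4 million tiny fetch operations
print(total_bytes // naive_fetches)    # only ~100 bytes per fetch on average
```

With per-operation overhead dominating rather than data volume, collapsing 999 single-partition fetches into one ranged fetch per mapper is what cut the shuffle from 20 minutes to 60 seconds.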

Similar issues were also seen with e.g. an auto-reduce merge – this time the composite fetch only sped up the operation from 50 minutes to 20 minutes as the real problem was an inefficiency in the way merges with a massive number of inputs (17 million for the query in question) were handled, and fixing this reduced the shuffle time to 90 seconds.

ORC File – Optimizing Your Big Data (Owen O’Malley, Hortonworks)

Owen discussed the importance of stripe size, with the default being 64MB. The setting is a trade-off: larger stripes give larger, more efficient reads, while smaller stripes require less memory and give more granular processing splits. When writing multiple files concurrently the stripe size is automatically shrunk, but sorting dynamic partitions means only one writer is active at a time.

He also covered HDFS block padding settings to align stripes with HDFS blocks, which gives a performance win at the cost of some storage inefficiency.

Predicate push down was covered, allowing parts of files which cannot contain matching rows to be skipped. ORC indexes at the file, stripe and row group (10,000 rows) level, allowing push down at various granularities.

Sorting data within a file rather than creating lots of partitions allows row pruning, and bloom filters can be used to improve scan performance. Again, there is a trade-off between the space used for these filters and their efficiency, which can be controlled via a parameter. A good example of row pruning occurs in TPC-DS with a literal predicate on the line item table – using no filters, 6M rows are read, with just the min/max metadata this is reduced to 540K rows, and with bloom filters it drops to 10K rows.
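To illustrate the idea (a minimal bloom filter sketch, not ORC's actual implementation): each row group carries a small bit array, and a scan only needs to read groups whose filter reports that the predicate value might be present.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m_bits=1024, k=3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, value):
        # Derive k deterministic bit positions from the value.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

# One filter per row group: a scan skips any group whose filter says "no".
row_group = ["AAL", "AAPL", "MSFT"]
bf = BloomFilter()
for v in row_group:
    bf.add(v)

assert bf.might_contain("AAPL")   # values actually present always pass
```

The trade-off parameter mentioned above corresponds to `m_bits` and `k` here: more bits mean fewer false positives (fewer row groups read unnecessarily) at the cost of more space per row group.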

Column encryption is supported, allowing some columns of a file to be encrypted (both data and index). The user can specify how data is treated when the user does not have access (nullify / redact / SHA256).

An Overview On Optimization In Apache Hive: Past, Present, Future (Hortonworks)

This talk mentioned the use of multiple execution engines (Tez, Spark), vectorized query execution to integrate with columnar storage (ORC, Parquet), and LLAP for low latency queries.

It covered the need for a query optimizer, and the challenge between plan generation latency and optimality. From 0.14.0, Hive has used Calcite for its logical optimiser, and gradually shifted logic from Hive to Calcite. The slide on the right shows some of the optimizer improvements made in this period.

For logical optimization there are rule-based and cost-based phases, with over 40 different rewriting rules (including pushdown of filter predicates, pushdown of project expressions, inference of new filter predicates, expression simplification, …). The rules also allow queries Hive could not otherwise execute to be transformed into an executable representation – e.g. queries with INTERSECT, EXCEPT, … will be rewritten to use JOIN, GROUP BY, …
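The INTERSECT rewrite can be demonstrated with any SQL engine (SQLite via Python here, standing in for Hive): the set operator is replaced by a JOIN plus GROUP BY without changing the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table t1(x int);
    create table t2(x int);
    insert into t1 values (1), (2), (2), (3);
    insert into t2 values (2), (3), (3), (4);
""")

# The original query uses INTERSECT ...
intersect = con.execute("select x from t1 intersect select x from t2").fetchall()

# ... which an optimizer can rewrite as JOIN plus GROUP BY (set semantics).
rewritten = con.execute(
    "select t1.x from t1 join t2 on t1.x = t2.x group by t1.x"
).fetchall()

assert sorted(intersect) == sorted(rewritten) == [(2,), (3,)]
```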

Calcite’s join reordering also allows bushy query plans to be generated (e.g. join tables A and B, then C and D, then join the results together, rather than just adding a table to the already-joined result each time).

Work is in progress on materialized view support, and future plans include collecting column statistics automatically, making better estimates of number of distinct values, and speeding up compilation. There should be an update on Hive performance on the Hortonworks blog in the next few weeks.

Running A Container Cloud On Yarn (Hortonworks)

Hortonworks builds, tests and releases open source software. As such, it does dozens of releases a year, with tens of thousands of tests per release across over a dozen Linux versions and multiple back-end databases. Therefore, they are looking to reduce overhead, and achieve greater density and improved hardware utilization.

Using a container cloud eliminates the bulk of virtualization overhead, improving density per node. Containers also help reduce image variance through composition. Startup time is fast, as there is no real boot sequence to run.

The building blocks for this cloud are:

  • YARN Container Runtimes – enable additional container types to make it easier to onboard new applications/services.
  • YARN Service Discovery – allow services running on YARN to easily discover one another.
  • YARN Native Services – enable long running YARN services.

For service discovery, the YARN Service Registry allows applications to register themselves, allowing discovery by other applications. Entries are stored in Zookeeper. The registry entries are exposed via the YARN DNS server, which watches the registry for changes and creates the corresponding DNS entry at the service level and container level.


The Columnar Roadmap: Apache Parquet and Apache Arrow (Julien Le Dem, Dremio)

Parquet and Arrow provide columnar storage for on-disk and in-memory respectively. The former has a focus on reading, with the expectation that data will be written once and read many times, whereas the latter is often for transient data and aims for maximisation of CPU throughput via efficient use of CPU pipelining, SIMD, and cache locality (which columnar structures support given that all the values for a given column are adjacent rather than interleaved with other columns).

The trade-off can be seen with e.g. Arrow having data in fixed positions rather than saving the space for NULL values, which gives better CPU throughput at the cost of some storage overhead.
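That fixed-position layout can be sketched in plain Python (an illustration of the trade-off, not Arrow's actual memory format):

```python
from array import array

# Fixed-position layout: one 8-byte slot per row plus a validity bitmap.
# NULL rows (rows 1 and 3 here) still occupy a slot in the values buffer.
values = array("q", [10, 0, 30, 0, 50])
valid = [True, False, True, False, True]

def get(i):
    """O(1) access by row number: value i sits at a fixed, predictable offset."""
    return values[i] if valid[i] else None

assert get(2) == 30 and get(3) is None

# A space-optimised layout would store only [10, 30, 50], saving two slots,
# but locating row i would then require counting set validity bits first,
# hurting the pipelining/SIMD-friendly access pattern Arrow aims for.
```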

The goal is for projects to adopt e.g. Arrow as a hub for interoperability, removing the need for duplicating functionality in many different projects, and also avoiding CPU costs of serialising/deserialising when moving data between different projects.

Exhibition Hall

As usual, a lot of people came to the Kognitio stand to talk about their practical problems in running SQL on Hadoop. Typically these revolve around connecting a lot of business users with their favourite tool (e.g. Tableau, MicroStrategy) to data stored in Hadoop. Even with a small number of concurrent users they tend to see issues, and they are then looking for a product which can give query performance on Hadoop with much higher levels of concurrency.

As mentioned in the query performance smackdown commentary above, this whitepaper has good information on benchmarking Impala, SparkSQL and Kognitio for industry-standard queries, including running with concurrency rather than a single stream, so if you read this post and have an SQL on Hadoop issue, that is a good reference point to start with.

A Review of Strata Data Conference, London 2017


Posted By : Mark Chopping

The Strata Data Conference was held at the ExCeL in London this week. Sessions that were of interest to me included:

What Kaggle has learned from almost a million data scientists (Anthony Goldbloom, Kaggle)

This was part of the keynote on the first day. Kaggle is a platform for data science competitions, and has had almost 1 million users participate. Over 4 million models have been submitted to competitions, and this presentation covered the traits Kaggle has seen in successful entrants.

In particular, for structured data the trend is for competitors to initially explore the data via histograms etc. to get a better understanding of it, then create and select features for use in their approach, which typically involves a classifier. The work on features is more important than the choice of classifier, and successful competitors are usually very creative in choosing features (e.g. car colour type for predicting car resale value), and persistent, as most intuitions around feature selection/creation prove to have little correlation with the end goal. Finally, the best competitors tend to use version control for their models, to make it easier to track the success/failure of each approach.

Creating real-time, data-centric applications with Impala and Kudu (Marcel Kornacker, Cloudera)

The room was packed for this session as Marcel gave some background on Kudu (a relational store that can be used as an alternative to HDFS) and Impala. He explained that Kudu avoided the limitations on delete, update and streaming inserts seen with HDFS, and the poor full table scan performance of HBase. As Kudu does not use HDFS for persistence, it handles its own replication, although this means it can’t e.g. benefit from the reduced storage overhead planned for HDFS in Hadoop 3. One workaround would be to only keep e.g. the latest 12 months worth of data in Kudu, and push older data into HDFS to benefit from its reduced storage overhead when that is available.

Kudu has been tested up to 275 nodes in a 3PB cluster, and internally uses columnar format when writing to disk, having collected records in RAM prior to this transposition. It allows range and hash partitioning to be combined. For example, you could use range partitioning by date, but then hash within a date to keep some level of parallelism when dealing with data for one particular date. Currently it only supports single-row transactions but the roadmap includes support for multi-row. From the examples given it appears there are some local predicates it cannot handle (e.g. LIKE with a regular expression), and batch inserts are reportedly slow. Multi-versioning is used as with Kognitio and many other products.
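The combined range/hash scheme from that example can be sketched as follows (a simplified model with hypothetical names, not Kudu's API):

```python
import datetime
import hashlib

N_HASH_BUCKETS = 4  # tablets per date, preserving parallelism within one day

def tablet_for(row_date: datetime.date, key: str) -> tuple:
    """Route a row first by range (its date), then by hash of its key."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % N_HASH_BUCKETS
    return (row_date.isoformat(), bucket)

d = datetime.date(2017, 6, 13)
tablets = {tablet_for(d, f"order-{i}") for i in range(1000)}

# All 1000 rows fall in the same date range, but spread over 4 hash buckets,
# so a single hot date is still written and scanned in parallel.
assert len(tablets) == N_HASH_BUCKETS
```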

Impala can use all three of those storage options (Kudu, HDFS, HBase), has over 1 million downloads, and is reportedly in use at 60% of Cloudera’s customers.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop (Marcel Kornacker, Cloudera) 
Marcel started by going through some performance benchmarks involving Impala, and highlighted the importance of moving beyond single user benchmarks.
He then moved onto some techniques for improving Impala performance including:
Partitioning: partitions can be eliminated by join lookup to generate run-time filters (what Kognitio used to call spider joins) – so if joining fact and date tables on a date key and having a local predicate on the date table, then that predicate can be used to generate a list of relevant date keys, and that filter can be applied to the fact table before the join. This appeared to be the biggest factor in Impala’s TPC-DS performance improvements in early 2016. Marcel advised regularly compacting tables to keep the number of files and partitions as small as possible, and gave general advice to stick with fewer than 20,000 partitions (too few and you don’t eliminate enough data with filters, too many and you lose parallelism and put extra load on the name node etc.). As in the example above, partition on join keys to get benefit from run-time filters.
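The run-time filter mechanism can be sketched in miniature (hypothetical tables, not Impala internals):

```python
# Dimension table: date_key -> date; fact table partitioned by date_key.
date_dim = {101: "2017-06-13", 102: "2017-06-14", 103: "2017-06-15"}
fact_partitions = {101: [("a", 5)], 102: [("b", 7)], 103: [("c", 9)]}

# Local predicate on the date table: date = '2017-06-14'.
relevant_keys = {k for k, d in date_dim.items() if d == "2017-06-14"}

# Run-time filter: eliminate fact partitions before the join even starts.
scanned = {k: rows for k, rows in fact_partitions.items() if k in relevant_keys}

assert set(scanned) == {102}   # two of three partitions are never read
```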
Sorting: this will be added in the next release. It is particularly useful as Parquet can store stats on e.g. min and max values within a page, so sorting can help eliminate some of those pages when there are too many column values for partitioning.
Use appropriate data types: some operations are a lot more expensive with different data types (e.g. strings), so try to avoid using these expensive data types. Kognitio used to offer similar advice to customers before modifying their product to make operations like string aggregation as efficient as integer aggregation.
Complex schemas: parent-child relationships with nested collections offer physical colocation of data, giving a natural optimization. You need to use columnar storage for this approach as the resulting tables are wide.
Statistics: it takes a long time to collect these, so customers often ask if they can get the same effect by using e.g. optimiser hints to determine the order of joins. That is not the case, as statistics are used for far more than determining join ordering – e.g. scan predicates are ordered by selectivity and cost, the selectivity of scans is computed, join sides need to be determined for efficiency, the join type needs to be decided, and a decision on whether to apply run-time filters needs to be made (as presumably the cost of generating and applying these can be significant). The ability to collect statistics on samples is being added, which will speed up stats collection.
A deep dive into Spark SQL’s Catalyst optimizer (Herman van Hovell tot Westerflier, Databricks)
In an early slide entitled “Why structure” Herman showed the performance benefit of using SQL for a simple aggregation task rather than working directly on RDDs with code. He then outlined the approach for query optimization used by Catalyst, from ensuring the query was valid syntactically and semantically, generating a logical plan, optimizing that plan, then generating physical plans which have their cost compared until a final physical plan is selected.
He discussed the use of partial functions to specify transformations of plans (e.g. constant folding), and then showed how it was possible to write your own planner rules to be incorporated into optimization.
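The partial-function style of rule can be mimicked in Python: a rule is only defined for the plan nodes it matches, and the transformation keeps everything else untouched (a sketch of the idea, not Catalyst's implementation):

```python
# Expression tree nodes: ("add", l, r), ("mul", l, r), or leaf values.
def transform(node, rule):
    """Bottom-up rewrite: apply the rule where it matches, keep the node otherwise."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(transform(c, rule) for c in node[1:])
    rewritten = rule(node)
    return node if rewritten is None else rewritten

def constant_fold(node):
    """'Partial function': only defined for operators over two int literals."""
    if isinstance(node, tuple) and all(isinstance(c, int) for c in node[1:]):
        op, l, r = node
        return l + r if op == "add" else l * r
    return None   # rule does not apply; transform leaves the node as-is

plan = ("mul", ("add", 1, 2), ("add", "col_x", ("add", 3, 4)))
assert transform(plan, constant_fold) == ("mul", 3, ("add", "col_x", 7))
```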
It wasn’t clear to me how the optimizer deals with having a vast number of potential rules to apply, with some being inapplicable at one stage in optimization, but then being valid later on in the process after other rules have been applied.
Journey to AWS: Straddling two worlds (Calum Murray, Intuit)
A very interesting talk on Intuit’s motivation for and subsequent execution of a migration from on-premise software to the cloud (Amazon, in their case).
The key takeaways were:
  • use a tip-toe approach rather than big-bang. Be in a position where you can flip back and forth from cloud for an application, then swap to cloud initially for a small number of minutes, gradually increasing to 24 hours or more.
  • swim-lane applications first, if possible, to allow for this approach (so you are only exposing a small subset of users to the change initially).
  • consider security implications – in their case, with over 10 million customers, they had to put extra encryption in place, use different encryption keys for on-premise and cloud, etc.

You can find speaker slides and videos from the conference at https://conferences.oreilly.com/strata/strata-eu/public/schedule/proceedings

Exhibition Hall

Conversations at the Kognitio stand in the exhibition hall reflected an increasing interest in how to address SQL on Hadoop problems with companies’ installed Hadoop systems. This was a marked step forward from last year’s conference, where a lot of the conversations were with companies that were just starting out with Hadoop, and hadn’t done enough with Hive/Impala/Spark SQL to have encountered significant problems that they needed to solve.

Hadoop’s biggest problem, and how to fix it


Posted By : Mark Chopping


Hadoop was seen as a silver bullet for many companies, but recently there has been an increase in critical headlines like:

  1. Hadoop Has Failed Us, Tech Experts Say
  2. You’re doing Hadoop and Spark wrong, and they will probably fail
  3. Has Hadoop Failed? That’s the Wrong Question

The problem

Dig behind the headlines, and a major issue is users’ inability to query data in Hadoop in the manner they are used to with commercial database products.

From the Datanami article:

  • Hadoop’s strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it’s ill-suited for running interactive, user-facing applications
  • It’s better than a data warehouse in that you have all the raw data there, but it’s a lot worse in that it’s so slow
  • “At the Hive layer, it’s kind of OK. But people think they’re going to use Hadoop for data warehouse…are pretty surprised that this hot new technology is 10x slower than what they’re using before,” Johnson says. “[Kudu, Impala, and Presto] are much better than Hive. But they are still pretty far behind where people would like them to be.”

The Register article based on a Gartner research talk recognises Hadoop’s strength for ETL processing, but highlights the issues with SQL-handling on Hadoop.

The Podium Data article states “Hadoop is terrible as a relational database”, and “Hadoop failed only in the sense that inflated expectations could never be met compared to mature commercial offerings.”

“The Growing Need for SQL for Hadoop” talks about the need for SQL for Hadoop. The ideal is to be “on Hadoop”, and thus processing data within the Hadoop cluster, rather than “off Hadoop” where data has to be extracted from Hadoop for processing.

Similarly, Rick van der Lans talks about “What Do You Mean, SQL Can’t Do Big Data?”, emphasising the need for SQL solutions when working with big data platforms.

RCA of the problem

There can be many reasons why current SQL-on-Hadoop products are not performant.

Possibilities include:

  • overhead of starting and stopping processes for interactive workloads – to run relatively simple queries quickly, you need to reduce latency. If you have a lot of overhead for starting and stopping containers to run tasks, that is a big impediment to interactive usage, even if the actual processing is very efficient
  • product immaturity – a lot of commercial databases have built on the shoulders of giants. For example, this wiki lists a set of products that derive from PostgreSQL, including Greenplum, Netezza, ParAccel, Redshift, Vertica. This gives these products a great start in avoiding a lot of mistakes made in the past, particularly in areas such as SQL optimisation. In contrast, most of the SQL-on-hadoop products are built from scratch, and so developers have to learn and solve problems that were long-since addressed in commercial database products. That is why we see great projects like Presto only starting to add a cost-based optimiser now, and Impala not being able to handle a significant number of TPC-DS queries (which is why Impala TPC-DS benchmarks tend to show less than 80 queries, rather than the full 99 from the query set).
  • evolution from batch processing – if a product like Hive starts off based on Map-Reduce, its developers won’t start working on incremental improvements to latency, as they won’t have any effect. Similarly, if Hive is then adopted for a lot of batch processing, there is less incentive to work on reducing latency. Hive 2 with LLAP project aims to improve matters in this area, but in benchmarks such as this AtScale one reported by Datanami it still lags behind Impala and SparkSQL.


Whilst benchmarks show that SQL on Hadoop solutions like Hive, Impala and SparkSQL are all continually improving, they still cannot provide the performance that business users need.

Kognitio have an SQL engine originally developed for standalone clusters of commodity servers, and used by a host of enterprise companies. Due to this heritage, the software has a proven history of working effectively with tools like Tableau and MicroStrategy, and delivering leading SQL performance with concurrent query workloads – just the sort of problems that people are currently trying to address with data in Hadoop. The Kognitio SQL engine has been migrated to Hadoop, and could be the solution a lot of users of Hive, Impala and SparkSQL need today.

It has the following attributes:

  • free to use with no limits on scalability, functionality, or duration of use
  • mature in terms of query optimisation and functionality
  • performant, particularly with concurrent SQL query workloads
  • can be used both on-premise and in the cloud

For further information about Kognitio On Hadoop, try:


This post first appeared on LinkedIn on March 23, 2017.

Playing Around With Kognitio On Hadoop: Hashes Cannot Always Anonymise Data


Posted By : Chris Barratt

In this brave new world of big data one has to be careful of the loss of privacy with the public release of large data sets. Recently I reread “On Taxis and Rainbows”, which presents an account of how the author came to de-anonymise a data set made available by New York City’s Taxi and Limousine Commission. This data set included logs of each journey made: locations and times of pickups and drop-offs along with supposedly anonymised medallion numbers and other data. The author came to identify the medallion number data as the MD5 check-sums of the medallion numbers. They were then able to compute a table mapping 22 million possible medallion numbers to their MD5 check-sums. They reported that this took less than 2 minutes. This table allowed them to start ‘de-anonymising’ the data.

The article identified that this was possible because of the nature of the data. The set of all possible medallion numbers was known, and small enough for a table mapping check-sums back to medallion numbers to be generated in a feasible amount of time.
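The attack is easy to reproduce in miniature with Python's hashlib. Using one illustrative medallion pattern (digit, letter, digit, digit – the real candidate space quoted in the article was about 22 million numbers across a handful of similar patterns), a reverse-lookup table is built almost instantly:

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits

# One illustrative medallion pattern: digit + letter + digit + digit, e.g. "5X55".
candidates = ("".join(p) for p in product(digits, ascii_uppercase, digits, digits))

# Reverse-lookup table: MD5 checksum -> medallion number.
rainbow = {hashlib.md5(m.encode()).hexdigest(): m for m in candidates}

# An "anonymised" value from the released data is now trivially reversed.
anon = hashlib.md5(b"5X55").hexdigest()
assert rainbow[anon] == "5X55"
print(len(rainbow))   # 26,000 candidates hashed in well under a second
```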

Computing a large number of check-sums should be easy for a Kognitio system to handle. As several check-sum functions are implemented as SQL functions in the Kognitio system, I wondered how quickly Kognitio could generate a table mapping some potentially personally identifying information to check-sums using the functions.

As every UK address has a postcode, I thought that postcodes would form a good candidate for a piece of personally identifying information. They are roughly equivalent to US ZIP codes. It seems plausible to me that someone might attempt to anonymise postcodes using check-sums. So I thought I would try using Kognitio to compute tables of postcodes and their check-sums to see how quickly such an anonymisation strategy might be broken.

I found an almost complete list of valid postcodes in the “Code-Point Open” data set. Handily, this is freely available from the Ordnance Survey under the Open Government Licence. The data set contains about 1.7 million records, each with a postcode field and other fields for related geographical data. The field titles are PostCode, Positional_quality_indicator, Eastings, Northings, Country_code, NHS_regional_HA_code, NHS_HA_code, Admin_county_code, Admin_district_code and Admin_ward_code.


The data set is provided as a zip archive containing 120 CSV files with a total uncompressed size of 154MiB.

Getting The Kognitio System Check-sum Functions Timed

For the check-summing task I created a new Kognitio on Hadoop system (the ‘Kognitio system’ from here on) to run on a small developmental Hadoop cluster that I have access to. I copied the CSV files into a new HDFS directory on the cluster running the Kognitio system.

For all the queries/SQL commands I used the ‘sys’ superuser of the Kognitio system. I was the only one who was going to use the Kognitio system so there was no need to protect data from anyone but myself and I could avoid setting up unneeded users and privileges for anything in the system.

I created an external table to give the Kognitio system access to the CSV data on HDFS. This kind of table can’t be written to but can be queried like any other ‘regular’ table on the Kognitio system. Then I created a view on the external table to project only the postcodes. I followed this by creating a view image on the view to pin the postcode only rows of the view in memory. It is faster to access rows in memory than anywhere else.

Then I created several ram-only tables to store postcodes with their check-sums. Ram-only tables as the name suggests do not store row data on disk and so using them avoids any slow disk operation overheads.

Finally I ran and timed a series of insert-select queries that computed the check-sums for our postcodes and inserted them into the ram-only tables.

Some Background On External Tables, Connectors and Plugin Modules

Kognitio’s external tables present data from external sources as a set of records of fields using a connector. The list of fields and the connector are specified by the create external table statement.

The connector in turn specifies the data provider to use along with any connector specific configuration parameter settings required. The data provider implements the interface to the external data source for the internals of the Kognitio system. Data providers are also known as connectors but I wanted to avoid potential confusion from shared names.

The data provider that I used to access the HDFS data is implemented as a plugin connector provided by a plugin module. Plugin modules are how the Kognitio system implements some features that are not in the core code. Loading and activating a plugin module is required to make its features available to the Kognitio system. Plugin modules can also present parameters used for any module-specific configuration required.

Creating The External Table: Making the Code-Point Open Data Accessible to the Kognitio System

The external table that I needed to access the data on the HDFS system required a connector that used the HDFS data provider. This is implemented as a plugin connector from the Hadoop plugin module. So I loaded the Hadoop plugin module, made it active and set the libhdfs and hadoop_client plugin module parameters.

create module hadoop;
alter module hadoop set mode active;
alter module hadoop set parameter libhdfs to '/opt/cloudera/parcels/CDH/lib64/libhdfs.so';
alter module hadoop set parameter hadoop_client to '/opt/cloudera/parcels/CDH/bin/hadoop';

The parameters indicate to the HDFS data provider where the libhdfs library and hadoop executable are on my Hadoop cluster.

Then I could create the connector that was needed:

create connector hdfs_con source hdfs target ' namenode hdfs://, bitness 64 ';

The create connector statement specifies where the namenode is and to use 64 bit mode, which is best.

Then I could create the external table:

create external table OSCodePoints (
PostCode char(8),
Positional_quality_indicator varchar(32000),
Eastings varchar(32000),
Northings varchar(32000),
Country_code varchar(32000),
NHS_regional_HA_code varchar(32000),
NHS_HA_code varchar(32000),
Admin_county_code varchar(32000),
Admin_district_code varchar(32000),
Admin_ward_code varchar(32000)
) from hdfs_con target
'file "/user/kodoopdev1/OSCodePoint2016Nov/*.csv"';
The ‘file’ parameter specifies which HDFS files contain the data I want to access.

I created a view with a view image:

create view PostCodes as select PostCode from OSCodePoints;
create view image PostCodes;

The view projects just the postcode column out. Creating the view image pins the results from the view into RAM. All queries on the view read data from the view image and ran faster as a result.

I generated check-sums using the md5_checksum, sha1_checksum, sha224_checksum, sha256_checksum, sha384_checksum and sha512_checksum functions. These are all implemented as plugin functions provided by the hashes plugin module. This was loaded and activated:

create module hashes;
alter module hashes set mode active;

I ran the check-sum functions in seven different insert-select queries to populate a set of lookup tables mapping postcodes to their check-sums. Six of the queries each generated a check-sum using one of the check-sum functions. The 7th query used all 6 check-sum functions to generate a lookup table of all 6 check-sums simultaneously. This was done to get some data indicating the overhead associated with retrieving and creating rows. The 7 lookup tables were created as ram-only tables. As the name implies, these hold the row data in RAM and not on disk, and so remove the overhead of writing the rows to disk.

The ram-only table create statements were:

create ram only table PostCodeMD5Sums (
PostCode char(8),
md5_checksum binary(16)
);

create ram only table PostCodeSHA1Sums (
PostCode char(8),
sha1_checksum binary(20)
);

create ram only table PostCodeSHA224Sums (
PostCode char(8),
sha224_checksum binary(28)
);

create ram only table PostCodeSHA256Sums (
PostCode char(8),
sha256_checksum binary(32)
);

create ram only table PostCodeSHA384Sums (
PostCode char(8),
sha384_checksum binary(48)
);

create ram only table PostCodeSHA512Sums (
PostCode char(8),
sha512_checksum binary(64)
);

create ram only table PostCodeCheckSums (
PostCode char(8) CHARACTER SET LATIN1,
md5_checksum binary(16),
sha1_checksum binary(20),
sha224_checksum binary(28),
sha256_checksum binary(32),
sha384_checksum binary(48),
sha512_checksum binary(64)
);

The insert-select queries were:

insert into PostCodeMD5Sums
select PostCode, md5_checksum(PostCode) from PostCodes;

insert into PostCodeSHA1Sums
select PostCode, sha1_checksum(PostCode) from PostCodes;

insert into PostCodeSHA224Sums
select PostCode, sha224_checksum(PostCode) from PostCodes;

insert into PostCodeSHA256Sums
select PostCode, sha256_checksum(PostCode) from PostCodes;

insert into PostCodeSHA384Sums
select PostCode, sha384_checksum(PostCode) from PostCodes;

insert into PostCodeSHA512Sums
select PostCode, sha512_checksum(PostCode) from PostCodes;

insert into PostCodeCheckSums
select PostCode,
       md5_checksum(PostCode), sha1_checksum(PostCode),
       sha224_checksum(PostCode), sha256_checksum(PostCode),
       sha384_checksum(PostCode), sha512_checksum(PostCode)
from PostCodes;

Results And Conclusions

For the 1.7 million rows (1691721 to be exact) the times for the insert-select queries were:

md5_checksum: 0.62s
sha1_checksum: 0.53s
sha224_checksum: 0.63s
sha256_checksum: 0.67s
sha384_checksum: 0.83s
sha512_checksum: 0.86s
all 6 checksums: 2.87s

The results show that the check-sum for each postcode is being calculated in around 500ns or less, so this system could calculate about 170 billion check-sums a day. The Kognitio system was running on a node with two E5506s, giving 8 cores running at 2.13GHz; the node is several years old. It seems obvious that using check-sums and/or hashes alone isn't going to hide data on this system, even more so with faster, up-to-date hardware.
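A quick sanity check on those figures (a sketch; the row count and timings are taken from the results above):

```python
# Per-row cost and daily throughput implied by the measured timings.
rows = 1_691_721
times = {
    "md5": 0.62, "sha1": 0.53, "sha224": 0.63,
    "sha256": 0.67, "sha384": 0.83, "sha512": 0.86,
}
for name, secs in times.items():
    print(f"{name}: {secs / rows * 1e9:.0f} ns/row")

# Even the slowest function implies roughly 170 billion check-sums a day.
per_day = rows / times["sha512"] * 86400
print(f"{per_day / 1e9:.0f} billion sha512 check-sums/day")
```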

Comparing the “all 6 check-sums” insert-select query run time with those of the others did show that there was an appreciable overhead in retrieving and writing the row data. The total time taken to populate 6 lookup tables of just one check-sum value was 1.44 times more than the time taken to populate the one lookup table of all six check-sum functions.
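The 1.44 figure follows directly from the timings above:

```python
# Six single-checksum insert-selects vs one query computing all six at once.
individual = [0.62, 0.53, 0.63, 0.67, 0.83, 0.86]
combined = 2.87
print(round(sum(individual) / combined, 2))
```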

Because I used the Code-Point Open data, the following attributions apply:
Contains OS data © Crown copyright and database right (2016)

Contains Royal Mail data © Royal Mail copyright and Database right (2016)

Contains National Statistics data © Crown copyright and database right (2016)

Using Kognitio on Amazon Elastic Map/Reduce


Posted By : Andy MacLean Comments are off

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

  • an Amazon AWS account with permission to create EMR clusters
  • the Kognitio on Hadoop software tarballs (kodoop.tar.gz and kodoop-extras.tar.gz), release 8.1.50-rel161221 or later
  • an S3 bucket to hold the tarballs and setup scripts

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.


Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes – the ‘master node’, which runs Hadoop-specific cluster master programs like the HDFS namenode and Yarn resource manager, the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers, and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it; it gets transferred to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node so that’s how this example will work. There are other possible choices here – you can just use one of the cluster nodes, you can spin up a specific task group node to run it or you can just have an arbitrary EC2 node with the right security settings and client software installed. However the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Setup authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop-extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:


#change the s3 bucket for your site
S3BUCKET=s3://kognitio-development

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

# append the zookeeper quorum entry for slider; the zookeeper installed
# with the cluster runs on port 2181 of the master node
grep -v '</configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>\$(hostname -f):2181</value>
  </property>
</configuration>
XXX
cp kodoop/slider/conf/slider-client.xml kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression
XXX
EOF

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (You’ll want to change that to be your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This setting enables message compression, which causes messages to get compressed (with the LZ4 compression algorithm) before being sent over a network. This setting is not normally used but we recommend it for Amazon because the network:cpu speed ratio is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have it run automatically as part of the cluster startup. You can do this by transferring the script to S3 and putting it in a directory together with the tarballs (editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor and select the release you want (we’ve tested with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper to be selected from the list of packages, although adding others which you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:

[{"classification":"yarn-site","properties":{"yarn.nodemanager.delete.debug-delay-sec":"3600"}}]

This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can use s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to s3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will then run automatically on the master node after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node, so this can be any instance type you like; the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use a small number of the largest instances of a node type rather than a larger number of smaller instances (so 3x r3.8xlarge instead of 6x r3.4xlarge, for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is for the total RAM of the nodes which will be used for the Kognitio cluster to be between 1.5x and 2x the size of the raw data you want to load as memory images.
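The rule of thumb above can be turned into a quick node-count estimate. A minimal sketch, assuming r3.8xlarge nodes with 244GiB of RAM and a 1.75x headroom factor (both illustrative figures, not recommendations from the text):

```python
import math

def core_nodes_needed(raw_data_gb: float, node_ram_gb: float,
                      headroom: float = 1.75) -> int:
    """Nodes needed so total cluster RAM is ~1.5x-2x the raw data size."""
    return math.ceil(raw_data_gb * headroom / node_ram_gb)

# e.g. 1TB of raw data on r3.8xlarge (244GiB RAM) Core nodes
print(core_nodes_needed(1000, 244))
```

Varying the headroom factor between 1.5 and 2.0 gives the lower and upper ends of the sizing range.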

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your node. This is required because the AMI used by EMR needs to have a small amount of configuration done and some extra Linux packages installed in order for it to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define this as a ‘custom action’ bootstrap action. The following script does everything you need:


sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide us with a simple way to do this. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses this. Amazon runs the bootstrap action before the relevant repositories have been configured on the node so the RPM can’t be installed via yum. By the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc. that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.

Once this is working you can create a Kognitio cluster. You will create a number of Yarn containers with a size you specify. You will need to choose a container size, container vcore count and a number of containers that you want to use for the cluster. Normally you’ll want to use a single container per node which uses nearly all of the memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
ip-172-40-0-91.eu-west-1.compute.internal:8041          RUNNING ip-172-40-0-91.eu-west-1.compute.internal:8042                             1
ip-172-40-0-126.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-126.eu-west-1.compute.internal:8042                            2
ip-172-40-0-216.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-216.eu-west-1.compute.internal:8042                            1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status ip-172-40-0-91.eu-west-1.compute.internal:8041
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/
Node Report : 
        Node-Id : ip-172-40-0-91.eu-west-1.compute.internal:8041
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address : ip-172-40-0-91.eu-west-1.compute.internal:8042
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.
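Applying that subtraction to the example node report above (a sketch using the figures shown):

```python
# Leave room for the 2048MB / 1-vcore application management container.
memory_capacity_mb = 253952   # 'Memory-Capacity' from yarn node -status
cpu_capacity_vcores = 128     # 'CPU-Capacity'

container_memsize = memory_capacity_mb - 2048
container_vcores = cpu_capacity_vcores - 1
print(container_memsize, container_vcores)
```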

You will also need to choose a name for the cluster, which must be 12 characters or less and can only contain lower case letters, numbers and underscores. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from www.kognitio.com, install them somewhere, run Kognitio console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue sql locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).

There are now lots of different ways you can set up your server and get data into it, but the most common approach is to build memory images (typically view images) and run SQL against those. This is typically a two-step process: create external tables which pull external data directly into the cluster, then create view images on top of these to pull data from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognitio is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example, we can connect to a 6 column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If instead it is a directory full of csv files we can use ‘/path/to/file/test/*.csv’ instead to use them all as a single table in Kognitio.

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
accesskey "YOUR_ACCESS_KEY"
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name

grant connect on connector aws to grp_aws;

This SQL loads the Kognitio Amazon plugin, creates a group to control access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials then create multiple connectors and grant permission on different ones to appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it. In this case you can just add them to the group grp_aws which has this.

max_connectors_per_node is needed here because the Amazon connector gives out-of-memory errors if you try to run too many instances of it in parallel on each node.

Now an external table can be created in exactly the same way as in the HDFS example. If my amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used images are view images. With a view image the user defines a view in the normal SQL way and then they image it, which makes an in-memory snapshot of the query. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every ram processing task
  • Random — Just put the rows anywhere. This is what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100M in size) work best replicated
  • For everything else hash on the primary key except
  • For the biggest images which join to non-replicated tables hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.

Monitoring Kognitio from the Hadoop Resource Manager and HDFS Web UI


Posted By : Alan Kerr Comments are off

If you’ve already installed Kognitio on your Hadoop distribution of choice, or are about to, then you should be aware that Kognitio includes full YARN integration allowing Kognitio to share the Hadoop hardware infrastructure and resources with other Hadoop applications and services.

Latest resources for Kognitio on Hadoop:

Download:  http://kognitio.com/on-hadoop/

Forum:   http://www.kognitio.com/forums/viewforum.php?f=13

Install guide (including Hadoop pre-requisites for Kognitio install):


This means that YARN (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)  (Hadoop’s preferred resource manager) remains in control of the resource allocation for the Kognitio cluster.

Kognitio clusters can be monitored from the apache YARN resource manager UI, and the HDFS name node UI.

You can reach the YARN resource manager UI from your Hadoop management interface -> YARN -> Web UI, or point your browser to the node running the resource manager (default) port 8088.

[Screenshot: YARN resource manager UI showing running applications]

The major Hadoop distributions all support the apache YARN resource manager: Cloudera, Hortonworks, MapR, and IBM.

From the Cloudera management interface reach the YARN Web UI from:

[Screenshot: Cloudera Manager clusters menu]

And for the HDFS UI which is typically accessible by pointing to the name node on port 50070:

[Screenshot: HDFS name node UI directory browser]

Or, use the Kognitio Console external data browser:

[Screenshot: HDFS file structure in the Kognitio Console external data browser]

The Kognitio on Hadoop cluster

Kognitio is designed as a persistent application running on Hadoop under YARN.

A Kognitio cluster can be made up from 1 or more application containers. Kognitio uses apache slider (https://slider.incubator.apache.org/) to deploy, monitor, restart, and reconfigure the Kognitio cluster.

A single Kognitio application container must fit onto a single data node. It is recommended not to size Kognitio containers less than 8GB RAM. All application containers within a Kognitio cluster will be sized the same. YARN will place the application containers. It is possible to have multiple application containers from the same Kognitio cluster running on the same data node.

For example, to size for a 1TB RAM Kognitio instance you could choose one of the following options:

64 x 16GB RAM application containers,
32 x 32GB RAM application containers,
16 x 64GB RAM application containers,
8 x 128GB RAM application containers,
4 x 256GB RAM application containers,
2 x 512GB RAM application containers

Of course, the choice is restricted by the Hadoop cluster and the size and available resource of its data nodes.
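The options listed above are simply the equal-sized container layouts that divide a 1TB (1024GB) instance, which can be enumerated directly:

```python
# Enumerate equal-sized container layouts for a 1TB (1024GB) instance.
target_gb = 1024
for size_gb in (16, 32, 64, 128, 256, 512):
    print(f"{target_gb // size_gb} x {size_gb}GB RAM application containers")
```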

Starting a Kognitio on Hadoop cluster

YARN creates an application when a Kognitio cluster is started. This application will be assigned an ApplicationMaster (AM). A slider management container is launched under this application. The slider manager is responsible for the deployment, starting, stopping, and reconfiguration of the Kognitio cluster.

The slider manager runs within a small container allocated by YARN and it will persist for the lifetime of the Kognitio cluster. Requests are made from the slider manager to YARN to start the Kognitio cluster containers. The YARN ApplicationMaster launches each container request and creates application containers under the original application ID. The Kognitio package and server configuration information will be pulled from HDFS to each of the application containers. The Kognitio server will then start within the application container. Each container will have all of the Kognitio server processes running within it (ramstores, compilers, interpreters, ionodes, watchdog, diskstores, smd).

It should be noted here that Kognitio is dynamically sized to run within the container memory allocated. This includes a 7% default fixed pool of memory for external processes such as external scripts. Kognitio runs within the memory allocated to the container. If you have a requirement to use memory intensive external scripts, then consider increasing the fixed pool size and also increasing the container memory size to improve script performance.

If there is not enough resource available for YARN to allocate an application container then the whole Kognitio cluster will fail to start. The “kodoop create_cluster …” command submitted will not complete, and slider will continue to wait for all the application containers to start. It is advisable to exit at this point and verify resource availability and how the YARN resource limits have been configured on the Hadoop cluster.

Hadoop yarn defaults for Hadoop 2.7.3: https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Settings of interest when starting Kognitio clusters /containers:

  • yarn.resourcemanager.scheduler.class (default: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler): the class to use as the resource scheduler.
  • yarn.scheduler.minimum-allocation-mb (default: 1024): the minimum allocation for every container request at the RM, in MBs. Memory requests lower than this will throw an InvalidResourceRequestException.
  • yarn.scheduler.maximum-allocation-mb (default: 8192): the maximum allocation for every container request at the RM, in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.
  • yarn.nodemanager.resource.memory-mb (default: 8192): amount of physical memory, in MB, that can be allocated for containers.
  • yarn.nodemanager.pmem-check-enabled (default: true): whether physical memory limits will be enforced for containers.
  • yarn.nodemanager.vmem-check-enabled (default: true): whether virtual memory limits will be enforced for containers.
  • yarn.nodemanager.vmem-pmem-ratio (default: 2.1): ratio between virtual memory and physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.

NOTE: These are YARN default values, not recommended Kodoop settings

The default settings are going to be too small for running Kognitio. As mentioned already, Kognitio containers are typically sized between 16GB and 512GB of RAM, or higher. yarn.nodemanager.resource.memory-mb should be set to a size that accommodates the container(s) allocated to a node. With other services running on the Hadoop cluster, a site-specific value here to limit the memory allocation for a node or group of nodes may be necessary.
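As a sketch, to leave room on each node for a 251904MB Kognitio container plus the 2048MB application management container (the figures from the EMR example earlier; adjust to your own node sizes), yarn-site.xml would need something like:

```xml
<!-- Allow containers to use up to 253952 MB (251904 + 2048) per node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>253952</value>
</property>
```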

Once the Kognitio cluster containers have been allocated by the YARN ApplicationMaster, they will transition to a RUNNING state. Once the Kognitio server has started within each of the application containers, an SMD master will be elected for the Kognitio cluster on Hadoop in the same way as SMD works on a multi-node Kognitio stand-alone appliance. The Kognitio cluster will then run through a “newsys” to commission the new system.

[Screenshot: YARN scheduler queues running the Kognitio application]

From the Kognitio edge node command line you can stop | start | reconfigure the Kognitio cluster. Stopping a Kognitio cluster changes the YARN application state to FINISHED; all of the application containers and the slider manager container will be destroyed. Restarting a Kognitio cluster creates a new YARN ApplicationMaster along with new slider management and application containers.

[Screenshot: running Kognitio applications in the YARN UI]

Because the data persists on HDFS, when a Kognitio cluster is restarted all of the existing metadata and database objects remain. Memory images will not be recoverable after a Kognitio cluster restart, although they will be recoverable after a Kognitio server start.

What if YARN kills the Kognitio on Hadoop cluster?

It is possible for YARN to kill the Kognitio cluster application. This could happen to free up memory resources on the Hadoop cluster. If this happens it should be treated as though the “kodoop cluster stop” command has been submitted. The HDFS for the cluster will persist and it is possible to start the cluster, reconfigure the cluster, or remove the cluster.

[Screenshot: YARN application list showing a killed Kognitio application]

Slider Logs

As a resource manager, YARN can “giveth resource and can also taketh away”. The Kognitio server application processes run within the Kognitio application container process group. The YARN ApplicationMaster for each Kognitio cluster will monitor the container process groups to make sure the allocated resource is not exceeded.

In a pre-release version of Kognitio on Hadoop a bug existed whereby too many processes were being started within a Kognitio application container. This would make the container susceptible to growing larger than the original container resource allocation when the Kognitio cluster was placed under load. The YARN ApplicationMaster would terminate the container. If this happened it would be useful to check the slider logs to determine the root cause of why the container was killed.

The slider logs for the Kognitio cluster can be accessed from the YARN web UI.

[Screenshot: YARN application attempt view]

The image shows that a Kognitio container has been restarted: container IDs increment sequentially as containers are added to the Kognitio cluster, and “container_1477582273761_0035_01_000003” is now missing, with a new “container_1477582273761_0035_01_000004” started in its place. It is possible to examine the slider management container log to determine what happened to the container that is no longer present in the running application.

[Screenshot: resource manager logs for the slider management container]

With Kognitio auto-recovery enabled, if a container is terminated due to running beyond physical memory limits then the cluster will not restart; it is advised to determine the cause of the termination before manually restarting the cluster. If Kognitio suffers a software failure with auto-recovery enabled, then Kognitio will automatically restart the server.

In the event of a container being terminated, use the slider logs to scroll to where the container was killed. In this example case it was because the container had exceeded its resource limit.

containerID=container_1477582273761_0035_01_000003] is running beyond physical memory limits. Current usage: 17.0 GB of 17 GB physical memory used; 35.5 GB of 35.7 GB virtual memory used. Killing container.

The slider log also contains a dump of the process tree for the container. The RSSMEM_USAGE (PAGES) column for the processes showed that the resident memory for the process group exceeded 18.2GB for a 17GB allocated application container.

Copy the process tree dump from the slider log to a file and sum the rssmem pages to get a total:
awk '{pages += $10; rssmem = pages * 4096; print pages, rssmem, $10, $6}' file