A Review of DataWorks Summit, San Jose 2017

Posted 19 Jun 2017 by Mark Chopping

The DataWorks Summit in San Jose was held on June 13-15, and this blog post summarises interesting talks at the event.

Keynote section

Sumeet Singh (Yahoo)

Sumeet talked about Yahoo’s migration from MapReduce jobs to those running on Tez on the 39K+ nodes that they use for Hadoop processing with over 1M jobs per day. In the last year, MapReduce jobs have dropped from about 70% of the workload to around 10%, with Tez moving in the opposite direction (Spark job level remaining consistent). This has also allowed improved utilisation of the compute and storage infrastructure.

In addition, updating the version of Hive in use has led to a significant lowering of latency for analytics jobs. The following slide shows how the most common query runtime is now in the 5-30 second range (for about 1.2 million queries per month, out of a total of 2 million per month), although you can see how this increases with the number of records involved – on the right-hand side of the chart are the jobs which take over 5 minutes as the average number of records involved rises to around 15 billion.

Girish Mundada (HPE)

Girish had previously been a Postgres developer, and highlighted a number of lessons learned from this career:

  • databases never finish (in terms of development, not individual query runtime)
  • there are multiple ways to solve the same problem
  • use the right set of tools for the job (which seems relevant given the number of SQL on Hadoop alternatives that exist, and the possibility of deploying multiple of these solutions on your Hadoop cluster)

The real crux of his talk was explaining the complexity of Hadoop for many companies, and hence the benefit of using HPE or some other vendor (or indeed cloud infrastructure company) to simplify the process for these companies.

Hadoop Query Performance Smackdown (Michael Fagan, Comcast)

Michael and his co-speaker talked about their benchmarking of SQL on Hadoop solutions, using 66 of the TPC-DS queries and a variety of file formats.

The platform used was 11 worker nodes with 128GB RAM and 32 cores each, plus 5 master nodes with 90GB  RAM and 32 cores each. The system had HDP 2.6 installed.

They chose to impose a 10 minute penalty for any failing queries, and all the engines used in their test had failures (from 1 for Hive LLAP, to 8 for Hive on MapReduce). They had issues with the Spark Thrift Server which led to very inconsistent timings for SparkSQL and long garbage-collection pauses – their feedback on this was to wait for improvements rather than rule this out permanently based on current performance.

From their timings, LLAP came out best, just ahead of Presto for the SQL engines (the latter having issues with date-related casting which was a big part of its 5 failing queries). Of the 66 queries, LLAP was fastest for 44, Presto for 16, and Tez for 6. They viewed LLAP and Presto as solid, with no major issues in their 3 months of testing.

On file formats, Bzip compressed text and sequence files performed the worst, which was a caveat against just dumping existing data into Hadoop using these formats. ORC Zlib was the benchmark winner, just ahead of Parquet.

In response to questions from the audience, the lack of concurrency in the tests was mentioned, as was the subset of TPC-DS queries run. Kognitio’s own benchmarking using TPC-DS queries did use concurrency, and did use all the queries – more information can be found at http://go.kognitio.com/hubfs/Whitepapers/sql-on-hadoop-benchmarks-wp.pdf

Tez Shuffle Handler: Shuffling At Scale with Apache Hadoop (John Eagles, Yahoo)

John used a Yahoo quote early in his talk, “When you run at scale, there are no corner cases”. He then discussed a number of rare performance issues seen as Yahoo increased adoption of Tez internally, where the sophistication in Tez had outgrown the MapReduce shuffle handler.

For example, the slide on the right shows a problem with auto-reduce when an operation was expected to use 999 reducers but ended up with just 1. This ended up having to retrieve data from 999 partitions on each of 4300 mappers, for a meagre total of 450MB. Due to the number of operations, this shuffle took 20 minutes. So Yahoo introduced a composite or ranged fetch to allow multiple partitions to be retrieved in one operation, reducing the shuffle time to 60 seconds.

Similar issues were also seen with e.g. an auto-reduce merge – this time the composite fetch only sped up the operation from 50 minutes to 20 minutes as the real problem was an inefficiency in the way merges with a massive number of inputs (17 million for the query in question) were handled, and fixing this reduced the shuffle time to 90 seconds.

ORC File – Optimizing Your Big Data (Owen O’Malley, Hortonworks)

Owen discussed the importance of stripe size, with the default being 64MB. The setting is a trade-off: larger stripes give larger, more efficient reads, while smaller stripes require less memory and give more granular processing splits. When writing multiple files concurrently the stripe size is automatically shrunk, but sorting dynamic partitions means only one writer is active at a time.

He also covered HDFS block padding settings to align stripes with HDFS blocks, which gives a performance win at the cost of some storage inefficiency.

Predicate push down was covered, allowing parts of files which cannot contain matching rows to be skipped. ORC indexes at the file, stripe and row group (10K rows) level, to allow push down at various granularities.

Sorting data within a file rather than creating lots of partitions allows row pruning, and bloom filters can be used to improve scan performance. Again, there is a trade-off between the space used for these filters and their efficiency, which can be controlled via a parameter. A good example of row pruning occurs in TPC-DS with a literal predicate on the line item table – using no filters, 6M rows are read, with just the min/max metadata this is reduced to 540K rows, and with bloom filters it drops to 10K rows.
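As a rough sketch of how those knobs are exposed in Hive DDL (the table and column names below are made up, the property values are illustrative, and these are standard ORC table properties rather than anything specific to the talk):

create table lineitem_orc (
    l_orderkey bigint,
    l_shipdate string,
    l_quantity double
)
stored as orc
tblproperties (
    "orc.compress" = "ZLIB",                   -- the benchmark winner mentioned above
    "orc.stripe.size" = "67108864",            -- 64MB stripes, the default discussed
    "orc.bloom.filter.columns" = "l_shipdate", -- bloom filter on the commonly filtered column
    "orc.bloom.filter.fpp" = "0.05"            -- space vs. filter accuracy trade-off
);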

Column encryption is supported, allowing some columns of a file to be encrypted (both data and index). The user can specify how data is treated when the user does not have access (nullify / redact / SHA256).

An Overview On Optimization In Apache Hive: Past, Present, Future (Hortonworks)

This talk mentioned the use of multiple execution engines (Tez, Spark), vectorized query execution to integrate with columnar storage (ORC, Parquet), and LLAP for low-latency queries.

It covered the need for a query optimizer, and the challenge between plan generation latency and optimality. From 0.14.0, Hive has used Calcite for its logical optimiser, and gradually shifted logic from Hive to Calcite. The slide on the right shows some of the optimizer improvements made in this period.

For logical optimization there are rule-based and cost-based phases, with over 40 different rewriting rules (including pushdown of filter predicates, pushdown of project expressions, inference of new filter predicates, expression simplification, …). The rules also allow queries Hive could not otherwise execute to be transformed into an executable representation – e.g. queries with INTERSECT, EXCEPT, … will be rewritten to use JOIN, GROUP BY, …
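As an illustration of that kind of rewrite, here is a sketch of one standard INTERSECT-to-JOIN transformation (the table names are made up, it glosses over NULL handling, and it is not necessarily the exact plan Calcite produces):

-- Original query
select city from customers
intersect
select city from suppliers;

-- One equivalent form using JOIN and GROUP BY
select c.city
from customers c
join suppliers s on c.city = s.city
group by c.city;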

Calcite’s join reordering also allows bushy query plans to be generated (e.g. join tables A and B, then C and D, then join the results together, rather than just adding a table to the already-joined result each time).

Work is in progress on materialized view support, and future plans include collecting column statistics automatically, making better estimates of number of distinct values, and speeding up compilation. There should be an update on Hive performance on the Hortonworks blog in the next few weeks.

Running A Container Cloud On Yarn (Hortonworks)

Hortonworks builds, tests and releases open source software. As such, it does dozens of releases a year, with tens of thousands of tests per release across over a dozen Linux versions and multiple back-end databases. Therefore, they are looking to reduce overhead, and achieve greater density and improved hardware utilization.

Using a container cloud eliminates the bulk of virtualization overhead, improving density per node. Containers also help reduce image variance through composition. Startup time is fast, as there is no real boot sequence to run.

The building blocks for this cloud are:

  • YARN Container Runtimes – enable additional container types to make it easier to onboard new applications/services.
  • YARN Service Discovery – allow services running on YARN to easily discover one another.
  • YARN Native Services – enable long running YARN services.

For service discovery, the YARN Service Registry allows applications to register themselves, allowing discovery by other applications. Entries are stored in Zookeeper. The registry entries are exposed via the YARN DNS server, which watches the registry for changes and creates the corresponding DNS entry at the service level and container level.

 

The Columnar Roadmap: Apache Parquet and Apache Arrow (Julien Le Dem, Dremio)

Parquet and Arrow provide columnar storage for on-disk and in-memory respectively. The former has a focus on reading, with the expectation that data will be written once and read many times, whereas the latter is often for transient data and aims for maximisation of CPU throughput via efficient use of CPU pipelining, SIMD, and cache locality (which columnar structures support given that all the values for a given column are adjacent rather than interleaved with other columns).

The trade-off can be seen with e.g. Arrow having data in fixed positions rather than saving the space for NULL values, which gives better CPU throughput at the cost of some storage overhead.

The goal is for projects to adopt e.g. Arrow as a hub for interoperability, removing the need for duplicating functionality in many different projects, and also avoiding CPU costs of serialising/deserialising when moving data between different projects.

Exhibition Hall

As usual, a lot of people came to the Kognitio stand to talk about their practical problems in running SQL on Hadoop. Typically these revolve around connecting a lot of business users with their favourite tool (e.g. Tableau, MicroStrategy) to data stored in Hadoop. Even with a small number of users they tend to see issues, and they are then looking for a product which can give query performance on Hadoop with much higher levels of concurrency.

As mentioned in the query performance smackdown commentary above, this whitepaper has good information on benchmarking Impala, SparkSQL and Kognitio for industry-standard queries, including running with concurrency rather than a single stream, so if you read this post and have an SQL on Hadoop issue, that is a good reference point to start with.

A Review of Strata Data Conference, London 2017

Posted 27 May 2017 by Mark Chopping

The Strata Data Conference was held at the ExCeL in London this week. Sessions that were of interest to me included:

What Kaggle has learned from almost a million data scientists (Anthony Goldbloom, Kaggle)

This was part of the keynote on the first day. Kaggle is a platform for data science competitions, and has had almost 1 million users participate. Over 4 million models have been submitted to competitions, and this presentation covered the traits Kaggle have seen in successful entrants.

In particular, for structured data the trend is for competitors to initially explore the data via histograms etc. to get a better understanding of it, then create and select features for use in their approach, which typically involves a classifier. The work on features is more important than the choice of classifier, and successful competitors are usually very creative in choosing features (e.g. car colour type for predicting car resale value), and persistent, as most intuitions around feature selection/creation prove to have little correlation with the end goal. Finally, the best competitors tend to use version control for their models, to make it easier to track the success/failure of each approach.

Creating real-time, data-centric applications with Impala and Kudu (Marcel Kornacker, Cloudera)

The room was packed for this session as Marcel gave some background on Kudu (a relational store that can be used as an alternative to HDFS) and Impala. He explained that Kudu avoided the limitations on delete, update and streaming inserts  seen with HDFS, and the poor full table scan performance of HBase. As Kudu does not use HDFS for persistence, it handles its own replication, although this means it can’t e.g. benefit from the reduced storage overhead planned for HDFS in Hadoop 3. One workaround would be to only keep e.g. the latest 12 months worth of data in Kudu, and push older data into HDFS to benefit from its reduced storage overhead when that is available.

Kudu has been tested up to 275 nodes in a 3PB cluster, and internally uses columnar format when writing to disk, having collected records in RAM prior to this transposition. It allows range and hash partitioning to be combined. For example, you could use range partitioning by date, but then hash within a date to keep some level of parallelism when dealing with data for one particular date. Currently it only supports single-row transactions but the roadmap includes support for multi-row. From the examples given it appears there are some local predicates it cannot handle (e.g. LIKE with a regular expression), and batch inserts are reportedly slow. Multi-versioning is used as with Kognitio and many other products.
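A sketch of that combined scheme in Impala DDL against Kudu (the table, columns and boundaries are made up; check the exact syntax supported by your Impala and Kudu versions):

create table events (
    event_id   bigint,
    event_date bigint,   -- e.g. 20170101
    payload    string,
    primary key (event_date, event_id)
)
partition by
    hash (event_id) partitions 16,            -- keeps parallelism within a single date
    range (event_date) (
        partition values < 20170101,
        partition 20170101 <= values < 20170201,
        partition 20170201 <= values < 20170301
    )
stored as kudu;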

Impala can use all three of those storage options (Kudu, HDFS, HBase), has over 1 million downloads, and is reportedly in use at 60% of Cloudera’s customers.

Tuning Impala: The top five performance optimizations for the best BI and SQL analytics on Hadoop (Marcel Kornacker, Cloudera) 
Marcel started by going through some performance benchmarks involving Impala, and highlighted the importance of moving beyond single user benchmarks.
He then moved on to some techniques for improving Impala performance, including:

  • Partitioning: partitions can be eliminated by join lookup to generate run-time filters (what Kognitio used to call spider joins) – so if you join fact and date tables on a date key and have a local predicate on the date table, that predicate can be used to generate a list of relevant date keys, and that filter can be applied to the fact table before the join (see the sketch after this list). This appeared to be the biggest factor in Impala’s TPC-DS performance improvements in early 2016. Marcel advised regularly compacting tables to keep file and partition sizes as small as possible, and gave general advice to stick with fewer than 20,000 partitions (too few and you don’t eliminate enough data with filters, too many and you lose parallelism and put extra load on the name node etc.). As in the example above, partition on join keys to get benefit from run-time filters.
  • Sorting: this will be added in the next release. It is particularly useful as Parquet can store stats on e.g. min and max values within a page, so sorting can help eliminate some of those pages when there are too many column values for partitioning.
  • Use appropriate data types: some operations are a lot more expensive with certain data types (e.g. strings), so try to avoid using these expensive data types. Kognitio used to offer similar advice to customers before modifying their product to make operations like string aggregation as efficient as integer aggregation.
  • Complex schemas: parent-child relationships with nested collections offer physical colocation of data, giving a natural optimization. You need to use columnar storage for this approach as the resulting tables are wide.
  • Statistics: these take a long time to collect, so customers often ask if they can get the same effect by using e.g. optimiser hints to determine the order of joins. That is not the case, as statistics are used for far more than determining join ordering – e.g. scan predicates are ordered by selectivity and cost, the selectivity of scans is computed, join sides need to be determined for efficiency, the join type needs to be decided, and a decision on whether to apply run-time filters needs to be made (as presumably the cost of generating and applying these can be significant). The ability to collect statistics on samples is being added, which would speed up stats collection.
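To make the run-time filter idea from the partitioning point concrete, here is a sketch (the table and column names are made up):

-- fact_sales is partitioned on date_key; dim_date is a small dimension table
select sum(f.amount)
from fact_sales f
join dim_date d on f.date_key = d.date_key
where d.fiscal_quarter = '2016Q4';

-- At run time the scan of dim_date produces the set of matching date_key
-- values; that filter is applied to fact_sales before the join, so only
-- the relevant partitions of the fact table are read.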
A deep dive into Spark SQL’s Catalyst optimizer (Herman van Hovell tot Westerflier, Databricks)
In an early slide entitled “Why structure” Herman showed the performance benefit of using SQL for a simple aggregation task rather than working directly on RDDs with code. He then outlined the approach to query optimization used by Catalyst: ensuring the query is valid syntactically and semantically, generating a logical plan, optimizing that plan, then generating physical plans which have their cost compared until a final physical plan is selected.
He discussed the use of partial functions to specify transformations of plans (e.g. constant folding), and then showed how it was possible to write your own planner rules to be incorporated into optimization.
It wasn’t clear to me how the optimizer deals with having a vast number of potential rules to apply, with some being inapplicable at one stage in optimization, but then being valid later on in the process after other rules have been applied.
Journey to AWS: Straddling two worlds (Calum Murray, Intuit)
A very interesting talk on Intuit’s motivation for and subsequent execution of a migration from on-premise software to the cloud (Amazon, in their case).
The key takeaways were:
  • use a tip-toe approach rather than big-bang. Be in a position where you can flip back and forth from cloud for an application, then swap to cloud initially for a small number of minutes, gradually increasing to 24 hours or more.
  • swim-lane applications first, if possible, to allow for this approach (so you are only exposing a small subset of users to the change initially).
  • consider security implications – in their case, with over 10 million customers, they had to put extra encryption in place, use different encryption keys for on-premise and cloud, etc.

You can find speaker slides and videos from the conference at https://conferences.oreilly.com/strata/strata-eu/public/schedule/proceedings

Exhibition Hall

Conversations at the Kognitio stand in the exhibition hall reflected an increasing interest in how to address SQL on Hadoop problems with companies’ installed Hadoop systems. This was a marked step forward from last year’s conference, where a lot of the conversations were with companies that were just starting out with Hadoop and hadn’t done enough with Hive/Impala/Spark SQL to have encountered significant problems that they needed to solve.

Visiting my neighbour S3 and his son JSON

Posted 23 May 2017 by Chak Leung

They also live across the river and there’s no bridge…so we’ll just make our own!

Amazon’s S3 is a popular and convenient storage solution which many, especially those with big data, tend to utilise, and the challenge can be connecting to this large store that has been building up over days/weeks/months/years. There are many ways to do this: you could do a curl/wget, but that gives you the data in its raw form, and converting it for use with databases isn’t always simple.
With Kognitio external connectors and tables you can connect to and parse it on the fly.

Let’s see how we can do this with JSON data and external tables in Kognitio where we’ll be able to use externally stored data as if they were local tables. We’ll also be utilising this in a later blog post where we’ll have some larger data.

Why would you want to use this though? What are the benefits?

•  Access data stored elsewhere without having to physically move it - streamlining the ETL process
•  Use data as if it were stored locally - no need to rewrite proprietary scripts and processes
•  Easily update with the newest data and improve access times using memory images

Convinced? Great! Here’s what you’ll need to get started:

•  An S3 bucket with JSON data
•  Access and secret keys for the S3 bucket
•  Kognitio with support for external tables

You can read more about external tables in chapter 8 of the Kognitio guide here.

In a nutshell, the three typical components to this process are: stored data, a connector and an external table.

Let’s try it with our sample JSON file containing a customer’s form data.

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address":
    {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber":
    [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

Source: https://www.sitepoint.com/customer-form-json-file-example/

The first step is to put this into S3; you can use Bash’s curl or something like the boto package in Python to do this. You can also use something like S3 Browser to confirm that the file is in our bucket.

Now we can start building connections with external tables in Kognitio to access them. External tables don’t actually import the data from their source but bridge a connection to them. This enables us to access the data without having to go through arduous ETL processes to include the newest batch.

So firstly we need the bridge, an S3 based connector:

create connector my_test_connector source s3
target 'bucket <your_bucket> ,
accesskey <your_s3_access_key> ,
secretkey <your_s3_secret_key> ,
connectors_on_nodes "<node_name>",
proxy "<your_proxy_address>:<port>"';

The connectors_on_nodes option is only needed if your nodes have restricted external access and you need to tell the connector which ones can connect to the outside world. The same goes for the proxy: use it if required. Once you’ve created your connector, you can see it in the systems pane:

Connector in schema

Let’s see if our connector works, run:

external table from my_test_connector target 'list sample.json';

Files in S3

And we can see the file we placed in S3 and other information about it. Note that you can use the asterisk “*” like in Bash to include all JSON files e.g. “*.json”. If you have JSON files holding different information, you can always tag them with a prefix e.g. “prefix1*.json”. This will be useful when you need to build several external tables with different JSON sources.
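For example, assuming a hypothetical prefix1 tag, a listing like this should pick up just those files:

external table from my_test_connector target 'list prefix1*.json';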

Creating an external table uses the following syntax:

create external table <your_schema>.<your_table> (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,lastName
    ,age
"';

The first part of the query, enclosed in brackets, defines the columns in a similar way to typical table creation in SQL; this needs to match the definitions further down in the format string, and the column definitions need to be suitable for the data.
The next three lines dictate the connector you’re using, the target files, the error name for use with ipe_conv_error, and invalid record handling.
The last part tells the external table that we’ll be reading from the JSON format. This is done by setting fmt_json to 1 and then setting the format string to tell it what to look for. The APPLY() function takes a list of functions and applies them to every column, and firstvalid() returns the first evaluated argument which does not cause an error. So “APPLY(firstvalid(?,null))” applies firstvalid() to every column, nulling an entry if it doesn’t exist instead of erroring. Lastly we define the columns we want it to look for. The inputbasename() and objectnumber() functions put the filename and object number into the columns we defined as sourcefile and object_id further above. Then we have the JSON names themselves. JSON files can contain quite some depth, so if you need to access deeper entries you use a dotted path: if you had a JSON file containing student details, “Student[].year_group” would get the year group from the student array, and if year_group were another array inside “Student[]” you could extend it in the same way, “Student[].year_group[].name”.
Now we can access the data. External tables appear alongside regular tables in the schemas but have a red icon instead; hovering over one with the mouse cursor will identify it as such:

External table in schema

What you might have noticed is that this depends on you knowing what’s inside your JSON file, i.e. the entry names, and with larger files searching through them can be quite daunting. A feature to aid in this discovery is prettyprint(), which returns a readable view of whatever you supply to it, e.g. prettyprint(address). This is defined with the rest of the column definitions, after format. Let’s try it on the address in our sample JSON file using an inline external table (which returns results without saving them as an actual table):

external table (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
    ,addr_contents varchar(1000)
)
from my_test_connector
target 'file "sample.json"
,conv_error_name "TEST_S3", ignore_invalid_records
,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,inputbasename()
    ,objectnumber()
    ,firstName
    ,lastName
    ,age
    ,prettyprint(address)
"';

Pretty printing with JSON files

Now we can see the contents of the address and can pull things from there into their own columns. Try it for the phone number; you will need to add “[]” as it’s an array.
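For instance, pulling the address fields into their own columns might look something like this (the new column names are made up, and the dotted names assume the nested-access form described earlier, so treat it as a sketch):

external table (
    first_name varchar(20)
    ,home_city varchar(50)
    ,home_state varchar(10)
)
from my_test_connector
target 'file "sample.json"
,conv_error_name "TEST_S3", ignore_invalid_records
,fmt_json 1, format "APPLY(firstvalid(?,null))
    ,firstName
    ,address.city
    ,address.state
"';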
A tip for using data from external tables: the connector still needs to connect to S3 and parse the JSON data on the fly, which can be slow considering the number of variables involved, such as connections, proxies, the amount of data etc. This slow access certainly won’t be pleasant when we want to use the data, so what we can do is create a view image of this table.

create view <your_schema>.<your_view> as
    select * from <your_schema>.<your_table>;
create view image <your_schema>.<your_view>;

The view image is essentially a snapshot of the external table data in memory which can be renewed for newer data by simply recreating it. This can be done manually but it’s recommended that you create a bash script to submit the query via wxsubmit and then schedule this to run hourly/daily/weekly via something like cron. It’s also a good idea to do any sort of cleaning or transforming at this view creation stage instead of “select *” so that it’s ready to use.
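As a sketch, any cleaning or transforming can be folded into that view definition so the image is ready to use (the view name here is made up and the column names follow the earlier example):

create view <your_schema>.customer_form as
    select upper(first_name) as first_name,
           upper(last_name) as last_name,
           c_age
    from <your_schema>.<your_table>
    where c_age is not null;
create view image <your_schema>.customer_form;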

Next time we’ll use this with a much larger data set including visuals with Tableau and insights with external scripts.

The loneliest railway station in Britain

Posted 18 May 2017 by Graeme Cole

In my last blog post, I introduced Kognitio’s ability to flatten complex JSON objects for loading into a table. Today we’ll look at another example using real-world Ordnance Survey data. We’ll also look at what you can do if the JSON files you need to load are in HDFS. We’ll use these techniques to solve the following example problem…

Which railway station in mainland Britain is the furthest straight-line distance from its nearest neighbour? The fact that the answer is Berwick-upon-Tweed may surprise you!


Hadoop… Let’s not throw the baby out with the bath water again!

Posted 3 May 2017 by Roger Gaskell
Categories: #AnalyticsNews, Blog

Here we go again! Suddenly the industry seems to have turned on Hadoop. Headlines saying “it’s hit the wall” and “it’s failed” have recently appeared and some are suggesting that organisations look at alternative solutions. Granted, Hadoop has its limitations and has not lived up to the massive hype that surrounded it a year or two ago, but then nothing ever does.

I admit I was not a fan of Hadoop when it first appeared; it seemed like a step backwards. It was very complicated to install, unreliable, and difficult to use, but still it caught the industry’s imagination. Engineers liked it because it was “proper engineering”, not a shrink-wrapped, productionised product, and the business was seduced by the idea of free software. Pretty quickly it became an unstoppable runaway train and the answer “to life, the universe and everything” was no longer 42 but Hadoop.

Great expectations generally lead to disappointment and this is Hadoop’s problem. We hyped it up to such an extent that it was always going to be impossible for it to live up to the expectations, no matter how much it improved, and it has, immeasurably! Hadoop is following the Gartner Hype Cycle (one of the cleverest and most accurate representations of how the perception of technology evolves) perfectly. It’s just that for Hadoop the curve is enormous!

So what do I mean by let’s not throw out the baby with the bathwater again? In Hadoop’s early days the hot topic was NoSQL. The message was SQL was dead. The problem with SQL was that it was difficult to write the complicated mathematical algorithms required for Advanced Analytics and, as the name suggests, it relies on the data having structure. Advanced analytical algorithms are easier to implement, and unstructured data easier to handle, in languages such as “R” and Python. All perfectly true, but advanced analytics is just the tip of the data analytics triangle and the rest of the space is very well served by traditional SQL. Traditional BI reporting and self-service data visualization tools are still massively in demand and generally use SQL to access data. Even unstructured data is usually processed to give it structure before it is analysed. So when the NoSQL bandwagon claimed SQL was dead, they were effectively throwing out the most widespread and convenient method of business users getting access to Hadoop based data, in favor of something that only developers and data scientists could use.

Of course sense eventually prevailed, NoSQL morphed into Not-Only SQL, and now everyone and his brother is implementing SQL on Hadoop solutions. The delay has been costly though and the perceived lack of fully functional, high performance SQL support is one of the key reasons why Hadoop is currently under pressure. I say perceived because there are already very good SQL on Hadoop solutions out there if people are willing to look outside the Apache box, but this is not a marketing piece so I will say no more on that subject. My point is that the IT industry has a history of using small weaknesses to suddenly turn on otherwise very useful technologies. There will always be those whose interests are best served by telling the industry that something is broken and we need to throw it away and start again. The IT industry’s problem is that it is often too easily led astray by these minority groups.

Hadoop has come a long way in a short time and although it has problems there is a large community of people working to fix them. Some point to the lack of new Apache Hadoop projects as a sign of Hadoop’s demise; I would argue that this is a positive thing, with the community now focused on making and finding stuff that works properly rather than constantly focusing on the shiniest, new cool project! I think that Hadoop is finally maturing.

This post first appeared on LinkedIn on April 18, 2017.

External tables and scripts and user mapping

Posted 27 Apr 2017 by Ben Cohen
Tags: external scripts, tables, user mapping, SQL query
Categories: Blog, Kognitio How To

Introduction

Kognitio has two mechanisms for external programs to provide or operate on data during an SQL query: external tables and external scripts. These can be written in the user’s preferred language, for example R, Python, shell or even C.

External tables are used like normal tables in SQL queries but the rows are supplied by an external “connector” program rather than being stored by WX2. The connector could generate the rows itself (TPC-H data, for example) or fetch them from another database, HDFS or an internet data source.

External scripts are table functions in SQL which take a subquery “argument” and return a table. They allow users to create complex analytics or achieve functionality that the database itself doesn’t provide. The script is run under the “script environment” interpreter with the subquery rows as input, and the output rows will be returned to the SQL query.

Here is an example of an external table and an external script. There are more examples in
wx2/current/examples/excon_* and wx2/current/examples/exscripts.sql under your WX2 install directory.

-- External table to return the current environment for each connector instance.
create connector EXCON_ENV
    -- adjust this path if required
    command '/opt/kognitio/wx2/current/examples/excon_env';
create external table ENV (CONNECTOR_NUM int, 
                           NAME varchar(10000),
                           VALUE varchar(10000))
    from EXCON_ENV;

-- Results: environment of each connector
--          e.g. (0, 'WX2_TNO', '629'), (0, 'WX2_METHOD', 'LOAD'), ...
select * from ENV;

-- External script to perform an arbitrary regular expression substitution
-- using the stream editor, sed.
create script environment SED command '/bin/sed -f';
create external script REGEXP environment SED
    receives (TEXT varchar) input 'format "%{quoted=0}s\\n"'
    sends (SUBST_TEXT varchar) output 'format "%0c\\n"'
    script 's/ko\(.*\)/i\1n/';

-- Results: ('ignition')
external script REGEXP from (select 'kognitio');

Kognitio on Hadoop

The following sections on user mapping apply only to Kognitio running in appliance mode.

For Kognitio on Hadoop, external connectors and scripts currently run as the same user that runs the Kognitio on Hadoop processes. This is expected to change in the future to a broadly similar scheme tailored to Hadoop.

Why map users?

The examples above are simple, so it doesn’t matter which Linux user runs the external script or connector processes, but it can matter in practice. You might want your script to run as a user that has already been set up, perhaps with SSH keys or a uid needed for NFS.

It is also preferable not to execute user-supplied code as a privileged user – or in this case one with access to database files.

Kognitio chooses the Linux user based on the Kognitio user and the query context. By default – unless it has been configured otherwise as described below – it will use the user wxextern, which is created when WX2 is installed. (But wxextern is not created automatically by upgrade from versions earlier than 8.1.)

Allowed Linux users

Kognitio doesn’t simply allow scripts or connectors to run as any Linux user. By default only wxextern is allowed. Any other user must be allowed by name or uid (user id) by editing the “[security]” section of the config file as follows.

A Linux user is allowed if:

  1. The user name is listed in the “ext_users_allowed” config file setting, a comma-separated list of user names; or
  2. The config file settings “min_ext_user_id” and “max_ext_user_id” are set to positive numbers with min_ext_user_id <= uid <= max_ext_user_id, and the user is not listed in the “ext_users_denied” comma-separated list of user names.

(By default ext_users_allowed is “$extern” and ext_users_denied is “root“. Strictly, in (2) only one of min_ext_user_id and max_ext_user_id needs to be non-negative, with the inequality ignored for the negative value.)

You need to restart WX2 (“wxserver start“) after changing any of these config file settings for them to take effect.

User map strings

Kognitio attempts to map from a Kognitio user to a Linux user using a user map string. This is a comma-separated list of any of the following elements:

User map element    Linux user to match
name                User name
&                   Same as Kognitio user (e.g. SYS)
$extern             Kognitio extern user, by default wxextern
$admin              Kognitio admin user, by default wxadmin
$master             Kognitio master user, by default wxroot
$owner              Kognitio owner user, by default root

It will try each name in turn and will use the first one which, when converted to lower case, gives the name of a Linux user which is allowed by the security parameters. If no valid match is found then it will fail with the error “ET0111: User mapping for the external operation was invalid”.

For example, given the user map string “kognitio,&,$extern” and a Kognitio user called MEPHISTOPHELES, it will try the Linux users kognitio, mephistopheles and wxextern.

Note that this is evaluated separately on each node, so it is recommended that the Linux user names and uids are synchronised across the cluster. Otherwise it is possible for different user names to be matched, the same name being matched but the uids differing, or names being matched on only a subset of nodes (the ET0111 error will be raised in this case).

If an ET0111 error is raised the Kognitio user and the user map string are logged, as of version 8.2, in the server debug log file (look at `wxlogd smd`/serverdbg.*), which should be helpful to see why it didn’t work. For example:

T_2017-04-21_09:06:12_BST: ET: Invalid user mapping (ET0111) on 'hp-rack2-enc2-11' for 'SYS' -> 'kognitio,&,$extern'

Which user map string is used?

When Kognitio is about to launch an external script or connector it obtains a user map string. It first looks in IPE_EXTERNAL_USER_MAP for a user map string based on the execution context; if it doesn’t find one then it uses the config file setting “[boot options] default_usermap“, which defaults to “&,$extern“. It then tries to map the current Kognitio user to a Linux user as described above.

For external scripts the search is performed in this order matching against the columns (OBJECT_TYPE, OBJECT_ID, USER_ID) of IPE_EXTERNAL_USER_MAP.

OBJECT_TYPE                  OBJECT_ID              USER_ID
PRIVDOMAIN_EXT_SCRIPT(13)    Script Id              User Id
PRIVDOMAIN_EXT_SCRIPT(13)    Script Id              -1
PRIVDOMAIN_SCRIPT_ENV(10)    Script Environment Id  User Id
PRIVDOMAIN_SCRIPT_ENV(10)    Script Environment Id  -1
PRIVDOMAIN_EXT_SCRIPT(13)    -1                     User Id
PRIVDOMAIN_EXT_SCRIPT(13)    -1                     -1
-1                           -1                     User Id
-1                           -1                     -1

For external tables the search is performed in this order matching against the columns (OBJECT_TYPE, OBJECT_ID, USER_ID) of IPE_EXTERNAL_USER_MAP.

OBJECT_TYPE                  OBJECT_ID       USER_ID
PRIVDOMAIN_TABLE(3)          Table Id        User Id
PRIVDOMAIN_TABLE(3)          Table Id        -1
PRIVDOMAIN_CONNECTOR(9)      Connector Id    User Id
PRIVDOMAIN_CONNECTOR(9)      Connector Id    -1
PRIVDOMAIN_TABLE(3)          -1              User Id
PRIVDOMAIN_TABLE(3)          -1              -1
-1                           -1              User Id
-1                           -1              -1

The administrator can add and remove mappings in IPE_EXTERNAL_USER_MAP using this special syntax.

CREATE EXTERNAL USER MAP
  [OBJECT_TYPE=[EXTERNAL SCRIPT|SCRIPT ENVIRONMENT|EXTERNAL TABLE|CONNECTOR],]
  [OBJECT_ID = 1234,]
  [USER_ID = 1234,]
  USERMAP = '<user map>';

DROP EXTERNAL USER MAP
  [OBJECT_TYPE=[EXTERNAL SCRIPT|SCRIPT ENVIRONMENT|EXTERNAL TABLE|CONNECTOR],]
  [OBJECT_ID = 1234,]
  [USER_ID = 1234];
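For example, to force a particular external script (a hypothetical object id 1234) to always run as the extern user, and later to remove that mapping:

CREATE EXTERNAL USER MAP
  OBJECT_TYPE = EXTERNAL SCRIPT,
  OBJECT_ID = 1234,
  USERMAP = '$extern';

DROP EXTERNAL USER MAP
  OBJECT_TYPE = EXTERNAL SCRIPT,
  OBJECT_ID = 1234;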

Hadoop’s biggest problem, and how to fix it

Posted 5 Apr 2017 by Mark Chopping

Introduction

Hadoop was seen as a silver bullet for many companies, but recently there has been an increase in critical headlines like:

  1. Hadoop Has Failed Us, Tech Experts Say
  2. You’re doing Hadoop and Spark wrong, and they will probably fail
  3. Has Hadoop Failed? That’s the Wrong Question

The problem

Dig behind the headlines, and a major issue is the inability of users to query data in Hadoop in the manner they are used to with commercial database products.

From the Datanami article:

  • Hadoop’s strengths lie in serving as a cheap storage repository and for processing ETL batch workloads, Johnson says. But it’s ill-suited for running interactive, user-facing applications
  • It’s better than a data warehouse in that [you] have all the raw data there, but it’s a lot worse in that it’s so slow
  • “At the Hive layer, it’s kind of OK. But people think they’re going to use Hadoop for data warehouse…are pretty surprised that this hot new technology is 10x slower than what they’re using before,” Johnson says. “[Kudu, Impala, and Presto] are much better than Hive. But they are still pretty far behind where people would like them to be.”

The Register article based on a Gartner research talk recognises Hadoop’s strength for ETL processing, but highlights the issues with SQL-handling on Hadoop.

The Podium Data article states “Hadoop is terrible as a relational database”, and “Hadoop failed only in the sense that inflated expectations could never be met compared to mature commercial offerings.”

“The Growing Need for SQL for Hadoop” talks about the need for SQL for Hadoop. The ideal is to be “on Hadoop”, and thus processing data within the Hadoop cluster, rather than “off Hadoop” where data has to be extracted from Hadoop for processing.

Similarly, Rick van der Lans talks about “What Do You Mean, SQL Can’t Do Big Data?”, emphasising the need for SQL solutions when working with big data platforms.

RCA of the problem

There can be many reasons for current SQL-on-Hadoop products not being performant.

Possibilities include:

  • overhead of starting and stopping processes for interactive workloads – to run relatively simple queries quickly, you need to reduce latency. If you have a lot of overhead for starting and stopping containers to run tasks, that is a big impediment to interactive usage, even if the actual processing is very efficient
  • product immaturity – a lot of commercial databases have built on the shoulders of giants. For example, this wiki lists a set of products that derive from PostgreSQL, including Greenplum, Netezza, ParAccel, Redshift, Vertica. This gives these products a great start in avoiding a lot of mistakes made in the past, particularly in areas such as SQL optimisation. In contrast, most of the SQL-on-hadoop products are built from scratch, and so developers have to learn and solve problems that were long-since addressed in commercial database products. That is why we see great projects like Presto only starting to add a cost-based optimiser now, and Impala not being able to handle a significant number of TPC-DS queries (which is why Impala TPC-DS benchmarks tend to show less than 80 queries, rather than the full 99 from the query set).
  • evolution from batch processing – if a product like Hive starts off based on Map-Reduce, its developers won’t start working on incremental improvements to latency, as they won’t have any effect. Similarly, if Hive is then adopted for a lot of batch processing, there is less incentive to work on reducing latency. Hive 2 with LLAP project aims to improve matters in this area, but in benchmarks such as this AtScale one reported by Datanami it still lags behind Impala and SparkSQL.

Solution

Whilst benchmarks show that SQL on Hadoop solutions like Hive, Impala and SparkSQL are all continually improving, they still cannot provide the performance that business users need.

Kognitio have an SQL engine originally developed for standalone clusters of commodity servers, and used by a host of enterprise companies. Due to this heritage, the software has a proven history of working effectively with tools like Tableau and MicroStrategy, and delivering leading SQL performance with concurrent query workloads – just the sort of problems that people are currently trying to address with data in Hadoop. The Kognitio SQL engine has been migrated to Hadoop, and could be the solution a lot of users of Hive, Impala and SparkSQL need today.

It has the following attributes:

  • free to use with no limits on scalability, functionality, or duration of use
  • mature in terms of query optimisation and functionality
  • performant, particularly with concurrent SQL query workloads
  • can be used both on-premise and in the cloud

For further information about Kognitio on Hadoop, see kognitio.com.

 

This post first appeared on LinkedIn on March 23, 2017.

Simple performance checks against your hardware cluster

Posted 24 Mar 2017 by Simon Darkin
Tags: performance hardware cluster, CPU, benchmarks

Kognitio have a lot of experience commissioning clusters of new hardware for our MPP software product. As part of that process, we’ve developed a number of steps for validating the performance of new clusters, and these are the topic of this blog entry.

 

There are many Linux-based benchmarking tools on the market; however, they are not usually installed by default, in which case some simple command line tools can be used to quickly establish if there is a potential hardware issue that warrants further investigation. The following hardware components are covered by this topic:

  • CPU
  • Disk
  • Networking
  • RAM

 

CPU   

A slow CPU or core could have an adverse effect on query performance, and so with the use of basic command line tools you can help identify laggards.  A background ‘for’ loop can be employed to ensure all cores/threads are tested simultaneously.

 

Integer arithmetic test

Invoked 8 times to run simultaneously against 8 cores

 

for i in `seq 1 8`; do time $(i=0; while (( i < 999999 )); do (( i ++ )); done)& done; wait

 

this will return the time taken to increment an integer over the specified range.  A comparison of the time taken by each core will help identify outliers

 

real    0m8.908s
user    0m8.765s
sys     0m0.140s

real    0m8.943s
user    0m8.789s
sys     0m0.156s

real    0m8.997s
user    0m8.761s
sys     0m0.112s

real    0m9.000s
user    0m8.853s
sys     0m0.144s

real    0m9.023s
user    0m8.881s
sys     0m0.140s

real    0m9.028s
user    0m8.861s
sys     0m0.168s

real    0m9.034s
user    0m8.857s
sys     0m0.176s

real    0m9.073s
user    0m8.781s
sys     0m0.156s

 

Whilst the test is running you can check that each core is under load by running top and expanding the output to show all cores.  If you do encounter outliers in the arithmetic test then you can use the output from top to identify which core(s) remain busy when others have finished

 

Cpu0  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu1  : 99.0%us,  1.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu2  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu3  : 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu4  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu5  : 98.0%us,  2.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu6  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu7  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

 

Compression test

As with the arithmetic test this example loops around 8 times so that 8 cores are tested simultaneously.  Data is written to /dev/null to avoid any overhead associated with disk IO.

 

for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait

 

this will return the rate at which each core is able to compress 1 GB of data

 

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9545 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9799 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.9831 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.0085 seconds, 83.3 MB/s

1000000000 bytes (1.0 GB) copied, 12.0382 seconds, 83.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.2655 seconds, 81.5 MB/s

 

With Kognitio software installed you can use the wxtool command  to run the compression test simultaneously against all database nodes to aid comparison across the cluster as a whole.  You can download the software for free at http://kognitio.com/free-download/

 

wxtool -a '{can DB}' -S 'for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait'

 

For node kap1-1 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.9422 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9659 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 11.9876 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.0142 seconds, 83.2 MB/s

1000000000 bytes (1.0 GB) copied, 12.1293 seconds, 82.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3754 seconds, 80.8 MB/s

1000000000 bytes (1.0 GB) copied, 12.4132 seconds, 80.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.4386 seconds, 80.4 MB/s

For node kap1-3 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.8398 seconds, 84.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.8661 seconds, 84.3 MB/s

1000000000 bytes (1.0 GB) copied, 11.8893 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9165 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.946 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.953 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9637 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.2996 seconds, 81.3 MB/s

For node kap1-3 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.8757 seconds, 84.2 MB/s

1000000000 bytes (1.0 GB) copied, 11.8846 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9178 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9243 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9377 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9834 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3367 seconds, 81.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.3942 seconds, 80.7 MB/s

For node kap1-4 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.91 seconds, 84.0 MB/s

1000000000 bytes (1.0 GB) copied, 11.9291 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9448 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9498 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.1232 seconds, 82.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.3896 seconds, 80.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.4449 seconds, 80.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.4504 seconds, 80.3 MB/s

 

 

 

DISKS

Having just one underperforming disk in the system can significantly impact query performance against disk based tables. Here are some simple tests to help identify any anomalies.

 

Iterative write speed test with dd.

 

WARNING: THIS WRITE TEST WILL OVERWRITE DATA AT THE SPECIFIED DESTINATION AND SO MAKE SURE YOU AVOID AREAS THAT CONTAIN SYSTEM DATA OR USER DATA THAT YOU WISH TO KEEP

 

for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done

 

this will return the duration and rate at which data can be written out to disk. In this example 1 GB of data is repeatedly written to a raw partition. Note that fsync is used to flush the writeback cache and ensure data is written to the physical media.

 

Loop 1
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 13.6466 seconds, 73.3 MB/s

Loop 2
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.8324 seconds, 77.9 MB/s

Loop 3
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.4271 seconds, 80.5 MB/s

 

With Kognitio software installed, the test can be expanded to run on all database nodes allowing for easy comparison of all disks in the system

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done'

 

Iterative read speed test with dd

 

for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done

 

this will return the rate at which data can be read from disk.  In this example 1 GB of data is being read from a raw partition, adjusting the offset and flushing the buffer on each iteration to ensure data is being read from the physical media.

 

Loop 1 - skip = 5000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 14.4355 seconds, 69.3 MB/s

Loop 2 - skip = 10000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.9884 seconds, 77.0 MB/s

Loop 3 - skip = 15000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.6045 seconds, 79.3 MB/s

 

With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done'

 

Iterative read speed test with hdparm

 

for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done

 

this will return the rate at which data can be read sequentially from disk without any file system overhead.

 

Loop 1
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  236 MB in  3.01 seconds =  78.40 MB/sec

Loop 2
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  236 MB in  3.02 seconds =  78.09 MB/sec

Loop 3
/dev/cciss/c0d0p2:
Timing O_DIRECT disk reads:  230 MB in  3.01 seconds =  76.30 MB/sec

 

With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.

 

wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done'

 

Disk based table scan

 

If the cluster is running Kognitio database software you can initiate a scan of a large disk based table and review the output from wxtop in order to spot any disk store processes that remain busy for a significant period after others have finished.   For accurate results you should ensure there is no concurrent activity when performing this test.

 

select *
from <large disk based table>
where <condition unlikely to be true>;

 

Monitor the output from wxtop and look out for any disk store processes that remain busy when all or most others have finished.

 

PID       NODE        PROCESS                           SIZE      TIME
15784       kap1-1      WXDB(55): Diskstore             258036       100
22064       kap1-2      WXDB(18): Diskstore             257176        86
25179       kap1-3      WXDB(73): Diskstore top         258200        84
31237       kap1-4      WXDB(37): Diskstore             258068        77

 

If a disk store process does appear to lag behind, then you should eliminate the possibility of it being attributable to data skew by checking the row counts across all of the disks using the following query

 

select
mpid,
sum(nrows) nrows
from ipe_ftable
where table_id = <table_id being scanned>
group by 1
order by 2 desc;

 

 

NETWORKING

You can test the network links between nodes using some simple netcat commands.  This will allow you to spot links that are underperforming.

 

Link speed test using dd and netcat

 

The name and options associated with the netcat binary will depend on the Linux installation; however, with Kognitio software installed you can use wxnetread and wxnetwrite for the data transfer regardless.

 

Setup a listening process on the node performing the read

 

netcat -l -p 2000 > /dev/null &

 

use dd to generate some data and pipe through netcat to the IP and port of the node performing the read

 

dd if=/dev/zero bs=1000 count=1000000 | netcat 14.4.21.2 2000

 

this will return the rate at which data can be copied over the link

 

1000000000 bytes (1.0 GB) copied, 8.54445 seconds, 117 MB/s

 

The same test as above this time using wxnetread/wxnetwrite

 

wxnetread -l -p 2000 > /dev/null &

 

dd if=/dev/zero bs=1000 count=1000000 | wxnetwrite 14.4.21.2 2000

 

1000000000 bytes (1.0 GB) copied, 8.5328 seconds, 117 MB/s

 

Shape tests

With Kognitio software installed you can run Shape tests to measure the speed at which RAM based data is re-distributed between nodes

 

wxtester -s <dsn> -u sys -p <password> -Ishape 5000 9000 1

 

Once the tests have been running for a few minutes you can navigate to the logs directory and check the data rate

 

cd `wxlogd wxtester`
grep TSTSHN results | gawk '{ if ($5==64) print ((($3*$5)/$6)/<number of database nodes>)/1048576 }'

 

With older generation hardware you can expect to see performance of 40MB/s/node given sufficient network bandwidth. With newer hardware, for example HP Gen9 servers with 2x 56Gb/s links per node, this increases to 90MB/s/core.

 

 

RAM

Benchmarking RAM performance is best left to a dedicated test suite; however, you can perform a very simple write/read speed test using dd in conjunction with a temporary file storage facility in RAM, which at the very least can show up a mismatch in performance between nodes.

 

Write and read speed test using dd

 

mkdir RAM
mount tmpfs -t tmpfs RAM
cd RAM
dd if=/dev/zero of=ram_data bs=1M count=1000

 

this will return the rate at which data is written to RAM

 

1048576000 bytes (1.0 GB) copied, 1.06452 seconds, 985 MB/s

 

dd if=ram_data of=/dev/null bs=1M count=1000

 

this will return the rate at which data is read from RAM

 

1048576000 bytes (1.0 GB) copied, 0.6346 seconds, 1.7 GB/s

 

Strata + Hadoop World – San Jose

Posted 22 Mar 2017 by Sharon Kirkham
Categories: #AnalyticsNews, Blog

The Kognitio team had a great trip to Strata + Hadoop World in San Jose last week and we would like to say a big thank you to everyone who stopped by for a chat about getting enterprise-level performance for their SQL on Hadoop. We look forward to hearing from you when you try out Kognitio on Hadoop.

At the start of the conference we released our benchmarking whitepaper in which Kognitio outperformed Impala and Spark in a TPC-DS benchmarking exercise. This proved to be of great interest and kept us all really busy on the stand. Conversations ranged from people who have been using Hadoop a while and are having problems serving data to their end-user applications such as Tableau and Qliksense right through to those that are just starting out on their Hadoop journey and wanted to understand what Kognitio can bring to their solution stack.

The subject matter of the conference sessions indicates that there is a period of consolidation going on within the Apache® Hadoop® solution stack. Most topics were discussing how to get the most from more established projects and the challenges of enterprise adoption. There was very little new research presented which was a bit disappointing.

 

Marcel Kornacker and Mostafa Mokhtar from Cloudera presented a talk on optimising Impala performance that was really interesting. They had also been using the TPC-DS query set for benchmarking but obviously had to use a cut down version of the query set (75 out of 99 queries). The optimisation details will be useful for us to follow for Impala when we do the next round of benchmarking after Kognitio 8.2 is released in April. Their benchmarks were at the 1 TB and 10TB scale. Increasing scale to 10TB and concurrency above 10 streams is something that we would definitely like to do during the next set of benchmarks.

From a maths perspective it was great to see Bayesian inference in the data science mix. Michael Lee Williams from Fast Forward Labs presented a great overview. I will certainly be checking out some of the algorithms and tools with a view to parallelising them within Kognitio’s external scripting framework.

Data streaming also continues to be at the forefront of the conference. It was clear from the number of sessions that more companies (such as Capital One) have experiences they want to share, as well as plenty of contributions from established technology leaders such as Confluent. It is certainly something that we are thinking about here.

If you didn’t make it to our booth at San Jose we hope to see you at one of these upcoming events:

DWS17, Munich

We’ll be on Booth #1003.

See us at the next Strata Data Conference in London

23-25 May 2017

Booth #511.

 

Participate in the Kognitio Console beta test program

Posted 10 Mar 2017 by Michael Atkinson
Tags: Kognitio Console, beta test program
Categories: Blog, Kognitio How To

Kognitio Console is Kognitio’s client side management program for the Kognitio Analytical Platform.

Some of its features are:

  • It allows inspection of the metadata tree for schemas, tables, views, external scripts, users, connectors, sessions, queues, etc.
  • It also gives an object view to each of these metadata objects, allowing their inspection and management.
  • There are also lots of tools, wizards and widgets to browse data in Hadoop, load and unload data, identify problem queries and many more.
  • There are also a set of reports and dashboards to monitor the state of Kognitio systems.
  • Macros may be written to extend Kognitio Console; the reports and dashboards are written in these XML macros.
  • Ad-Hoc queries may be executed.
  • KogScripts may be executed and debugged. KogScript is based on Lua but has Kognitio specific enhancements to run SQL natively.
  • It is an integrated development environment (IDE) for KogScripts, and external scripts running in Kognitio.

All this makes Kognitio Console a versatile tool, suitable for database admins, analysts and power users.

Kognitio Console is in constant development; beta and release candidate versions are provided from our update website. Betas and release candidates will be announced on the Kognitio Console forums.

There are two ways of obtaining these betas:

Note that this is beta software, so it has not gone through a full QA. However, we endeavour to make sure it has no obvious bugs and it will have passed our Console smoke tests. If you experience crashes or other bugs or deficiencies in these beta and release candidates, please report them to us; this will help us make the next release more stable.

By installing it into a different location than the default, you can have both this and the last stable releases installed at the same time.
