Visiting my neighbour S3 and his son JSON


Posted By: Chak Leung

They also live across the river and there’s no bridge…so we’ll just make our own!

Amazon’s S3 is a popular and convenient storage solution which many, especially those with big data, tend to utilise, and the challenge can be connecting to this large store that has been building up over days, weeks, months or years. There are many ways to do this: you could use curl or wget, but that returns the data in its raw form, and converting it for use with databases isn’t always simple.
With Kognitio external connectors and tables you can connect to and parse it on the fly.

Let’s see how we can do this with JSON data and external tables in Kognitio where we’ll be able to use externally stored data as if they were local tables. We’ll also be utilising this in a later blog post where we’ll have some larger data.

Why would you want to use this though? What are the benefits?

•  Access data stored elsewhere without having to physically move it - streamlining the ETL process
•  Use data as if it were stored locally - no need to rewrite proprietary scripts and processes
•  Easily update with the newest data and improve access times using memory images

Convinced? Great! Here’s what you’ll need to get started:

•  An S3 bucket with JSON data
•  Access and secret keys for the S3 bucket
•  Kognitio with support for external tables

You can read more about external tables in chapter 8 of the Kognitio guide here.

In a nutshell, the three typical components of this process are: stored data, a connector and an external table.

Let’s try it with our sample JSON file containing a customer’s form data.

{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {
            "type": "home",
            "number": "212 555-1234"
        },
        {
            "type": "fax",
            "number": "646 555-4567"
        }
    ]
}

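Before connecting anything to it, it’s worth sanity-checking the file’s structure locally. This minimal Python sketch (using the same record; the address and phoneNumber object names are the ones referred to later in this post) shows which entries are top-level scalars and which sit inside nested objects and arrays, which matters later when describing columns to the external table:

```python
import json

# The sample customer record stored as sample.json in the S3 bucket.
record = json.loads("""
{
    "firstName": "John",
    "lastName": "Smith",
    "age": 25,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021"
    },
    "phoneNumber": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "fax", "number": "646 555-4567"}
    ]
}
""")

# Top-level scalars map directly onto columns...
print(record["firstName"], record["age"])
# ...while nested entries need a path such as address.city,
# and arrays such as phoneNumber need "[]" to reach their elements.
print(record["address"]["city"])
print([p["number"] for p in record["phoneNumber"]])
```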

The first step is to put this into S3; you can use curl from Bash or something like the boto package in Python to do this. You can also use a tool like S3 Browser to confirm that the file is in your bucket.

Now we can start building connections with external tables in Kognitio to access them. External tables don’t actually import the data from their source but bridge a connection to them. This enables us to access the data without having to go through arduous ETL processes to include the newest batch.

So firstly we need the bridge, an S3 based connector:

create connector my_test_connector source s3
target 'bucket <your_bucket>,
accesskey <your_s3_access_key>,
secretkey <your_s3_secret_key>,
connectors_on_nodes "<node_name>",
proxy "<your_proxy_address>:<port>"';

The connectors_on_nodes option is only needed if your nodes have restricted external access and you need to tell the connector which ones can connect to the outside world. The same goes for the proxy setting: use it if required. Once you’ve created your connector, you can see it in the systems pane:

Connector in schema

Let’s see if our connector works, run:

external table from my_test_connector target 'list sample.json';

Files in S3

And we can see the file we placed in S3 and other information about it. Note that you can use the asterisk “*” like in Bash to include all JSON files e.g. “*.json”. If you have JSON files holding different information, you can always tag them with a prefix e.g. “prefix1*.json”. This will be useful when you need to build several external tables with different JSON sources.

Creating an external table uses the following syntax:

create external table <your_schema>.<your_table> (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null)) inputbasename(), objectnumber(), firstName, lastName, age"';

The first part of the query, enclosed in brackets, defines the columns just like typical table creation in SQL; these need to match the definitions further down in the format string, and the column types need to be suitable.
The next three lines specify the connector you’re using, the target files, the error name for use with ipe_conv_error, and how invalid records are handled.
The last part tells the external table that we’ll be reading from the JSON format. This is done by setting fmt_json to 1, and then setting the format string to tell it what to look for. The APPLY() function takes a list of functions and applies them to every column, while firstvalid() returns the first evaluated argument which does not cause an error. So “APPLY(firstvalid(?,null))” applies firstvalid() to every column, nulling an entry if it doesn’t exist instead of erroring. Lastly we define the columns we want it to look for. The inputbasename() and objectnumber() functions put the filename and object number into the columns we defined as sourcefile and object_id above, and then we have the JSON names themselves. JSON files can contain quite some depth, so deeper entries are reached with a path: if you had a JSON file containing student details, “Student[].year_group” would get the year group from the student array. If year_group were another array inside “Student[]”, you could extend the path in the same way: “Student[].year_group[].name”.
Now we can access the data. External tables appear alongside regular tables in the schemas but have a red icon instead; hovering over one with the mouse cursor will identify it as an external table:

External table in schema

What you might have noticed is that this depends on you knowing what’s inside your JSON file, i.e. the entry names, and with larger files, searching through them can be quite daunting. A feature to aid in this discovery is prettyprint(), which returns a readable view of whatever you supply to it, e.g. prettyprint(address). This is defined after format, along with the rest of the column definitions. Let’s try it on the address in our sample JSON file using an inline external table (which returns results without saving them as an actual table):

external table (
    sourcefile varchar(500)
    ,object_id int
    ,first_name varchar(20)
    ,last_name varchar(20)
    ,c_age int
    ,addr_contents varchar(1000)
)
from my_test_connector
target 'file "sample.json"
    ,conv_error_name "TEST_S3", ignore_invalid_records
    ,fmt_json 1, format "APPLY(firstvalid(?,null)) inputbasename(), objectnumber(), firstName, lastName, age, prettyprint(address)"';

Pretty printing with JSON files

Now we can see the contents of the address and can pull things from there into their own columns. Try it for the phone number; you will need to add “[]” as it’s an array.
A tip for using data from external tables: the connector still needs to connect to S3 and parse the JSON data on the fly, which can be slow given the number of variables involved, such as connections, proxies and the amount of data. This slow access certainly won’t be pleasant in regular use, so what we can do is create a view image of this table.

create view <your_schema>.<your_view> as
    select * from <your_schema>.<your_table>;
create view image <your_schema>.<your_view>;

The view image is essentially a snapshot of the external table data held in memory, which can be refreshed with newer data by simply recreating it. This can be done manually, but it’s recommended that you create a bash script to submit the query via wxsubmit and then schedule it to run hourly/daily/weekly via something like cron. It’s also a good idea to do any cleaning or transforming at this view creation stage, rather than just “select *”, so that the data is ready to use.

Next time we’ll use this with a much larger data set including visuals with Tableau and insights with external scripts.

Simple performance checks against your hardware cluster


Posted By: Simon Darkin
performance, hardware cluster, cpu, benchmarks

Kognitio have a lot of experience commissioning clusters of new hardware for our MPP software product. As part of that process, we’ve developed a number of steps for validating the performance of new clusters, and these are the topic of this blog entry.


There are many Linux-based benchmarking tools on the market, but they are not usually installed by default, in which case some simple command line tools can be used to quickly establish whether there is a potential hardware issue that warrants further investigation. The following hardware components are covered:

  • CPU
  • Disk
  • Networking
  • RAM



CPU

A slow CPU or core could have an adverse effect on query performance, so basic command line tools can be used to help identify laggards. A background ‘for’ loop can be employed to ensure all cores/threads are tested simultaneously.


Integer arithmetic test

Invoked 8 times to run simultaneously against 8 cores


for i in `seq 1 8`; do time $(i=0; while (( i < 999999 )); do (( i ++ )); done)& done; wait


This will return the time taken to increment an integer over the specified range. A comparison of the time taken by each core will help identify outliers:


real    0m8.908s
user    0m8.765s
sys     0m0.140s

real    0m8.943s
user    0m8.789s
sys     0m0.156s

real    0m8.997s
user    0m8.761s
sys     0m0.112s

real    0m9.000s
user    0m8.853s
sys     0m0.144s

real    0m9.023s
user    0m8.881s
sys     0m0.140s

real    0m9.028s
user    0m8.861s
sys     0m0.168s

real    0m9.034s
user    0m8.857s
sys     0m0.176s

real    0m9.073s
user    0m8.781s
sys     0m0.156s


Whilst the test is running, you can check that each core is under load by running top and expanding the output to show all cores. If you do encounter outliers in the arithmetic test, you can use the output from top to identify which core(s) remain busy when others have finished:


Cpu0  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu1  : 99.0%us,  1.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu2  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu3  : 99.3%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu4  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu5  : 98.0%us,  2.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu6  : 98.3%us,  1.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Cpu7  : 98.7%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st


Compression test

As with the arithmetic test, this example loops 8 times so that 8 cores are tested simultaneously. Data is written to /dev/null to avoid any overhead associated with disk IO.


for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait


This will return the rate at which each core is able to compress 1 GB of data:


1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9277 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9545 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9799 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.9831 seconds, 83.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.0085 seconds, 83.3 MB/s

1000000000 bytes (1.0 GB) copied, 12.0382 seconds, 83.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.2655 seconds, 81.5 MB/s


With Kognitio software installed, you can use the wxtool command to run the compression test simultaneously against all database nodes, to aid comparison across the cluster as a whole. You can download the software for free from the Kognitio website.


wxtool -a '{can DB}' -S 'for i in `seq 1 8`; do dd if=/dev/zero bs=1000 count=1000000 | gzip >/dev/null&  done; wait'


For node kap1-1 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.9422 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9659 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 11.9876 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.0142 seconds, 83.2 MB/s

1000000000 bytes (1.0 GB) copied, 12.1293 seconds, 82.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3754 seconds, 80.8 MB/s

1000000000 bytes (1.0 GB) copied, 12.4132 seconds, 80.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.4386 seconds, 80.4 MB/s

For node kap1-3 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.8398 seconds, 84.5 MB/s

1000000000 bytes (1.0 GB) copied, 11.8661 seconds, 84.3 MB/s

1000000000 bytes (1.0 GB) copied, 11.8893 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9165 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.946 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.953 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9637 seconds, 83.6 MB/s

1000000000 bytes (1.0 GB) copied, 12.2996 seconds, 81.3 MB/s

For node kap1-3 (ecode 0, 866 bytes):

1000000000 bytes (1.0 GB) copied, 11.8757 seconds, 84.2 MB/s

1000000000 bytes (1.0 GB) copied, 11.8846 seconds, 84.1 MB/s

1000000000 bytes (1.0 GB) copied, 11.9178 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9243 seconds, 83.9 MB/s

1000000000 bytes (1.0 GB) copied, 11.9377 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9834 seconds, 83.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.3367 seconds, 81.1 MB/s

1000000000 bytes (1.0 GB) copied, 12.3942 seconds, 80.7 MB/s

For node kap1-4 (ecode 0, 864 bytes):

1000000000 bytes (1.0 GB) copied, 11.91 seconds, 84.0 MB/s

1000000000 bytes (1.0 GB) copied, 11.9291 seconds, 83.8 MB/s

1000000000 bytes (1.0 GB) copied, 11.9448 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 11.9498 seconds, 83.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.1232 seconds, 82.5 MB/s

1000000000 bytes (1.0 GB) copied, 12.3896 seconds, 80.7 MB/s

1000000000 bytes (1.0 GB) copied, 12.4449 seconds, 80.4 MB/s

1000000000 bytes (1.0 GB) copied, 12.4504 seconds, 80.3 MB/s
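When wxtool fans the test out across nodes, the per-node blocks can be compared programmatically. This illustrative Python sketch (sample figures invented, 10% threshold arbitrary) groups the MB/s figures by node and flags any node whose slowest stream falls well behind the cluster’s best:

```python
import re

# Hypothetical wxtool output; kap1-2 has one slow stream.
output = """\
For node kap1-1 (ecode 0, 866 bytes):
1000000000 bytes (1.0 GB) copied, 11.9422 seconds, 83.7 MB/s
1000000000 bytes (1.0 GB) copied, 12.4386 seconds, 80.4 MB/s
For node kap1-2 (ecode 0, 864 bytes):
1000000000 bytes (1.0 GB) copied, 11.8398 seconds, 84.5 MB/s
1000000000 bytes (1.0 GB) copied, 15.1023 seconds, 66.2 MB/s
"""

rates = {}          # node name -> list of MB/s figures
node = None
for line in output.splitlines():
    m = re.match(r"For node (\S+)", line)
    if m:
        node = m.group(1)
        continue
    m = re.search(r"([\d.]+) MB/s", line)
    if m and node:
        rates.setdefault(node, []).append(float(m.group(1)))

# Flag any node whose slowest stream is >10% below the cluster-wide best.
best = max(r for v in rates.values() for r in v)
slow = [n for n, v in rates.items() if min(v) < 0.9 * best]
print(slow)  # ['kap1-2']
```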





Disk

Having just one underperforming disk in the system can significantly impact query performance against disk based tables. Here are some simple tests to help identify any anomalies.


Iterative write speed test with dd




for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done


This will return the duration and rate at which data can be written out to disk. In this example 1 GB of data is repeatedly written to a raw partition; note that fsync is used to flush the writeback cache, ensuring data is written to the physical media. Be warned that writing to a raw partition like this destroys any data on it, so only use a partition that is safe to overwrite.


Loop 1
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 13.6466 seconds, 73.3 MB/s

Loop 2
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.8324 seconds, 77.9 MB/s

Loop 3
100000+0 records in
100000+0 records out
1000000000 bytes (1.0 GB) copied, 12.4271 seconds, 80.5 MB/s


With Kognitio software installed, the test can be expanded to run on all database nodes allowing for easy comparison of all disks in the system


wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; dd if=/dev/zero of=/dev/cciss/c0d0p2 bs=10000 count=100000 conv=fsync; echo ""; done'


Iterative read speed test with dd


for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done


This will return the rate at which data can be read from disk. In this example 1 GB of data is read from a raw partition, adjusting the offset and dropping the page cache on each iteration to ensure data is read from the physical media.


Loop 1 - skip = 5000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 14.4355 seconds, 69.3 MB/s

Loop 2 - skip = 10000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.9884 seconds, 77.0 MB/s

Loop 3 - skip = 15000
1000000+0 records in
1000000+0 records out
1000000000 bytes (1.0 GB) copied, 12.6045 seconds, 79.3 MB/s


With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.


wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do let skip=$i*5000; echo "Loop $i - skip = $skip"; sync ; echo 3 >/proc/sys/vm/drop_caches; dd if=/dev/cciss/c0d0p2 of=/dev/null bs=1000 count=1000000 skip=$skip ;echo ""; done'


Iterative read speed test with hdparm


for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done


This will return the rate at which data can be read sequentially from disk, without any file system overhead.


Loop 1
Timing O_DIRECT disk reads:  236 MB in  3.01 seconds =  78.40 MB/sec

Loop 2
Timing O_DIRECT disk reads:  236 MB in  3.02 seconds =  78.09 MB/sec

Loop 3
Timing O_DIRECT disk reads:  230 MB in  3.01 seconds =  76.30 MB/sec


With Kognitio software installed, the test can be expanded to run on all database nodes to aid comparison across the entire system.


wxtool -a '{can DB}' -S 'for i in `seq 1 3`; do echo "Loop $i"; hdparm --direct -t /dev/cciss/c0d0p2; echo ""; done'


Disk based table scan


If the cluster is running Kognitio database software, you can initiate a scan of a large disk based table and review the output from wxtop in order to spot any disk store processes that remain busy for a significant period after others have finished. For accurate results you should ensure there is no concurrent activity when performing this test.


select *
from <large disk based table>
where <condition unlikely to be true>;


Monitor the output from wxtop and look out for any disk store processes that remain busy when all or most others have finished.


PID       NODE        PROCESS                           SIZE      TIME
15784       kap1-1      WXDB(55): Diskstore             258036       100
22064       kap1-2      WXDB(18): Diskstore             257176        86
25179       kap1-3      WXDB(73): Diskstore top         258200        84
31237       kap1-4      WXDB(37): Diskstore             258068        77


If a disk store process does appear to lag behind, you should eliminate the possibility of it being attributable to data skew by checking the row counts across all of the disks using the following query:


select <disk_id_column>, sum(nrows) nrows
from ipe_ftable
where table_id = <table_id being scanned>
group by 1
order by 2 desc;




Networking

You can test the network links between nodes using some simple netcat commands. This will allow you to spot links that are underperforming.


Link speed test using dd and netcat


The name and options associated with the netcat binary will depend on the Linux installation; however, with Kognitio software installed you can use wxnetread and wxnetwrite for the data transfer regardless.


Set up a listening process on the node performing the read:


netcat -l -p 2000 > /dev/null &


Use dd to generate some data and pipe it through netcat to the IP and port of the node performing the read:


dd if=/dev/zero bs=1000 count=1000000 | netcat <ip_of_reading_node> 2000


This will return the rate at which data can be copied over the link:


1000000000 bytes (1.0 GB) copied, 8.54445 seconds, 117 MB/s


The same test as above, this time using wxnetread/wxnetwrite:


wxnetread -l -p 2000 > /dev/null &


dd if=/dev/zero bs=1000 count=1000000 | wxnetwrite <ip_of_reading_node> 2000


1000000000 bytes (1.0 GB) copied, 8.5328 seconds, 117 MB/s


Shape tests

With Kognitio software installed you can run shape tests to measure the speed at which RAM based data is re-distributed between nodes:


wxtester -s <dsn> -u sys -p <password> -Ishape 5000 9000 1


Once the tests have been running for a few minutes, you can navigate to the logs directory and check the data rate:


cd `wxlogd wxtester`
grep TSTSHN results | gawk '{ if ($5==64) print ((($3*$5)/$6)/<number of database nodes>)/1048576 }'


With older generation hardware you can expect to see performance of 40MB/s per node, given sufficient network bandwidth. With newer hardware, for example HP Gen9 servers with 2x 56Gb/s links per node, this increases to 90MB/s per core.




RAM

Benchmarking RAM performance is best left to a dedicated test suite; however, you can perform a very simple write/read speed test using dd in conjunction with a temporary file storage facility in RAM, which at the very least can show up a mismatch in performance between nodes.


Write and read speed test using dd


mkdir RAM
mount tmpfs -t tmpfs RAM
cd RAM
dd if=/dev/zero of=ram_data bs=1M count=1000


This will return the rate at which data is written to RAM:


1048576000 bytes (1.0 GB) copied, 1.06452 seconds, 985 MB/s


dd if=ram_data of=/dev/null bs=1M count=1000


This will return the rate at which data is read from RAM:


1048576000 bytes (1.0 GB) copied, 0.6346 seconds, 1.7 GB/s


How can you give your big data the Spark it needs?


Posted By: Paul Groom
Categories: Guides

big data, spark

For many firms, one of the biggest challenges when they are implementing big data analytics initiatives is dealing with the vast amount of information they collect in a timely manner.

Getting quick results is essential to the success of such a project. With the most advanced users of the technology able to gain real-time insights into the goings-on within their business and in the wider market, enterprises that lack these capabilities will struggle to compete. While the most alert companies can spot potential opportunities even before they emerge, a slower business’ analytics may only notice an opportunity after it has already passed.

So what can companies do to ensure they are not falling behind with their big data? In many cases, the speed of their analytics is limited by the infrastructure they have in place. But there are a growing number of solutions now available that can address these issues.

Spark and more

One of the most-hyped of these technologies is Apache Spark. This is open-source software that many are touting as a potential replacement for Hadoop. Its key feature is much faster data processing, claimed to be up to ten times faster on disk than Hadoop MapReduce, or 100 times faster for in-memory operations.

In today’s demanding environment, this speed difference could be vital. With optional features for SQL, real-time stream processing and machine learning that promise far more than what generic Hadoop is capable of, these integrated components could be the key to quickly unlocking the potential of a firm’s data.

However, it shouldn’t be assumed that Spark is the only option available for companies looking to boost their data operations. There are a range of in-memory platforms (Kognitio being one!) and open-source platforms available to help with tasks like analytics and real-time processing, such as Apache Flink. And Hadoop itself should not be neglected: tech like Spark should not be seen as a direct replacement for this until its feature set matures, as they do not perform exactly the same tasks and can – and often should – be deployed together as part of a comprehensive big data solution.

Is your big data fast enough?

It’s also important to remember that no two businesses are alike, so not every firm will benefit from the tech in the same way. When deciding if Spark or analytical platforms like it are for you, there are several factors that need to be considered.

For starters, businesses need to determine how important speedy results are to them. If they have a need for timely or real-time results, for instance as part of a dynamic pricing strategy or to monitor financial transactions for fraud, then the speed provided by Spark and its like will be essential.

As technology such as the Internet of Things becomes more commonplace in businesses across many industries, the speed provided by Spark and others will be beneficial. If companies are having to deal with a constant stream of incoming data from sensors, they will need an ability to deal with this quickly and continuously.

Giving your big data a boost

Turning to new technologies such as Spark or Flink can really help improve the speed, flexibility and efficiency of a Hadoop deployment. One of the key reasons for this is the fact that they take full advantage of in-memory technology.

In traditional analytics tools, information is stored on, read from and written to physical storage such as hard disk drives during processing; MapReduce will do this many times for a given job. This is typically one of the biggest bottlenecks in the processing operation and therefore a key cause of slow, poorly-performing analytics.

However, technologies such as Spark conduct the majority of their tasks in-memory, copying the information into much faster RAM and keeping it there as much as possible, where it can be accessed instantaneously. As the cost of memory continues to fall, these powerful capabilities are now within much easier reach of many businesses, and at a scale not previously thought possible.

Are you ready to dive into the data lake?


Posted By: Paul Groom
Categories: Guides

data lake, big data

By now, big data analytics has well and truly passed the hype stage, and is becoming an essential part of many businesses’ plans. However, while the technology is maturing quickly, there are still a great deal of questions about how to go about implementing it.

One strategy that’s appearing more often these days is the concept of the ‘data lake’. For the uninitiated, this involves taking all the information that a business gathers – often from a large range of wildly diverse sources and data types – and placing it into a single location from which it can be accessed and analysed.

Powerful tools such as Hadoop have made this a much more practical solution for all enterprises – so it’s no longer just something limited to the largest firms. But what will it mean for businesses in practice? Understanding how to make the most of a data lake – and what pitfalls need to be avoided – is essential if businesses are to convert the potential of their data into real-world results.

One repository, one resource

One of the key features of a data lake is that it enables businesses to break free from traditional siloed approaches to data warehousing. Because the majority of the data a company possesses is available in the same place, they can ensure they have a full picture when building analytical or reporting models.

As well as more certainty in the accuracy of results, taking a data lake approach means businesses will find it much easier and cheaper to scale up as their data volumes and usage grow. And not only is this scalability cheap: a strong data lake is capable of holding amounts of raw data that would be unthinkable for a traditional data warehouse.

A data lake will be particularly useful if a business is dealing with large amounts of unstructured data or widely varying data, where it can be difficult to assign a specific, known set of attributes to a piece of information – such as social media or image data.

Clear results need clear waters

However, in order to make the most of this, businesses still need to be very careful about the information they pour into their data lakes. Inaccurate, vague or outdated information will end up seriously compromising the results a company sees.

In order to deliver effective insights, the water in your data lake needs to be as crystal clear as possible. The murkier it gets, the worse it will perform. Think of it this way: you wouldn’t drink the water from a swamp, so why should you trust the results from a dirty data lake?

This concept – most famously known as garbage in, garbage out – can dictate the success or failure of a data lake strategy. Therefore, while it may be tempting to simply pump everything into a data lake and worry later about what it all means, taking the time to assess and clean all the information first will pay dividends when the time comes to review analytics results. Even something as straightforward as ensuring everything has accurate metadata attached can make a big difference when it comes to wading through the lake to qualify data for use in a study.

Making sure you don’t drown

Resisting this urge to simply dump data into a data lake, hoping that the sheer volume will overcome any shortcoming in the quality, will be one of the key factors in making a data lake strategy a success. But it’s not the only philosophy for a strong solution.

With so many sources of data in the same place, it’s easy to become overwhelmed and end up drowning in all the information potential. In order to get quality results, users must make sure they are asking appropriate questions of the right combinations of data. The more focused and specific analysts can make their queries, the more likely they are to see valuable outcomes.

Meanwhile, other risks of the data lake include the varied security implications. With so much data coming together in one place, they may be tempting targets for criminals, so strong protections, access controls and audit are a must.

As one of the most hyped technologies of the big data revolution, many firms will be interested in what data lakes can offer. But, as the old adage says, it’s important to look before you leap – otherwise you could be diving head-first into a world of trouble.


R or Python: Choosing the right language for your big data deployment


Posted By: Paul Groom
Categories: Guides

r, python, big data deployment

For any business embarking on a new big data analytics project, there will be many key questions that need answering. As well as the strategic queries, such as what use cases they intend to deploy the solution for and what the expected ROI will be, there are several key technology issues that need to be addressed.


Central to this will be the matter of what programming language they will use to develop their big data applications. There are a range of options for this, but among the most popular are Python and R. Both have their champions and detractors, so deciding which is best is a tricky decision.


It could be argued that choosing which to use is a matter of personal preference. While they each have their own pros and cons, if used correctly then either can help create powerful applications for the analysis of large quantities of data. But as with many things, it is rarely that simple.


The key use cases


Both Python and R are seeing their popularity among enterprise users grow rapidly, but their history is slightly different, and this tends to dictate the use cases they are most commonly deployed for. R, for instance, was originally designed by and for statisticians, and has been very popular among researchers and academics, while Python has a greater emphasis on productivity and readability.


This means that Python is often seen as easier if data analysis tasks need to be integrated with other applications, such as web apps or a production database. R, on the other hand, is mainly used when jobs require standalone analytics tasks, its interactive mode supporting flexible progressive interaction with data and the required analytics.


If companies are just setting out on their big data journey, R may be easier to use in the early stages, as statistical models can be created with just a few lines of code. However, once processes become more advanced, Python’s wide range of features can be hugely beneficial in creating algorithms for use in full production environments.


Usability and flexibility


The perceived user-friendliness of the two options, and how adaptable they are to the unique needs of the business, will also be major factors in their decision-making. In this regard, some people consider Python to have the advantage, as it is a common, easy-to-understand language that many programmers will be familiar with.


As a general purpose language, it is easy and intuitive, with an emphasis on readability. By contrast, R has a steeper learning curve that can make using it a daunting task for individuals coming to it for the first time. Without third-party tools to improve its performance, it is also a slow platform, making it unsuitable for applications where fast results are needed.


On the other hand, R is a very mature language and interactive workbench, with a rich, well-established ecosystem, and it also has strong graphical capabilities that make visualising data much easier. Python is still playing catch-up in both these regards.


Taking a pragmatic approach


However, the programming languages share many positives. For instance, both are distributed under an open-source license, meaning they are free for anyone to download and get started with. This also means there are many online communities developing advanced tools and offering support to their users.


Ultimately, it’s impossible to say definitively which of R or Python is the best programming language for big data operations. They both have a wide range of features and quirks, so the decision of which to use should be based on what you intend to do, as well as the skills you already have.


So, for instance, if you’ve determined that, say, Python is better suited to a certain task, but your data scientists already have some familiarity with R, you need to decide if it will be better for programmers to stick to what they know, rather than trying to learn a new tool.


Both solutions should be more than capable of handling most business big data tasks, so you need to look at what skills you have, what’s already in use in your industry and what problems you need to solve when evaluating your options.


And then there is Scala, but that is a whole other topic…

Spotting the signs your big data project is heading for failure


Posted By : Paul Groom Comments are off
Categories :Guides

big data project, business departments, IT

The number of companies embarking on big data projects is set to increase hugely in the coming years, as awareness of the technology grows and more enterprises come to understand what it can offer them.

But making such deployments work effectively is no easy task, and the size of the business and the amount of resources it is able to devote to such projects may have little bearing on its success. According to Gartner, 85 per cent of Fortune 500 organisations will be unable to gain competitive advantage from big data in 2015, and only 60 per cent of initiatives will ever make it to the production stage.

There are many reasons why this may be the case. But often, there will be a few key warning signs companies should be alert to that may indicate their projects are heading for failure. By learning how to spot these signs early, businesses can change course and put themselves on the right path to ensure a positive outcome.

Who’s leading your efforts?

An effective big data analytics initiative needs to have input and engagement from all departments, including the IT team tasked with the implementation, the business units that will actually use it, and the board members who sign off on it. However, in many cases, organisations end up leaving the project in the hands of one team, which can create a range of problems.

For instance, some firms may consider it a good idea to allow their marketing teams to take the lead, as improving this department’s performance is a common goal for big data. However, without input from IT, these efforts may be difficult to monetise, or transfer across to other departments as users fail to fully understand how to utilise the technology.

On the other hand, it may seem to make sense to hand the reins over to data warehousing teams, as they will typically have vast experience with traditional business intelligence solutions. However, the consequence of this is that they will likely be resistant to the type of wholesale change necessary to make big data work and, if left to their own devices, will not deliver a system fit for today’s purposes.

What’s your plan for progress?

Setting out a clear roadmap for the development of a big data tool is essential, but if key steps are taken too early or too late, this can seriously hinder the effectiveness of a solution. For example, one common misstep is for a business to decide at the outset that it will need a skilled data scientist to manage its solution.

It then turns to sources such as LinkedIn in search of individuals who list this skill, before fully understanding what its requirements are or what skills it will need. This can leave the business unable to determine whether the candidates it is considering are really suitable – or whether it already has personnel in the organisation who can be trained.

Similarly, it can be tempting to settle quickly on a technology to base a big data analytics platform on. Many firms are keen to get this done quickly, as it feels like once a solution has been selected, the bulk of the work is complete. However, once a tool is selected, a business is locked into a particular path, which can be time-consuming and costly to change if it later emerges the platform isn’t suitable.

Do you have the right expectations?

Another important warning sign that a big data analytics project may be likely to fail is if personnel – particularly at the board level – are taking an overly results-driven approach to the initiative. This will be especially noticeable if advocates for the solution are being quizzed about expected ROI right from the start.

One of the problems with this is that it is usually difficult to put a clear monetary value on a big data project. It is often an unpredictable process of trial and error, and trying to consider returns via traditional methods may be doomed to failure, and inevitably lead to disappointment.

On the other hand, expecting to wait six months or more before seeing any results is also a sign of poor planning. Given the fluid nature of big data, it’s unavoidable that there will be some failures along the way – the secret to success is to start small, ‘fail fast’ and respond to these failures as quickly as possible. If businesses are not planning on looking at results until six months down the line, by the time they identify potentially fatal problems, it will be too late to do anything other than start from scratch.

How IoT is changing the game for big data


Posted By : Paul Groom Comments are off
Categories :Guides

big data, IoT

As connectivity around the world continues to grow, one consequence of this is there will be an enormous increase in the number of devices able to gather data and send it back to businesses. These are not only limited to traditional connectivity gadgets such as smartphones, but a wide variety of sensors and other items that will be able to collect information on almost any parameter, whether in homes, factories, workplaces or vehicles.

This is called the Internet of Things (IoT), and it’s set to be big business in the coming years. Gartner estimates there will be 4.9 billion connected devices in use around the world this year – an increase of 30 per cent on 2014. However, this pales in comparison to the heights the technology is expected to reach in the coming years. By 2020, the market research firm predicts there will be 25 billion IoT gadgets in use, while International Data Corporation estimates the market will be worth $1.7 trillion by that year.

The next big thing for IT

But what is IoT, and what will it mean for the data analytics sector? In truth, the term covers a wide variety of devices, but it essentially boils down to equipping items with electronic sensors and connectivity that allows them to communicate with each other or central services.

This could be GPS devices on cars that inform insurance companies about an owner’s driving patterns, manufacturing applications that can alert users when maintenance is required for machines, or even domestic uses such as climate control apps or refrigerators that can automatically order groceries when supplies run low.

This means that regardless of the industry they are in, no company can afford to ignore the potential of the technology. Companies that do not have access to IoT data will lack crucial insights into their customers, preventing them from making appropriate decisions to increase their profitability, boost service levels and develop longer engagements with customers.

Challenges for analytics

But this will lead to a range of challenges for companies looking to incorporate IoT insight into their operations – particularly when it comes to gathering, storing and analysing the data that these devices can generate.

For starters, businesses need to have plans in place for how they will cope with the huge volume of data they can expect to receive, as well as dealing with a wide variety of mostly unstructured data. This is where tools such as Hadoop, which are capable of cost-effectively storing, processing and analysing this type of information, can prove highly useful.

Businesses also need to bear in mind the privacy implications of IoT, particularly when dealing with consumer-oriented applications. With so many devices expected to gather data on almost every aspect of people’s lives, from what products they buy to what their movements are, there will naturally be concerns about how this information is used – both directly and indirectly. Therefore, companies must work hard to reassure their customers that they have strong security, and be transparent about their information processing activities – including any sharing of data with third parties.

A wide range of applications

When it comes to applying IoT solutions to a business’ operations, the possibilities are almost endless – and are already being used in a lot more places than many people realise. For example, gadgets such as the Apple Watch and Fitbit activity tracker feature sensors that can monitor the health and fitness of their wearer, providing information and advice. This is just one instance where IoT technology is being integrated into consumer products, but it’s far from the only option for IoT.

The real benefits of this technology are likely to come from more large-scale commercial and industrial applications. For instance, adding sensors to utilities infrastructure to create smart energy grids can enable suppliers to dynamically adjust their output in response to demand, or identify areas of the network where maintenance is needed.

Similarly, manufacturers can use IoT sensors to boost efficiency, while operators of transport services can keep an eye on their entire fleet and be alerted to any problems before they become a major issue. But whatever the use case, in order to be successful, an IoT deployment will have to be backed up by an advanced analytics solution that is able to take data from these disparate sources and translate it into actions – with or without the intervention of a human operator.

So you’ve got Hadoop – what are the next steps?


Posted By : Paul Groom Comments are off
Categories :Guides

Big data delivers benefits after 18 to 24 months

Of all the technologies currently competing for attention in the big data analytics sector, one solution that no business can afford to ignore is Hadoop. Even though this platform is still in its relative infancy, projections for the future are highly optimistic.

Indeed, it was forecast by Forrester Research analyst Mike Gualtieri that as Hadoop continues to disrupt established ways of running analytics operations, it will become the only viable option for many users.

Speaking at the 2015 Hadoop Summit in San Jose, California, he said: “It’s a data operating system and a fundamental data platform that in the next couple of years 100 per cent of large companies will adopt.”

However, there’s a world of difference between adopting a solution and being able to make the most of it. While many companies may be driven to explore Hadoop as a result of the hype surrounding it, relatively few understand exactly how they will leverage the solution to improve their business once it is established.

More than just a storage solution

One of the biggest reasons why Hadoop deployments fail is because businesses do not use them to their full potential. In fact, in many cases, Hadoop is simply used as a cheap storage solution in which companies can dump all their data, without really considering what they do with this resource; the processing potential of Hadoop and its ecosystem is undervalued or misunderstood.

Hadoop’s highly cost-effective storage is only one of its key benefits, but focusing on cost alone can lead to businesses failing to treat it as the powerful analytics platform it is capable of being. Coupled with the sometimes steep learning curve for the technology and its many components, it’s easy to see why companies fail to take full advantage of its potential.

The result of this is that instead of a useful ‘data lake’, where all of a business’ digital assets are easily available for continual analysis, companies end up with a ‘data attic’, in which lots of data is just parked and then forgotten about for many months. In these cases, by the time data scientists return to these attics, they will struggle to achieve timely value.

A clear plan

To avoid this, it’s vital that companies engage with their data as soon as possible. Even if they are not yet ready for running full analytics operations, encouraging users to pay close attention to the information they are inputting into their system has clear near-term benefits.

Therefore, it’s important that businesses don’t approach their Hadoop deployments with an attitude that sees them put all their data into the tools first and figure out what to do with it later. In order to be successful, a clear path to results will be needed, so users at all stages understand what the end-goal is and what steps will need to be taken along the way to achieve this.

If businesses don’t have such plans in place from the start, and instead treat their Hadoop as a data attic, then when they do eventually come back and look at big data analytics, they’re likely to find a sprawling mess of disparate data that requires a lot of work to convert it into useful insights.

Realising the true value

One of the best ways to prevent this is to ensure your business takes the time to assess its data estate for potential value right from the start. Instead of simply shovelling every scrap of raw data they collect into a Hadoop storage solution, companies need to effectively ‘triage’ their information base to determine whether or not elements will enhance or distract from the collective value.

Things to be considering at this stage include the quality of the data – how likely is it to be complete, clean and accurate – and how relevant you expect it to be for future use. It’s all too easy to just add data under the assumption that these are concerns to be thought about later, but doing this just creates potential clutter. To derive true value from your big data analytics, you need to plan carefully and appreciate that Hadoop needs to be much more than just a low-cost place to store your growing volumes of data.

Meeting your responsibilities – data plus security


Posted By : Paul Groom Comments are off
Categories :Guides

data plus security challenges

As more businesses adopt big data analytics solutions, one of the key questions that will need to be answered is how they will handle the vast amounts of information they collect. Developments such as social media, mobile devices and the Internet of Things make it easier than ever for companies to gather a wide range of data that can give insight on every aspect of their customers’ activities and preferences.

At the same time, users are more concerned than ever about what is being done with their information. Whether it is personal data being sold for profit, hackers looking to steal financial details, or a government monitoring its citizens’ activities, there has never been greater awareness among the public of the importance of data privacy and security.

Companies that are unable to demonstrate they care about their users’ privacy are likely to be viewed with suspicion by many consumers, who will be wary about doing any business with them that requires handing over personal details. Therefore, from both financial and reputational perspectives, it pays to have strong security policies in place.

The need for compliance

As well as reassuring customers about the safety of their personal data, enterprises frequently have to deal with a complex set of regulations concerning it, at both a national and international level.

For instance, the European Union is preparing a major update to its rules with the introduction of its General Data Protection Regulation (GDPR). Intended as a more unified approach to privacy, replacing a patchwork of differing national rules, the regulation is likely to require many firms to overhaul their practices to ensure they remain compliant.

The GDPR will impact existing established rules such as the UK Data Protection Act, leading to further headaches for companies that are working with large volumes of data. Uncertainty about what their responsibilities are, and may become, remains widespread, with research from Kroll Ontrack suggesting four out of five IT managers do not know what they will have to do to comply with the GDPR.

Keeping your data under control

In order to meet these demands, there will be several challenges to be addressed, and these generally fall into two categories – technical and cultural. For the former, it will be incumbent on businesses to put in place the right solutions and standards, such as ISO 27001, in order to effectively and reliably manage their processes and protect their data.

However, it is the latter challenge – making the necessary cultural changes in an organisation to meet data protection expectations – where the real difficulty may lie. Altering attitudes to make the careful and considered management of data a priority must take place on a company-wide basis, and be driven from the very top.

This will require the implementation of clear best practices for handling sensitive customer data and improving working processes. A range of policies may play a part in this, such as ensuring that the use of data is limited to specific, defined uses, or anonymising it wherever possible.
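Anonymisation can start as simply as replacing direct identifiers with a keyed hash at the point of ingest. The sketch below (in Python, with an invented key and field names) pseudonymises an email address so records can still be joined for analysis without exposing the raw identity – a minimal illustration, not a complete anonymisation scheme:

```python
# Minimal pseudonymisation sketch: swap a direct identifier for a keyed
# hash so records remain joinable, but the raw identity never leaves the
# ingest step. The key and field names here are illustrative assumptions.
import hmac
import hashlib

SECRET_KEY = b"rotate-me-and-store-me-securely"  # never hard-code in production

def pseudonymise(value: str) -> str:
    """Deterministic keyed hash (HMAC-SHA256) of a direct identifier."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "age_band": "25-34", "region": "NY"}
safe_record = {**record, "email": pseudonymise(record["email"])}

print(safe_record)
```

Keyed hashing (HMAC) rather than a plain hash matters here: without the key, tokens cannot simply be brute-forced from a list of known email addresses, and rotating the key invalidates old tokens.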

Security plus performance

It will be vital that the measures businesses put in place are able to meet all regulatory requirements and keep data safe – but firms will also have to work hard to ensure that these processes do not get in the way of the effective analysis of information.

As the ‘data real estate’ of many firms grows, it can create a sprawling, hard-to-control environment that not only slows down analytics activities, but can open up data to unauthorised access. Information must be kept clean, accurate and relevant to ensure that the risks are minimised and analytical value can be derived.

It will also be important for enterprises to place a high priority on data provenance. This means understanding the flow of data around the business, who has access to it and exactly what path it takes. This not only helps businesses develop a better idea of how they are handling their information, it will be essential for auditing purposes, which is something more businesses will have to deal with on a regular basis as various industries come under closer scrutiny.
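Provenance tracking does not have to start sophisticated. Even a minimal, append-only audit record of who touched which dataset, when, and where it moved covers the basics auditors ask about. A Python sketch, with illustrative field names and paths:

```python
# A minimal provenance record, sketched as one line of an append-only
# audit log. Field names, dataset names and paths are illustrative.
import json
from datetime import datetime, timezone

def provenance_entry(dataset, actor, action, source, destination):
    """Serialise one audit event: who did what to which data, and where it went."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "dataset": dataset,
        "actor": actor,
        "action": action,        # e.g. "read", "transform", "export"
        "source": source,
        "destination": destination,
    })

# Illustrative entry: an analyst loading raw data into a warehouse table.
print(provenance_entry("customer_forms", "analyst.jsmith", "load",
                       "s3://example-bucket/raw/", "warehouse.customer_forms"))
```

A real deployment would sign these entries and ship them to tamper-evident storage, but even this much makes "exactly what path the data takes" an answerable question.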

Keeping big data solutions secure is about much more than just encrypting it and implementing security tools to prevent breaches. Reassuring customers and building trust will be vitally important for firms, and this will require a major shift in how they think about their data management.

Incorporating R into your big data operations


Posted By : Paul Groom Comments are off
Categories :Guides

R open programming language

When businesses are looking at potential technology solutions for managing their big data analytics operations, they now have more choice than ever before. With a wide range of open-source and proprietary platforms from a number of vendors, selecting the most appropriate tool for a business’ needs has become a challenge in itself.

One solution that is growing in popularity is R. This is an open-source programming environment and language that has proven to be highly useful for delivering statistical and mathematical analysis of data. However, does it lend itself well to the high demands of big data applications?

Companies can implement R-based solutions with little to no capital expenditure, which makes it an attractive option for many users. However, there are some downsides to the technology that must be considered before companies take the plunge for big data use cases.

The pros and cons of R

One major advantage of R is its open-source nature. This makes it modifiable and updateable and enables any business to tailor a solution to its needs. This, coupled with the huge range of analytics capabilities it can bring to an organisation, means users will have a great deal of freedom to conduct powerful analytics operations.

The low cost of the technology is also a key driver for many users, meaning that advanced statistical and mathematical modelling is no longer the preserve of the largest organisations but is attainable for the masses – many students come out of university with R experience.

However, there are also numerous challenges that need to be overcome to make an R-based big data deployment a success. Chief among these is the breadth, and thus complexity, of the technology, which can make it challenging for more casual users to take advantage of R’s full capabilities; R is also not scalable by default.

The steep learning curve, deep-level capabilities and extremely complex inner workings of the technology mean it is not an especially user-friendly tool, while the need for external dependencies to utilise certain modular features means businesses may encounter complicated deployments.

Answering the scalability question

One of the other issues that has been holding back R as an enterprise-wide solution is the question of scalability. R is designed to be a workstation tool that users can interact with on their own machines, and is not designed for large-scale operations. While this means that individuals can very effectively use it with sample data, moving beyond this to full-scale, large-volume analytics is tricky, often requiring additional frameworks.
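The workstation limitation is often really a memory limitation: the default model loads the whole dataset before analysing it. One common workaround, sketched here in Python rather than R, is to stream the data and keep only running aggregates – Welford’s online algorithm for mean and variance, for example:

```python
# Streaming aggregation: compute count, mean and sample variance in a
# single pass with constant memory (Welford's online algorithm), so the
# input can be a generator over a file far larger than RAM.
def running_stats(values):
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count          # update the running mean
        m2 += delta * (x - mean)       # accumulate squared deviations
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance

# Works identically on a list or on a stream of a million values;
# memory use stays constant however large the input grows.
count, mean, variance = running_stats(range(1, 1_000_001))
print(count, mean, variance)
```

This is the general shape of what the "additional frameworks" mentioned below provide: they push computation to where the data lives, instead of pulling all the data onto one workstation.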

Therefore, solving the question of how to move from small-scale, ‘hobbyist’ deployments to mass-scale, supercomputing solutions will be a key step in a successful R deployment. This may require users to implement additional frameworks in order to make the most of the technology, as well as ensure that the infrastructure behind the technology has enough power to support such operations.

Preparing a business for the future

This may seem like a lot of effort, but in today’s environment, businesses cannot afford an incomplete solution. Today’s organisations are frequently looking to move from a culture of reporting to one where forecasting and predictive analytics are at the forefront of their users’ thinking. To do this, powerful tools such as R will be a must.

Until now, many experiments with R may have been on a small scale, with individual users running queries on a workstation level to sample relatively small amounts of data. But with information volumes continuing to grow, widening this out to full data analysis will be a must.

If businesses wish to place R at the heart of these plans and break out from the workstation level to more complex deployments, they will need smart, educated users who understand the ins and outs of the tool. While the steep learning curve of R may act as a deterrent to some companies, those that train staff and persevere will be able to enjoy the benefits of a powerful, affordable big data analytics solution that can help direct their overall business strategy.