Participate in the Kognitio Console beta test program


Posted By: Michael Atkinson
Categories :Blog, Kognitio How To

Kognitio Console is Kognitio’s client side management program for the Kognitio Analytical Platform.

Some of its features are:

  • It allows inspection of the metadata tree for schemas, tables, views, external scripts, users, connectors, sessions, queues, etc.
  • It gives an object view of each of these metadata objects, allowing their inspection and management.
  • It provides many tools, wizards and widgets to browse data in Hadoop, load and unload data, identify problem queries and more.
  • It includes a set of reports and dashboards to monitor the state of Kognitio systems.
  • It can be extended with macros; the reports and dashboards themselves are written in these XML macros.
  • It can execute ad-hoc queries.
  • It can execute and debug KogScripts. KogScript is based on Lua but adds Kognitio-specific enhancements to run SQL natively.
  • It is an integrated development environment (IDE) for KogScripts and for external scripts running in Kognitio.

All this makes Kognitio Console a versatile tool, suitable for database admins, analysts and power users.

Kognitio Console is in constant development; beta and release candidate versions are provided from our update website. Betas and release candidates will be announced on the Kognitio Console forums.

There are two ways of obtaining these betas:

Note that this is beta software, so it has not gone through full QA. However, we endeavour to make sure it has no obvious bugs, and it will have passed our Console smoke tests. If you experience crashes or other bugs or deficiencies in these betas and release candidates, please report them to us; this will help us make the next release more stable.

By installing it into a different location from the default, you can have both the beta and the latest stable release installed at the same time.

Using Kognitio on Amazon Elastic Map/Reduce


Posted By: Andy MacLean

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.

Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes: the ‘master node’, which runs Hadoop-specific cluster master programs like the HDFS namenode and Yarn resource manager; the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers; and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it; it gets transferred to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node, so that’s how this example will work. There are other possible choices here: you can use one of the cluster nodes, you can spin up a specific task group node to run it, or you can use an arbitrary EC2 node with the right security settings and client software installed. However, the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Setup authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop_extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:


#change the s3 bucket for your site
S3BUCKET=s3://kognitio-development

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

#point slider at the zookeeper server on port 2181 of the master node
grep -v '<\/configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
  <property>
    <name>hadoop.registry.zk.quorum</name>
    <value>\$(hostname -f):2181</value>
  </property>
</configuration>
XXX
cp kodoop/slider/conf/slider-client.xml kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression
XXX
EOF

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (you’ll want to change that to your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node, and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This setting enables message compression, which causes messages to get compressed (with the LZ4 compression algorithm) before being sent over a network. This setting is not normally used but we recommend it for Amazon because the network:cpu speed ratio is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have it run automatically as part of the cluster startup. You can do this by transferring the script to S3 and putting it in a directory with the tarballs (and editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor and select the release you want (we’ve tested with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper selected from the list of packages, although adding others that you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:


This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.
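For example, assuming the standard YARN property for delaying container cleanup (yarn.nodemanager.delete.debug-delay-sec, in seconds) set via EMR’s ‘yarn-site’ configuration classification, the settings box would contain something like:

```json
[
  {
    "classification": "yarn-site",
    "properties": {
      "yarn.nodemanager.delete.debug-delay-sec": "3600"
    }
  }
]
```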

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can use s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to s3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will then run automatically on the master node after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node, so this can be any instance type you like; the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use a small number of large instances rather than a larger number of smaller ones (so 3x r3.8xlarge instead of 6x r3.4xlarge, for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is to have the total RAM of the nodes which will be used for the Kognitio cluster be between 1.5x and 2x the size of the raw data you want to load as memory images.
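That sizing rule can be sketched as a quick calculation (the data size here is an illustrative figure, not from the article):

```shell
# Rule of thumb: total cluster RAM should be 1.5x to 2x the raw data
# you plan to load as memory images.
DATA_GB=400                      # illustrative raw data size in GB
MIN_RAM_GB=$((DATA_GB * 3 / 2))  # 1.5x the data size
MAX_RAM_GB=$((DATA_GB * 2))      # 2x the data size
echo "provision between ${MIN_RAM_GB}GB and ${MAX_RAM_GB}GB of cluster RAM"
```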

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your nodes. This is required because the AMI used by EMR needs to have a small amount of configuration done and some extra Linux packages installed in order for it to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define this as a ‘custom action’ bootstrap action. The following script does everything you need:


sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide us with a simple way to do this. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses this. Amazon runs the bootstrap action before the relevant repositories have been configured on the node so the RPM can’t be installed via yum. By the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc. that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.

Once this is working you can create a Kognitio cluster. The cluster is built from a number of Yarn containers, so you will need to choose a container memory size, a container vcore count and the number of containers you want to use. Normally you’ll want a single container per node which uses nearly all of the node’s memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
                              RUNNING                                      1
                              RUNNING                                      2
                              RUNNING                                      1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at
Node Report : 
        Node-Id :
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address :
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.
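Using the capacities reported in the node report above, that subtraction works out like this:

```shell
# Derive usable container sizes from 'yarn node -status' capacities,
# reserving 2048MB and 1 vcore for the application management container.
MEM_CAPACITY_MB=253952   # Memory-Capacity from the node report
VCORE_CAPACITY=128       # CPU-Capacity from the node report
CONTAINER_MEMSIZE=$((MEM_CAPACITY_MB - 2048))
CONTAINER_VCORES=$((VCORE_CAPACITY - 1))
echo "container memsize: ${CONTAINER_MEMSIZE}MB, vcores: ${CONTAINER_VCORES}"
```

These are exactly the values used in the create_cluster command below.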

You will also need to choose a name for the cluster which must be 12 characters or less and can only contain lower case letters, numbers and an underscore. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.
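The ‘Cluster will use 738 Gb of ram’ figure on the confirmation screen is simply the container count multiplied by the container memory size, converted to GB:

```shell
# Check the cluster RAM figure from the confirmation screen:
# 3 containers x 251904MB each, converted to GB.
CONTAINER_COUNT=3
CONTAINER_MEMSIZE=251904
TOTAL_GB=$((CONTAINER_COUNT * CONTAINER_MEMSIZE / 1024))
echo "cluster RAM: ${TOTAL_GB} Gb"
```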

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from our website, install them somewhere, run Kognitio Console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue SQL locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).

There are lots of different ways to set up your server and get data into it, but the most common is to build memory images (typically view images) to run SQL against. This is usually a two-step process: first create external tables which pull external data directly into the cluster, then create view images on top of these to pull data directly from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognitio is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example we can connect to a 6 column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If instead it is a directory full of csv files we can use ‘/path/to/file/test/*.csv’ instead to use them all as a single table in Kognitio.
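Putting that together, the directory variant of the earlier statement would look like this (reusing the illustrative 6-column layout from above):

```sql
create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int)
from HDFS target 'file /path/to/file/test/*.csv';
```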

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
accesskey "YOUR_ACCESS_KEY"
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name;

grant connect on connector aws to grp_aws;

This SQL loads the Kognitio Amazon plugin, creates a security group to allow access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY, and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials, then create multiple connectors and grant permission on different ones to appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it. In this case you can just add them to the group grp_aws, which has this permission.

max_connectors_per_node is needed here because the Amazon connector gives out-of-memory errors if you try to run too many instances of it in parallel on each node.

Now an external table can be created in exactly the same way as in the HDFS example. If my Amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this, but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used are view images. With a view image, the user defines a view in the normal SQL way and then images it, which makes an in-memory snapshot of the query’s results. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every RAM processing task
  • Random — Just put the rows anywhere. This is what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100MB in size) work best replicated
  • For everything else, hash on the primary key, except:
  • For the biggest images that join to non-replicated tables, hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.
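As a sketch reusing the table name from the examples above (the distribution choice and column names are illustrative, not prescribed by the article):

```sql
-- view over the external table defined in the S3 section
create view aws_testv as select f1, f2, f3 from test;
-- build the in-memory snapshot, hashed on f1
create view image aws_testv hashed(f1);
-- queries now run against the image in RAM
select f1, count(*) from aws_testv group by f1;
```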

Many firms still lacking big data strategy


Posted By: admin
Categories :#AnalyticsNews

Big data continues to be high on the agenda for many businesses, but despite the growing recognition of its importance, a large number of firms still do not have an effective plan in place for making the most of the technology.

This is among the key findings of a new survey conducted by DNV GL – Business Assurance and GFK Eurisko, which polled nearly 1,200 professionals from across Europe, Asia and the Americas. It revealed that many expect big data to play a significant role in future operations, but there is still a long way to go for many enterprises before this can be achieved.

More than three-quarters of respondents (76 per cent) predicted that investments in big data technology will be maintained or increased in the coming years, while two-thirds (65 per cent) are planning for an environment where big data is a key part of their operations.

But even though 52 per cent of professionals agreed that big data presents a clear business opportunity, only 23 per cent have a clear strategy in place for embracing the technology.

DNV GL noted that in order to make big data analytics a success, companies should treat it as a new journey, and make preparations and changes to their existing processes accordingly.

For example, 28 per cent of respondents say they have improved their information management procedures in order to make the adoption of advanced analytics tools as smooth as possible, while 25 per cent have implemented new technologies and methods for handling data.

However, fewer companies have worked on changing their day-to-day activities. Just 16 per cent have made efforts to change the culture or organisation to reflect a more data-driven approach, while 15 per cent have changed their business model.

"Big data is changing the game in a number of industries, representing new opportunities and challenges," said Luca Crisciotti, chief executive of DNV GL – Business Assurance. "I believe that companies that recognise and implement strategies and plans to leverage the information in their data pools have increased opportunities to become more efficient and meet their market and stakeholders better."

The survey found that all companies that have already adopted big data analytics report clear benefits from their efforts. For example, 23 per cent stated they have seen increased efficiency, 16 per cent reported better business decision making and 11 per cent witnessed financial savings. 

Meanwhile, 16 per cent stated their customer experience and engagement has improved as a result of big data, while nine per cent reported better relations with other stakeholders.

However, there are several factors that are still holding many firms back when it comes to adopting big data. Chief among these are a failure to develop an overall strategy and a lack of technical skills, both of which were named as issues by 24 per cent of respondents.

Therefore, getting the right personnel on board will be critical in making big data a success. These individuals need to understand the intricacies of big data analytics technologies, as well as take a leading role in preparing the business for the era of data.

Mr Crisciotti said: "The ability to use data to obtain actionable knowledge and insights is inevitable for companies that want to keep growing and profiting. The data analyst or scientist will be crucial in most organisations in the near future."

Address your big data challenges, the Kognitio Analytical Platform explained


Posted By: admin
Categories :#AnalyticsNews, Blog

Watch how the Kognitio Analytical Platform provides highly scalable, in-memory analytical software that delivers ultra-fast, high-concurrency SQL access to large and varied data using low-cost commodity hardware or Hadoop. When your growing user community wants ever faster query responses for complex workloads, they want unequivocal raw compute power: harnessing lots of CPUs efficiently doing lots of concurrent work, never waiting on slow disk. Enjoy the video; we had fun putting it together. Leave us comments telling us what you think of it.


Learn more by visiting the Kognitio Analytical Platform page


Converged approaches to data ‘among key big data trends’ for 2016


Posted By: admin
Categories :#AnalyticsNews

A move away from centralised data storage approaches, converged analytics platforms and a greater focus on value and quality will be among the key trends facing the big data industry in 2016.

This is according to co-founder and chief executive of MapR John Schroeder, who wrote in an article for IT Pro Portal that as big data analytics has moved beyond a buzzword to become an essential part of many organisations' strategy, it is transforming the enterprise computing environment.

However, this is an area that's constantly evolving. "With many new innovative technologies on the horizon, not to mention a particularly noisy marketplace, differentiating between what is hype and what is just around the corner can be challenging," Mr Schroeder noted.

Therefore, he highlighted several key trends that all businesses looking to improve their big data analytics capabilities will have to consider in 2016.

One of the key areas of focus will be an effort to develop more converged analytics environments. Mr Schroeder said that in the past, it has been accepted best practice to keep operational and analytic systems in separate business applications, in order to prevent analytic workloads from disrupting operational processing.

But this attitude is changing as new tools emerge that can use in-memory data solutions to perform both online transaction processing (OLTP) and online analytical processing (OLAP) without the requirement for data duplication.

"In 2016, converged approaches will become more mainstream as leading organisations reap the benefits of combining production workloads with analytics in response to changing customer preferences, competitive pressures, and business conditions," the MapR chief executive stated. This convergence will also speed up the 'data to action' cycle and remove much of the latency between analytical processes and their impact on business performance.

Mr Schroeder also forecast that 2016 will see a shift away from centralised workload and processing models to more distributed solutions. One reason for this will be to better deal with the challenges of managing multiple devices, data centres and global use cases across multiple locations.

Changes to overseas data security and protection rules brought about by the nullification of the EU-US Safe Harbor agreement will also dictate how companies store, share and process large quantities of data. With Safe Harbor 2.0 on the horizon and set to bring in new restrictions, global companies will need to re-evaluate their approach to cross-border data storage that will affect their analytics activities.

Elsewhere, it was predicted that 2016 will see the market focusing far less on the "bells and whistles" of the latest products, and more on established solutions that have proven business value.

"This year, organisations will recognise the attraction of a product that results in a tangible business impact, rather than on raw big data technologies – which, while promising an exciting new way of working, really just cloud the issues at hand," Mr Schroeder said.

Ultimately, vendors that are able to demonstrate quality will win out in 2016 as businesses demand proven, stable solutions to meet their requirements for better operational efficiency. 

"Now more than ever, an organisation's competitive stance relies on its ability to leverage data to drive business results. That's easier said than done when it’s pouring in from every origin imaginable," Mr Schroeder said.

Big data ‘to add £322bn’ to UK economy by 2020


Posted By: admin
Categories :#AnalyticsNews

The value of big data analytics and Internet of Things (IoT) technology has been highlighted by a new report that forecast the solutions will add £322 billion to the UK's economy alone over the rest of the decade.

The paper is entitled 'The value of big data and the Internet of Things to the UK economy' and was published by the Centre for Economic and Business Research (Cebr) and SAS. It noted the figure is twice the size of the combined budget for education, healthcare and defence for 2014-15 and more than one-fifth of the UK's net public debt for that financial year.

Big data alone is expected to contribute an average of £40 billion a year to the UK economy between 2015 and 2020, and will be worth around 2.2 per cent of the country's gross domestic product by the end of the forecast period.

Manufacturing will be one of the big winners from this, with the sector expected to see a £57 billion boost between 2015 and 2020 as a direct result of big data. This is expected to be driven by the diversity of firms in the industry and the variety of areas in which efficiency gains can be achieved through the use of big data analytics.

For example, the study suggested it could lead to improvements in supply chain management and enhancements in customer intelligence.

By 2020, two-thirds of UK businesses (67 per cent) are expected to have adopted big data solutions, up from 56 per cent last year. The technology will be particularly prevalent in retail banking, with 81 per cent of companies in this sector deploying solutions.

IoT is set for a similar boom, with the adoption rate increasing from 30 per cent in 2015 to 43 per cent by 2020.

Chief executive of Cebr Graham Brough said: "Collecting and storing data is only the beginning. It is the application of analytics that allows the UK to harness the benefits of big data and the IoT. Our research finds that the majority of firms have implemented between one and three big data analytics solutions."

However, he added that the key to success will be not only making sure these tools are extracting maximum insight, but also that firms are able to turn those insights into business actions.

"IoT is earlier in its lifecycle, and will provide more data for analysis in areas that may be new to analytics, reinforcing the potential benefits to the UK economy," Mr Brough said.

The most common reason given for adopting big data analytics tools was to gain better insight into customer behaviour. More than two-fifths (42 per cent) of organisations surveyed stated that they use big data for this purpose.

A similar proportion of businesses (39 per cent) will be turning to IoT solutions in order to cut costs and gain insight into operational data, the report continued.

How can you give your big data the Spark it needs?


Posted By : Paul Groom Comments are off
Categories :Guides

big data, spark

For many firms, one of the biggest challenges when they are implementing big data analytics initiatives is dealing with the vast amount of information they collect in a timely manner.

Getting quick results is essential to the success of such a project. With the most advanced users of the technology able to gain real-time insights into the goings-on within their business and in the wider market, enterprises that lack these capabilities will struggle to compete. While the most alert companies can spot potential opportunities even before they fully emerge, those opportunities may have already passed by the time a slower business’ analytics have even noticed them.

So what can companies do to ensure they are not falling behind with their big data? In many cases, the speed of their analytics is limited by the infrastructure they have in place. But there are a growing number of solutions now available that can address these issues.

Spark and more

One of the most-hyped of these technologies is Apache Spark. This is open-source software that many are touting as a potential replacement for Hadoop. Its key feature is much faster data processing – claimed to be up to ten times faster on disk than Hadoop MapReduce, or 100 times faster for in-memory operations.

In today’s demanding environment, this speed difference could be vital. With optional features for SQL, real-time stream processing and machine learning that promise far more than what generic Hadoop is capable of, these integrated components could be the key to quickly unlocking the potential of a firm’s data.

However, it shouldn’t be assumed that Spark is the only option available for companies looking to boost their data operations. There are a range of in-memory platforms (Kognitio being one!) and open-source platforms available to help with tasks like analytics and real-time processing, such as Apache Flink. And Hadoop itself should not be neglected: Spark should not be seen as a direct replacement for it until its feature set matures, as the two do not perform exactly the same tasks and can – and often should – be deployed together as part of a comprehensive big data solution.

Is your big data fast enough?

It’s also important to remember that no two businesses are alike, so not every firm will benefit from the tech in the same way. When deciding if Spark or analytical platforms like it are for you, there are several factors that need to be considered.

For starters, businesses need to determine how important speedy results are to them. If they have a need for timely or real-time results – for instance as part of a dynamic pricing strategy or if they need to monitor financial transactions for fraud – then the speed provided by Spark and its like will be essential.

As technology such as the Internet of Things becomes more commonplace in businesses across many industries, the speed provided by Spark and others will be beneficial. If companies are having to deal with a constant stream of incoming data from sensors, they will need an ability to deal with this quickly and continuously.

Giving your big data a boost

Turning to new technologies such as Spark or Flink can really help improve the speed, flexibility and efficiency of a Hadoop deployment. One of the key reasons for this is the fact that they take full advantage of in-memory technology.

In traditional analytics tools, information is stored on, read from and written to physical storage such as hard disk drives during processing – MapReduce will do this many times for a given job. This is typically one of the biggest bottlenecks in the processing operation and therefore a key cause of slow, poorly-performing analytics.

However, technologies such as Spark conduct the majority of their tasks in-memory – copying the information into much faster RAM and keeping it there as much as possible, where it can be accessed near-instantaneously. As the cost of memory continues to fall, these powerful capabilities are now within much easier reach of many businesses and at a scale not previously thought possible.

Are you ready to dive into the data lake?


Posted By : Paul Groom Comments are off
Categories :Guides

data lake, big data

By now, big data analytics has well and truly passed the hype stage, and is becoming an essential part of many businesses’ plans. However, while the technology is maturing quickly, there are still many questions about how to go about implementing it.

One strategy that’s appearing more often these days is the concept of the ‘data lake’. For the uninitiated, this involves taking all the information that a business gathers – often from a large range of wildly diverse sources and data types – and placing it into a single location from which it can be accessed and analysed.

Powerful tools such as Hadoop have made this a much more practical solution for all enterprises – so it’s no longer just something limited to the largest firms. But what will it mean for businesses in practice? Understanding how to make the most of a data lake – and what pitfalls need to be avoided – is essential if businesses are to convert the potential of their data into real-world results.

One repository, one resource

One of the key features of a data lake is that it enables businesses to break free from traditional siloed approaches to data warehousing. Because the majority of the data a company possesses is available in one place, it can ensure it has a full picture when building analytical or reporting models.

As well as more certainty in the accuracy of results, taking a data lake approach means businesses will find it much easier and cheaper to scale up as their data volumes and usage grow. And not only is this scalability cheap, a strong data lake is capable of holding amounts of raw data that would be unthinkable for a traditional data warehouse.

A data lake will be particularly useful if a business is dealing with large amounts of unstructured data or widely varying data, where it can be difficult to assign a specific, known set of attributes to a piece of information – such as social media or image data.

Clear results need clear waters

However, in order to make the most of this, businesses still need to be very careful about the information they pour into their data lakes. Inaccurate, vague or outdated information will end up seriously compromising the results a company sees.

In order to deliver effective insights, the water in your data lake needs to be as crystal clear as possible. The murkier it gets, the worse it will perform. Think of it this way – you wouldn’t drink the water from a swamp, so why should you trust the results from a dirty data lake?

This concept – most famously known as garbage in, garbage out – can dictate the success or failure of a data lake strategy. Therefore, while it may be tempting to simply pump everything into a data lake and worry later about what it all means, taking the time to assess and clean all the information first will pay dividends when the time comes to review analytics results. Even something as straightforward as ensuring everything has accurate metadata attached can make a big difference when it comes to wading through the lake to qualify data for use in a study.

Making sure you don’t drown

Resisting this urge to simply dump data into a data lake, hoping that the sheer volume will overcome any shortcoming in the quality, will be one of the key factors in making a data lake strategy a success. But it’s not the only philosophy for a strong solution.

With so many sources of data in the same place, it’s easy to become overwhelmed and end up drowning in all the information potential. In order to get quality results, users must make sure they are asking appropriate questions of the right combinations of data. The more focused and specific analysts can make their queries, the more likely they are to see valuable outcomes.

Meanwhile, other risks of the data lake include the varied security implications. With so much data coming together in one place, it may be a tempting target for criminals, so strong protections, access controls and auditing are a must.

As one of the most hyped technologies of the big data revolution, many firms will be interested in what data lakes can offer. But, as the old adage says, it’s important to look before you leap – otherwise you could be diving head-first into a world of trouble.


Data Science and Big Data Challenges


Posted By : Paul Groom Comments are off
Data Science, Big Data challenges
Categories :#AnalyticsNews, Blog

71 percent of data scientists believe their jobs have grown more difficult

This is an interesting study with observations on data scientists suffering from increasing difficulty with data, systems and parallelism for analytics. But this is partly brought on by trying to self-build from a diverse set of components that do not always fit together particularly well, or at least need more than a good nudge to click together. Analytics has to move to parallelism to cope with increasing volume and variety and keep pace with the velocity of business needs – but the parallelism in Hadoop, Spark and the like is not refined or complete enough to easily support complex analytics.

Paradigm4 also found that 35 percent of data scientists who tried Hadoop or Spark have stopped using it.

Kognitio has experience in helping businesses get over these hurdles with world-leading MPP analytical platform technology that integrates directly with Hadoop and other big data stores, and with Analytical Services that help business teams solve requirements without a lot of technical fuss – analytics for all, at scale, today.



Many CPUs make light work of complex text comparison


Posted By : Paul Groom Comments are off
Levenshtein edit distance, text comparison
Categories :Blog

Feature of the Month – EDIT_DISTANCE function

This is a great example of how useful little functions can place extreme demands on your computing infrastructure by virtue of the underlying maths and iterations required. In this case the function calculates the Levenshtein edit distance between two strings. Kognitio’s scale-out processing across all available servers and all available cores conquers this with ease for large row sets.

From a business perspective this is about finding degrees of similarity in a mass of possibilities; text is all too prone to typos and misspellings – especially in the quick thumb-typing of the social media world!
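To see why the function is so compute-hungry, here is a minimal Python sketch of the standard dynamic-programming algorithm for the Levenshtein edit distance. This illustrates the underlying maths only – it is not Kognitio’s implementation, and the SQL EDIT_DISTANCE function should be used for in-database work:

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance via dynamic programming.

    Each pair of strings costs O(len(a) * len(b)) cell updates, which is
    why comparing millions of row pairs benefits from scale-out MPP
    processing across many cores.
    """
    # prev[j] holds the distance between a[:i-1] and b[:j] (previous row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # classic example: prints 3
```

A distance of 3 for "kitten" vs "sitting" corresponds to two substitutions and one insertion – small numbers like this are what let an analyst rank near-matches among misspelled names or hashtags.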

Read more from Dr Sharon Kirkham on this topic:

Forum Post
PDF Download