Strata + Hadoop World – San Jose

22 Mar 2017
Posted By: Sharon Kirkham
Categories: #AnalyticsNews, Blog

The Kognitio team had a great trip to Strata + Hadoop World in San Jose last week and we would like to say a big thank you to everyone who stopped by for a chat about getting enterprise-level performance for their SQL on Hadoop. We look forward to hearing from you when you try out Kognitio on Hadoop.

At the start of the conference we released our benchmarking whitepaper, in which Kognitio outperformed Impala and Spark in a TPC-DS benchmarking exercise. This proved to be of great interest and kept us all really busy on the stand. Conversations ranged from people who have been using Hadoop for a while and are having problems serving data to end-user applications such as Tableau and Qlik Sense, right through to those who are just starting out on their Hadoop journey and wanted to understand what Kognitio can bring to their solution stack.

The subject matter of the conference sessions suggests there is a period of consolidation going on within the Apache® Hadoop® solution stack. Most sessions discussed how to get the most from the more established projects and the challenges of enterprise adoption. There was very little new research presented, which was a bit disappointing.

 

Marcel Kornacker and Mostafa Mokhtar from Cloudera presented a really interesting talk on optimising Impala performance. They had also been using the TPC-DS query set for benchmarking, although they had to use a cut-down version of the query set (75 out of 99 queries). The optimisation details will be useful for us to follow for Impala when we do the next round of benchmarking after Kognitio 8.2 is released in April. Their benchmarks were at the 1TB and 10TB scales. Increasing scale to 10TB and concurrency above 10 streams is something that we would definitely like to do during the next set of benchmarks.

From a maths perspective it was great to see Bayesian inference in the data science mix. Michael Lee Williams from Fast Forward Labs presented a great overview. I will certainly be checking out some of the algorithms and tools with a view to parallelising them within Kognitio’s external scripting framework.

Data streaming also continues to be at the forefront of the conference. The number of sessions made it clear that more companies (such as Capital One) now have experiences they want to share, alongside plenty of contributions from established technology leaders such as Confluent. It is certainly something that we are thinking about here.

If you didn’t make it to our booth at San Jose we hope to see you at one of these upcoming events:

DWS17, Munich: we’ll be on Booth #1003.

Strata Data Conference, London, 23-25 May 2017: Booth #511.

 

Using Kognitio on Amazon Elastic Map/Reduce

12 Jan 2017
Posted By: Andy MacLean

Amazon’s Elastic Map/Reduce product provides Hadoop clusters in the cloud. We’ve had several requests for the Hadoop version of our product to work with EMR. As of release 8.1.50-rel161221 we have made the two products compatible so you can use EMR to run Kognitio clusters. This article will show you how to get Kognitio clusters up and running on EMR.

In order to run Kognitio on EMR you will need:

  • An AWS account with permission to create EMR clusters and S3 buckets
  • An EC2 key pair for connecting to the cluster’s master node
  • The Kognitio on Hadoop software packages, kodoop.tar.gz and kodoop-extras.tar.gz (release 8.1.50-rel161221 or later)
  • An S3 bucket to hold the packages and setup scripts

This article assumes some basic familiarity with Amazon’s environment and the EMR feature so if you’re new to Amazon you’ll probably want to experiment with it a little first before trying to create a large Kognitio cluster. I’m also assuming that you’re creating a brand new EMR cluster just for Kognitio. If you want to integrate Kognitio with an existing EMR cluster you will need to modify these instructions accordingly.

Getting ready to start

Before you start you’re going to need to decide how to structure the Hadoop cluster and how the Kognitio cluster will look on it. Amazon clusters consist of various groups of nodes: the ‘master node’, which runs Hadoop-specific cluster master programs like the HDFS namenode and Yarn resource manager; the ‘Core’ group of nodes, which hold HDFS data and run Yarn containers; and optional extra ‘Task’ groups, which run Yarn jobs but don’t hold HDFS data. When running on Hadoop, Kognitio runs as a Yarn application with one or more controlling ‘edge nodes’ that also act as gateways for clients. The Kognitio software itself only needs to be installed on the edge node(s) as the user running it; it gets transferred to other nodes as part of the Yarn task that runs it.

For most EMR clusters it makes sense to use the EMR master node as the Kognitio edge node, so that’s how this example will work. There are other possible choices here: you can use one of the cluster nodes, spin up a specific task group node for the job, or use an arbitrary EC2 node with the right security settings and client software installed. However, the master node is already doing similar jobs and using it is the simplest way to get up and running. For the rest of the cluster it’s easiest to have no task groups and run the whole application on Core nodes, although using task groups does work if you need to do that.

Configuring the master node

The master node also needs to be configured so that it can be used as the controlling ‘edge node’ for creating and managing one or more Kognitio clusters. For this to work you need to create a user for the software to run as, set it up appropriately and install/configure the Kognitio software under that user. Specifically:

  • Create a ‘kodoop’ user
  • Create an HDFS home directory for it
  • Set up authentication keys for it
  • Unpack the kodoop.tar.gz and kodoop-extras.tar.gz tarballs into the user’s home directory
  • Configure slider so it can find the zookeeper cluster we installed
  • Configure the Kognitio software to make clusters that use compressed messages

You can do this with the following shell script:

#!/bin/bash

#change the s3 bucket for your site
S3BUCKET=s3://kognitio-development

sudo useradd -c "kodoop user" -d /home/kodoop -m kodoop
HADOOP_USER_NAME=hdfs hadoop fs -mkdir /user/kodoop
HADOOP_USER_NAME=hdfs hadoop fs -chown kodoop /user/kodoop
sudo cp -r ~ec2-user/.ssh ~kodoop
sudo chown -R kodoop ~kodoop/.ssh

aws s3 cp $S3BUCKET/kodoop.tar.gz /tmp
aws s3 cp $S3BUCKET/kodoop-extras.tar.gz /tmp

sudo su - kodoop <<EOF
tar -xzf /tmp/kodoop.tar.gz
tar -xzf /tmp/kodoop-extras.tar.gz
echo PATH=~/kodoop/bin:\\\$PATH >>~/.bashrc

hn=`hostname`
grep -v '<\/configuration>' kodoop/slider/conf/slider-client.xml >/tmp/slider-client.xml
cat <<XXX >>/tmp/slider-client.xml
  <property>
    <name>slider.zookeeper.quorum</name>
    <value>\$hn:2181</value>
  </property>
</configuration>
XXX
cp  kodoop/slider/conf/slider-client.xml  kodoop/slider/conf/slider-client.xml.orig
cp /tmp/slider-client.xml  kodoop/slider/conf/slider-client.xml

cat >kodoop/config/server_defaults.cfg <<XXX
[runtime parameters]
rs_messcomp=1    ## turn on message compression
XXX
EOF

This script creates the user first, then it pulls the tarballs from an s3 bucket called s3://kognitio-development (You’ll want to change that to be your own bucket’s name and upload the tarballs into it). It then switches to the kodoop user, extracts everything and configures slider. The slider configuration required is the location of the zookeeper server which was installed with the cluster. This will be on port 2181 of the master node and this is the information that goes into slider-client.xml.

The final part of the script defines the rs_messcomp=1 setting for Kognitio clusters created on the EMR instance. This enables message compression: messages are compressed with the LZ4 algorithm before being sent over the network. This setting is not normally used but we recommend it on Amazon because the network:cpu speed ratio there is such that it results in a speedup.

You can transfer this script to the master node and run it as ec2-user once the cluster starts, but it’s a lot nicer to have it run automatically as part of the cluster startup. You can do this by transferring the script to S3 and putting it in a directory together with the tarballs (editing the s3 bucket name in the script appropriately). You can then specify the script during cluster creation as a custom action to get it run automatically (see below).
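For example, assuming the names used in this article (the s3://kognitio-development bucket and a script saved as kog-masternode are placeholders; substitute your own), the upload might look like this:

aws s3 cp kog-masternode s3://kognitio-development/kog-masternode
aws s3 cp kodoop.tar.gz s3://kognitio-development/kodoop.tar.gz
aws s3 cp kodoop-extras.tar.gz s3://kognitio-development/kodoop-extras.tar.gz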

Creating the EMR cluster

Go to the Amazon EMR service in the AWS web console and hit ‘create cluster’ to make a new EMR cluster. You will then need to use ‘go to advanced options’ because some of the settings you need are not in the quick options. Now you have 4 pages of cluster settings to go through in order to define your cluster. Once you’ve done this and created a working cluster you will be able to make more clusters by cloning and tweaking a previous one or by generating a command line and running it.

This section will talk you through the settings you need to get a Kognitio cluster running without really getting into the other settings available. The settings I don’t mention can be defined any way you like.

Software Selection and Steps

Choose ‘Amazon’ as the vendor and select the release you want (we’ve tested it with emr-5.2.1 at the time of writing). Kognitio only needs Hadoop and Zookeeper to be selected from the list of packages, although adding others that you may need to run alongside it won’t hurt.

In the ‘Edit software settings’ box you may find it useful to enter the following:

[{"classification":"core-site","properties":{"yarn.nodemanager.delete.debug-delay-sec":"3600"}}]

This instructs yarn to preserve container directories for 1 hour after a container exits, which is very useful if you need to do any debugging.

If you want to have the master node configured automatically as discussed above, you will need to add an additional step here to do that. You can add a step by setting the step type to ‘Custom JAR’ and clicking configure. The Jar Location field should be set to s3://elasticmapreduce/libs/script-runner/script-runner.jar (if you like you can use s3://<regionname>.elasticmapreduce/ to make this a local read) and the argument is the full s3 path for the script you uploaded to S3 in the section above (e.g. s3://kognitio-development/kog-masternode). The script will then run automatically on the master node after startup and the cluster will come up with a ‘kodoop’ user created and ready to go.

Hardware Selection

In the hardware selection page you need to tell EMR how many nodes to use and which type of VM to use for them. Kognitio doesn’t put much load on the master node, so this can be any instance type you like; the default m3.xlarge works well.

The Core nodes can generally be anything which has enough memory for your cluster and the right memory:CPU ratio for you. For optimal network performance you should use the largest size of a given node type rather than a larger number of smaller instances (so 3x r3.8xlarge instead of 6x r3.4xlarge, for example). The r3.8xlarge or m4.16xlarge instance types are good choices. You will want to have more RAM than you have data because of the Hadoop overhead and the need for memory workspace for queries. A good rule of thumb is for the total RAM of the nodes which will be used for the Kognitio cluster to be between 1.5x and 2x the size of the raw data you want to load as memory images.
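As a rough, hypothetical sizing example: to hold 400GB of raw data as memory images you would aim for roughly 600-800GB of RAM across the Core nodes, so three r3.8xlarge instances (244GiB each, about 732GB in total) would be a reasonable fit.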

You won’t need any task groups for this setup.

General Cluster Settings and Security

In the ‘General Cluster Settings’ pane you will want to add a bootstrap action for your nodes. This is required because the AMI used by EMR needs a small amount of configuration and some extra Linux packages installed in order to run Kognitio’s software. The best way to do this is to place a configuration script in an S3 bucket and define it as a ‘custom action’ bootstrap action. The following script does everything you need:

#!/bin/bash

sudo yum -y install glibc.i686 zlib.i686 openssl.i686 ncurses-libs.i686
sudo mount /dev/shm -o remount,size=90%
sudo rpm -i --nodeps /var/aws/emr/packages/bigtop/hadoop/x86_64/hadoop-libhdfs-*

This script installs some extra Linux packages required by Kognitio. Then it remounts /dev/shm to allow shared memory segments to use up to 90% of RAM. This is necessary because Kognitio clusters use shared memory segments for nearly all of the RAM they use. The final step looks a bit unusual but Amazon doesn’t provide a simpler way to do it. Kognitio requires libhdfs but Amazon doesn’t install it out of the box unless you install a component which uses it. Amazon runs the bootstrap action before the relevant repositories have been configured on the node, so the RPM can’t be installed via yum; by the time we come to use libhdfs all the dependencies will be in place and everything will work.

Finally, the Kognitio server will be accessible from port 6550 on the master node so you may want to configure the security groups in ‘EC2 Security Groups’ to make this accessible externally.
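If you prefer to script cluster creation rather than click through the console, the settings above can be approximated with a single AWS CLI call. The sketch below is only illustrative: the cluster name, key pair, bucket and script names (kog-bootstrap for the bootstrap script, kog-masternode for the master node setup script) are assumptions you will need to replace with your own values.

# Illustrative sketch only: names, bucket and key pair are placeholders
aws emr create-cluster \
  --name "kognitio-emr" \
  --release-label emr-5.2.1 \
  --applications Name=Hadoop Name=ZooKeeper \
  --use-default-roles \
  --ec2-attributes KeyName=my-keypair \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=3,InstanceType=r3.8xlarge \
  --configurations '[{"Classification":"core-site","Properties":{"yarn.nodemanager.delete.debug-delay-sec":"3600"}}]' \
  --bootstrap-actions Path=s3://kognitio-development/kog-bootstrap,Name=kog-prep \
  --steps Type=CUSTOM_JAR,Name=kog-masternode-setup,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://kognitio-development/kog-masternode]

Check the exact application names for your chosen EMR release in the Amazon documentation before running this.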

Creating a Kognitio cluster

Once you have started up your cluster and created the kodoop user (either manually or automatically), you are ready to build a Kognitio cluster. You can ssh into the master node as ‘kodoop’ and run ‘kodoop’. This will invite you to accept the EULA and display some useful links for documentation, forum support, etc that you might need later. Finally you can run ‘kodoop testenv’ to validate that the environment is working properly.
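As a quick sketch (the key file and host name are placeholders for your own key pair and your master node’s public DNS name):

ssh -i my-keypair.pem kodoop@ec2-nn-nn-nn-nn.eu-west-1.compute.amazonaws.com
kodoop            # accept the EULA and note the documentation links
kodoop testenv    # check the environment is ready for cluster creation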

Once this is working you can create a Kognitio cluster. A cluster is built from a number of Yarn containers, so you will need to choose a container memory size, a container vcore count and the number of containers you want to use. Normally you’ll want a single container per node which uses nearly all of its memory. You can list the nodes in your cluster on the master node like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -list
17/01/09 18:40:26 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Total Nodes:3
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
ip-172-40-0-91.eu-west-1.compute.internal:8041          RUNNING ip-172-40-0-91.eu-west-1.compute.internal:8042                             1
ip-172-40-0-126.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-126.eu-west-1.compute.internal:8042                            2
ip-172-40-0-216.eu-west-1.compute.internal:8041         RUNNING ip-172-40-0-216.eu-west-1.compute.internal:8042                            1

Then for one of the nodes, you can find out the resource limits like this:

[kodoop@ip-172-40-0-213 ~]$ yarn node -status ip-172-40-0-91.eu-west-1.compute.internal:8041
17/01/09 18:42:07 INFO client.RMProxy: Connecting to ResourceManager at ip-172-40-0-213.eu-west-1.compute.internal/172.40.0.213:8032
Node Report : 
        Node-Id : ip-172-40-0-91.eu-west-1.compute.internal:8041
        Rack : /default-rack
        Node-State : RUNNING
        Node-Http-Address : ip-172-40-0-91.eu-west-1.compute.internal:8042
        Last-Health-Update : Mon 09/Jan/17 06:41:43:741UTC
        Health-Report : 
        Containers : 0
        Memory-Used : 0MB
        Memory-Capacity : 253952MB
        CPU-Used : 0 vcores
        CPU-Capacity : 128 vcores
        Node-Labels :

The ‘Memory-Capacity’ field here shows the maximum container size you can create and CPU-Capacity shows the largest number of vcores. In addition to the Kognitio containers, the cluster also needs to be able to create a 2048MB application management container with 1 vcore. If you set the container memory size to be equal to the capacity and put one container on each node then there won’t be any space for the management container. For this reason you should subtract 1 from the vcore count and 2048 from the memory capacity.
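In the example above that means a container memory size of 253952 - 2048 = 251904MB and 128 - 1 = 127 vcores per container.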

You will also need to choose a name for the cluster which must be 12 characters or less and can only contain lower case letters, numbers and an underscore. Assuming we call it ‘cluster1’ we would then create a Kognitio cluster on the above example cluster like this:

CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1

This will display the following and invite you to confirm or cancel the operation:

[kodoop@ip-172-40-0-213 ~]$ CONTAINER_MEMSIZE=251904 CONTAINER_VCORES=127 CONTAINER_COUNT=3 kodoop create_cluster cluster1
Kognitio Analytical Platform software for Hadoop ver80150rel170105.
(c)Copyright Kognitio Ltd 2001-2017.

Creating Kognitio cluster with ID cluster1
=================================================================
Cluster configuration for cluster1
Containers:               3
Container memsize:        251904 Mb
Container vcores:         127

Internal storage limit:   100 Gb per store
Internal store count:     3

External gateway port:    6550

Kognitio server version:  ver80150rel170105

Cluster will use 738 Gb of ram.
Cluster will use  up to 300 Gb of HDFS storage for internal data.

Data networks:            all
Management networks:      all
Edge to cluster networks: all
Using broadcast packets:  no
=================================================================
Hit ctrl-c to abort or enter to continue

If this looks OK, hit enter and the cluster will be created. Once creation is completed you will have a working Kognitio server up and running and ready to use.

Next steps

At this point you should have a working Kognitio cluster up and ready to use. If you’re already a Kognitio user you probably know what you want to do next and you can stop reading here. This section is intended as a very brief quickstart guide to give new users an idea of the most common next steps. This is very brief and doesn’t cover all the things you can do. Full documentation for the features discussed below is available from our website.

You can download the Kognitio client tools from www.kognitio.com, install them somewhere, run Kognitio Console and connect to port 6550 on the master node to start working with the server. Alternatively you can just log into the master node as kodoop and run ‘kodoop sql <system ID>’ to issue SQL locally. Log in as ‘sys’ with the system ID as the password (it is a good idea to change this!).
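For example, using the cluster created above (system ID ‘cluster1’):

kodoop sql cluster1

Then log in as ‘sys’ with ‘cluster1’ as the initial password and type SQL at the prompt.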

There are now lots of different ways you can set up your server and get data into it, but the most common approach is to build memory images (typically view images) to run SQL against. This is usually a two-step process: first create external tables which reference data held outside the cluster, then create view images on top of them to pull that data directly from the external source into a memory image. In some cases you may also want to create one or more regular tables and load data into them using wxloader or another data loading tool, in which case Kognitio will store a binary representation of the data in the HDFS filesystem.

Connecting to data in HDFS

Kognitio on Hadoop starts with a connector called ‘HDFS’ which is configured to pull data from the local HDFS filesystem. You create external tables which pull data from this either in Kognitio console or via SQL. To create external tables using console you can open the ‘External data sources’ part of the object tree and expand ‘HDFS’. This will allow you to browse the object tree from console and you’ll be able to create external tables by right clicking on HDFS files and using the external table creation wizard.

To create an external table directly from SQL you can use a syntax like this:

create external table name (<column list>) from HDFS target 'file /path/to/csv/files/with/wildcards';

Kognitio is able to connect to a variety of different data sources and file formats in this manner. See the documentation for full details. As a quick example we can connect to a 6-column CSV file called test.csv like this:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test.csv';

If it is instead a directory full of CSV files, we can use ‘/path/to/file/test/*.csv’ to treat them all as a single table in Kognitio.
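For example, a sketch using the same hypothetical six-column layout:

create external table test_dir (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from HDFS target 'file /path/to/file/test/*.csv';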

Connecting to data in Amazon S3

Kognitio can also pull data directly out of Amazon S3. The Amazon connector is not loaded by default and it isn’t able to use the IAM credentials associated with the EMR nodes so you need to get a set of AWS credentials and configure your server with the following SQL:

create module aws;
alter module aws set mode active;
create group grp_aws;

create connector aws source s3 target 
'
accesskey YOUR_ACCESS_KEY
secretkey "YOUR_SECRET_KEY"
max_connectors_per_node 5
bucket your-bucket-name
';

grant connect on connector aws to grp_aws;

This SQL loads the Kognitio Amazon plugin, creates a group to control access to it and then creates an external table connector which uses the plugin. You will need to give the connector some Amazon credentials where it says YOUR_ACCESS_KEY and YOUR_SECRET_KEY and you will need to point it at a particular storage bucket. If you want to have multiple storage buckets or use multiple sets of credentials then create multiple connectors and grant permission on different ones to the appropriate sets of users. Granting the ‘connect’ permission on a connector allows users to make external tables through it; in this case you can just add them to the group grp_aws, which has this permission.

max_connectors_per_node is needed here because the Amazon connector gives out-of-memory errors if you try to run too many instances of it in parallel on each node.
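For example, a second connector for a different bucket and set of credentials might look like this (the connector, group, bucket and key names are all placeholders):

create connector aws2 source s3 target 
'
accesskey ANOTHER_ACCESS_KEY
secretkey "ANOTHER_SECRET_KEY"
max_connectors_per_node 5
bucket another-bucket-name
';
create group grp_aws2;
grant connect on connector aws2 to grp_aws2;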

Now an external table can be created in exactly the same way as in the HDFS example. If my Amazon bucket contains a file called test.csv with 6 int columns in it I can say:

create external table test (f1 int, f2 int, f3 int, f4 int, f5 int, f6 int) from AWS target 'file test.csv';

Creating memory images

Once you have external tables defined your server is ready to start running queries, but each time you query an object the server will go out to the remote data and pull it into the server. Kognitio is capable of running like this but most people prefer to create memory images and query those instead because this allows data to be queried very fast. There are several different kinds of memory image in Kognitio but the most commonly used images are view images. With a view image the user defines a view in the normal SQL way and then they image it, which makes an in-memory snapshot of the query. This can be done with this SQL:

create view testv as select * from test;
create view image testv;

So testv is now a memory image. Images can be created with various different memory distributions which tell the server which nodes will store which rows. The most common of these are:

  • Hashed — A hash function on some of the columns determines which nodes get which rows
  • Replicated — Every row goes to every RAM processing task
  • Random — Just put the rows anywhere. This is what we will get in the example above.

The various memory distributions can be used to help optimise queries. The server will move rows about automatically if they aren’t distributed correctly but placing rows so they are co-located with certain other rows can improve performance. As a general rule:

  • Small tables (under 100M in size) work best replicated
  • For everything else, hash on the primary key, except:
  • For the biggest images which join to non-replicated tables, hash on the foreign key to the biggest of the foreign tables
  • Use random if it isn’t obvious what else to use

And the syntax for these is:

create view image test replicated;
create view image test hashed(column, column, column);
create view image test random;
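For example, applying those rules to a hypothetical small lookup view and a large fact view that joins to it:

create view countriesv as select * from countries;
create view image countriesv replicated;

create view salesv as select * from sales;
create view image salesv hashed(country_id);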

Imaging a view which queries one or more external tables will pull data from the external table connector straight into RAM without needing to put any of it in the Kognitio internal storage. Once the images are built you are ready to start running SQL queries against them.

Chief data officers ‘essential’ to big data success

13 Dec 2016
Posted By: admin
Image credit: iStockphoto/emyerson
Categories: #AnalyticsNews

Organisations that invest in skilled executives to manage their big data analytics projects are better-placed to see success in this area than those that do not, a new report has indicated.

A study of US federal agencies conducted by MeriTalk and ViON Corporation revealed that almost all these bodies (92 per cent) use big data to some degree. However, the majority (58 per cent) graded the effectiveness of their data management strategy as C or worse.

Therefore, having the right personnel on hand to control the direction of such projects will be invaluable. The study found that 88 per cent of organisations with a chief data officer (CDO) leading these efforts report these executives have had a positive impact on their performance.

Meanwhile, 93 per cent of agencies that currently lack a CDO agreed that employing one would have a positive effect on their big data strategies.

Two-thirds (67 per cent) of organisations that do not have a CDO stated their agency lacks leadership when it comes to big data analytics efforts. Organisations with a CDO are also more likely to successfully incorporate big data analytics into their decision making than those without (61 per cent compared with 28 per cent).

Rodney Hite, director of big data and analytics solutions at ViON, said that as organisations are being inundated with huge amounts of data every day, how they manage this information and turn it into insight will be critical.

"Implementing a CDO ensures your agency is focusing the right amount on mission-critical data management goals – while storing and protecting data throughout the process," he continued. "Regardless of whether an agency has one or not, the majority – 57 per cent – believe the CDO will be the hero of big data and analytics."

More than three-quarters (76 per cent) of organisations with a CDO say this individual has taken ownership of data management and governance issues. The primary responsibilities of these personnel include centralising an organisation's data (55 per cent), protecting this information (51 per cent) and improving the quality of data (49 per cent).

Other areas where CDOs have influence include coping with open government data efforts, bridging the gap between IT and operations and "leveraging data to help set and achieve realistic goals".

However, although the benefits of having a CDO are clear, many agencies are not giving these personnel the support they need. The research found just one in four organisations (25 per cent) have a deputy CDO, while the same number have a chief data scientist and only 29 per cent have a chief analytics officer.

This is a situation that is unlikely to change in the near future, as less than a quarter of survey respondents expect to be hiring for any of these roles in the next two years.

However, the good news is that 92 per cent of agencies report their CDO has a strong working relationship with the chief information officer, which ensures the organisation is able to keep pace with the technological realities of big data and analytics. 

Don’t delete big data, companies urged

06 Dec 2016
Posted By: admin
Categories: #AnalyticsNews

Companies performing ad-hoc big data analytics operations have been reminded of the importance of keeping the data used in the process after it is completed.

Speaking at an IT Leaders Forum organised by Computing.com, Alex Chen, IBM's director of file, object storage and big data flash, explained that businesses may need to refer back to this information at a later date. This may be in order to meet regulatory requirements, or simply because people want to investigate what happened and why a particular decision was taken.

At the moment, many organisations are still in the early adoption stage when it comes to big data, which means they may be performing a large number of experimental and ad-hoc analyses as they learn how to bring this technology into their everyday operations.

Mr Chen said: "It's likely that someone in a line-of-business [in many organisations] has spinned-up a Hadoop cluster and called it their big data analytics engine. They find a bunch of x86 servers with storage, and run HDFS."

Many people tend to throw away this data after it has been processed in order to keep their system running efficiently. Mr Chen noted that even in these ad-hoc deployments, it is not terabytes, but petabytes of data that are being ingested, and the more data that has to be analysed, the longer it will take.

But while deleting this data may keep analytics processes running as fast as possible, it could mean businesses have no answers when they need to demonstrate what led them to their final decision.

"Performing analytics generates a lot more meta-data, too, and due to regulations or business requirements people may just want to see what happened and why they made certain decisions. So you will need to re-run the analytics that were run before," Mr Chen continued. "So you can't just throw away the data any more."

Harvard seeks to tackle big data storage challenges

01 Dec 2016
Posted By: admin
Categories: #AnalyticsNews

With a growing number of companies looking to expand their big data analytics operations in the coming years, one key consequence of this will be an explosion in the amounts of data that businesses will have to store.

Therefore, finding cost-effective solutions for this will be essential if such initiatives are to be successful. While turning to technologies such as cloud computing could be the answer for many businesses today, as data volumes continue to grow at an exponential rate, new and improved solutions may be required.

This is why developers at Harvard University have been working to develop new infrastructure that is able to cope with this influx of information and support critical research taking place throughout the institution.

James Cuff, Harvard assistant dean and distinguished engineer for research computing, said: "People are downloading now 50 to 80 terabyte data sets from NCBI [the National Center for Biotechnology Information] and the National Library of Medicine over an evening. This is the new normal. People [are] pulling genomic data sets wider and deeper than they’ve ever been."

He added that what used to be considered cutting edge practices that depended on large volumes of data are now standard procedures.

Therefore, the need for large storage capabilities is obvious. That's why earlier this year, Harvard received a grant of nearly $4 million from the National Science Foundation for the development of a new North East Storage Exchange (NESE). This is a collaboration between five universities in the region, with Massachusetts Institute of Technology, Northeastern University, Boston University, and the University of Massachusetts also taking part.

The NESE is expected to provide not only enough storage capacity for today's massive data sets, but also give the participating institutions the high-speed infrastructure that is necessary if data is to be retrieved quickly for analysis.

Professor Cuff stated that one of the key elements of the NESE is that it uses scalable architecture, which will ensure it is able to keep pace with growing data volumes in the coming years. He noted that by 2020, officials hope to have more than 50 petabytes of storage capacity available at the project's Massachusetts Green High Performance Computing Center (MGHPCC).

John Goodhue, MGHPCC's executive director and a co-principal investigator of NESE, added that he also expects the speed of the connection to collaborating institutions to double or triple over the next few years.

Professor Cuff noted that while NESE could be seen as a private cloud for the collaborating institutions, he does not expect it to compete with commercial cloud solutions. Instead, he said it gives researchers a range of data storage options for their big data-driven initiatives, depending on what they hope to achieve.

"This isn't a competitor to the cloud. It’s a complementary cloud storage system," he said.

Financial services firms to embrace real-time analytics

30 Nov 2016
Posted By: admin
Categories: #AnalyticsNews

A growing number of companies in the financial services sector are set to upgrade their big data analytics initiatives to include real-time solutions, a new report has claimed.

A study by TABB Group noted there is an increasing understanding in the sector that the value of a given piece of data can be lost almost immediately as it becomes outdated. Therefore, capital markets firms are turning to real-time analytics for activities including risk management, compliance, consumer metrics and turning insight into revenue.

Author of the report Monica Summerville noted that simply having data is no longer useful, and traditional ways of thinking about analytics, such as data warehousing and batch-led approaches to analytics, no longer apply.

In today's environment, firms must be able to find and act on patterns in incredibly large data sets in real time, while also being able to reference older, stored data as part of a streaming analytics operation without reverting to batch processing.

"The market for real time big data analytics is potentially as large as business intelligence, real-time streaming and big data analytics combined," Ms Summerville said. "The most successful approaches understand the importance of data acquisition to this process and successfully combine the latest open source technologies with market leading commercial solutions."

Implementing effective solutions for this will be challenging and requires companies to invest in software, hardware and data, as well as personnel with expertise in the sector.

Therefore, in order to ensure businesses can see a quick return on investment, TABB stated they will have to take big data analytics 'upstream' by layering streaming and static big data sets to support real time analysis of combined data sets. 

Such capabilities will be a key requirement if financial services firms are to progress to technologies like machine learning and other artificial intelligence based analytics.

Ms Summerville said: "We believe the upstream analytics approach will increasingly be adopted throughout the industry in response to industry drivers, an unending desire for new sources of alpha and the rising complexity of investment approaches."

How HelloFresh embraced Hadoop

28 Nov 2016
Posted By: admin
Categories: #AnalyticsNews

As businesses grow, it becomes more critical for them to have a solution that will effectively handle the increasing amounts of data they generate. However, one problem that many organisations find when they are expanding is that tools that were adequate when they were developed are not able to scale along with the company.

This was the problem facing Berlin-based home meal delivery firm HelloFresh. The five-year-old firm has expanded rapidly and now delivers more than 7.5 million meals a month to 800,000 subscribers in multiple countries. Therefore, it found itself quickly outgrowing the custom-made business intelligence system it had long relied on, and needed a new solution.

In a recent interview with InformationWeek, the company's chief technology officer, Nuno Simaria, explained how the company had been using a home-built business intelligence system based around PHP, using a mix of a relational database and key-value storage for pre-calculated data. However, as the business grew, the limitations of this became clear.

One problem was it did not offer the flexibility or detail analysts needed. While it could track essential KPIs to provide details of what was happening within the business, it was unable to offer insight into the reasons behind any changes.

"It was definitely not a good idea, but at the time it was the technology we were most comfortable with," Mr Simaria said.

The system was also approaching the limits of its capacity, so it became obvious a change was required. The company looked at several options that would offer improved big data analytics performance, including MemSQL and SAP HANA, but ultimately, it was Apache Hadoop that won out.

Part of the reason for this was its low cost compared with competitors. Because the tools can offer high performance even on inexpensive commodity hardware, there was no need for HelloFresh to upgrade these areas. This made Hadoop a highly attractive option, even though the company's team did not have much familiarity with the technology.

This led to its own challenges. Mr Simaria explained that finding skilled engineers in the market was very difficult. Therefore, the firm's approach was to give two of its existing staff the time and resources they needed to learn about the tools.

"We'll give you the budget, and we'll give you the time," he said. "This is something we've done with other technologies as well. If it is not easy for us to access talent in the market in the short term, we will empower our developers and our engineers who are interested in problem solving, and we will let them discover the complexities of that technology."

At the end of this process, the engineers had to answer three questions: is Hadoop the right technology; how can the firm migrate existing resources to it; and what distribution should be used moving forward?

The result of the Hadoop deployment is that HelloFresh now has much faster insight into goings-on within the businesses, and is also able to delve much deeper into its data in order to uncover insight.

Mr Simaria said: "This technology has allowed us to spread data-driven decision-making to anyone in the organisation, from local teams to global finance to whoever needs to use data insights to make decisions."

UK regulator cautions insurers on big data

24 Nov 2016
Posted By: admin
Categories: #AnalyticsNews

The head of the Financial Conduct Authority (FCA) has reminded insurance providers of the need to be careful in their use of big data to ensure some customers are not unfairly penalised.

Speaking at the Association of British Insurers' annual conference, chief executive of the regulator Andrew Bailey noted the ability to capture and convert information into insight has led to a "revolution" in how businesses approach data. However, he cautioned that there need to be boundaries on how this is used to ensure that the technology serves everyone effectively.

The use of big data can allow insurers to determine premiums for consumers at a much more individual level, rather than pooling them into wider risk groups. This puts more emphasis on adjusting policies based on how an individual behaves. For example, when it comes to car insurance, it can offer discounts to those who can be determined to be safe drivers.

"That strikes me as a good thing," Mr Bailey said. "It prices risk more accurately, and importantly, it should incentivise improved driving as a means to reduce the insurance premium."  

However, the use of this technology does pose risks, and could be used to penalise some customers – not only those determined to be at higher risk.

For example, Mr Bailey noted that big data may also identify and differentiate between customers who are more likely to shop around for the best price and those more likely to remain with the same insurer for years. He suggested this could be used as a justification to provide more 'inert' customers with higher quotes as they are less likely to switch providers.

These customers therefore pay more and end up subsidising cheaper quotes offered to customers who are more likely to shop around, and Mr Bailey suggested this is where the industry needs to draw the line on the use of big data.

“We are … asked to exercise judgment on whether as a society we should or should not allow this type of behaviour. To simplify, our view is that we should not,” he said.

There have already been questions raised recently about the use of big data in the insurance industry and how it affects customers' privacy. For instance, Admiral recently proposed a new service aimed at first-time drivers that would make decisions about their risk level based on what they posted on Facebook – with certain words and phrases being used as signifiers of personality traits that may translate to greater or lesser risk. 

However, this move was blocked by the social network giant as it would have violated the company's terms of service and privacy policies.

The FCA itself also recently completed a study into the use of big data in the sector, which concluded that despite these concerns, the technology is generally performing well, delivering "broadly positive consumer outcomes".

Mr Bailey noted that the full potential of big data in insurance has yet to be explored – particularly in areas such as life insurance, where the use of information such as genetic data could have "potentially profound" implications for the future of the industry.

It will therefore be up to both regulators and the government to determine how to approach issues such as this. He noted: "Understanding the effect and significance for insurance of big data and how it evolves requires a clear framework to disentangle the issues." 

How Tesco is diving into the data lake

23 Nov 2016
Posted By: admin
Categories: #AnalyticsNews

An effective big data analytics solution is now an essential requirement for any large business that wishes to be successful in today's competitive environment, regardless of what sector they are in.

However, one part of the economy that particularly stands to benefit from this technology is retail. These firms have a longstanding tradition of gathering and utilising customer data, so the ability to gain greater insight from the information they already have will play a key role in their decision-making.

One company that has always been at the forefront of this is UK supermarket Tesco. It was noted by Forbes that the company was one of the first brands to track customer activity through the use of its loyalty cards, which allows it to perform activities such as delivering personalised offers.

Now, however, it is turning to technologies such as real-time analytics and the Internet of Things in order to keep up with newer competitors such as Amazon, which is moving into the grocery business.

Vidya Laxman, head of global warehouse and analytics at the supermarket, told the publication: "We are focused on data now and realise that to get where we want to be in five years' time, we have to find out what we will need now and create the right infrastructure."

She added that Tesco is focusing on technologies such as Hadoop, which is central to the 'data lake' model that the company is working towards. This will be a centralised, cloud based repository for all of the company's data, designed to be accessible and useable by any part of the organisation whenever it is needed. 

Ms Laxman explained one challenge for the company has been ensuring that the right data gets to where it needs to go, as different departments often need different information. For example, finance teams need details on sales and forecasts, while the customer side of the business needs data that can be used to inform marketing campaigns.

"We have data scientists in all of our organisations who need access to the data," she said. "That's where Hadoop comes into the picture. We've just started on this journey – we've had data warehousing for some time so there are some legacy systems present and we want to leverage what’s good and see where we can convert to using new strategies."

A key priority for Tesco's activities will be to increase the speed of data processing in order to better support activities such as real-time modelling and forecasting.

Under a traditional way of working, it may take nine or ten months just to ingest the relevant data. Therefore, improving these processes will be essential to the success of big data initiatives.

Another factor helping Tesco is an increasing reliance on open source solutions. Mike Moss, head of forecasting and analytics at Tesco, told Forbes that when he began developing his first forecasting system for the company eight years ago, any use of open source required a lengthy approval process to get it signed off.

"There wasn't the trust there in the software," he said. "It now feels like we're in a very different place than previously … Now we have freedom and all the engineers can use what they need to use, as long as it's reasonable and it makes sense."

NIH highlights use of big data in disease research

21 Nov 2016
Posted By: admin
Image credit: iStockphoto/kentoh
Categories: #AnalyticsNews

The US National Institute of Health (NIH) has highlighted the importance of big data in helping track infectious disease outbreaks and formulating response plans.

In a study published as a supplement in the Journal of Infectious Disease, the body observed that data derived from sources ranging from electronic health records to social media has the potential to provide much more detailed and timely information about outbreaks than traditional surveillance techniques.

Existing methods are typically based on laboratory tests and other data gathered by public health institutions, but these have a range of issues. The NIH noted they are expensive, slow to produce results and do not provide adequate data at a local level to set up effective monitoring.

Big data analytics tools that can process data gathered from internet queries, however, work in real-time and can track disease outbreaks at a much more local level. While the technology does have its own challenges to overcome, such as the potential for biases to emerge, these can be countered by developing a hybrid system that combines big data and traditional surveillance.

Cecile Viboud, PhD, co-editor of the study and a senior scientist at the NIH's Fogarty International Center, said: "The ultimate goal is to be able to forecast the size, peak or trajectory of an outbreak weeks or months in advance in order to better respond to infectious disease threats. Integrating big data in surveillance is a first step toward this long-term goal."

She added that now that proof-of-concepts for the technology have been demonstrated in high-income countries, researchers can examine the impact big data may have in lower-income economies when traditional surveillance is not as widespread.

However, the NIH warned that big data must be handled with caution. For example, organisations must be wary about relying too heavily on data gleaned from non-traditional data streams that may lack key demographic identifiers such as age and sex. They must also recognise and correct for the fact that such sources may underrepresent groups such as infants, children, the elderly and developing countries.  

"Social media outlets may not be stable sources of data, as they can disappear if there is a loss of interest or financing," the body continued. "Most importantly, any novel data stream must be validated against established infectious disease surveillance data and systems."

The NIH's supplement features ten articles that highlight promising examples of how big data analytics is able to transform how disease outbreaks are monitored and responded to.

Experts in computer science, data modelling and epidemiology collaborated to look at the opportunities and challenges associated with three different types of data – medical encounter files, crowdsourced data from volunteers, and information generated by social media, the internet and mobile phones.

Professor Shweta Bansal of Georgetown University, a co-editor of the supplement, stated: "To be able to produce accurate forecasts, we need better observational data that we just don’t have in infectious diseases. There's a magnitude of difference between what we need and what we have, so our hope is that big data will help us fill this gap."
