Posted By : admin 1 Comment
kognitio benchmark tests
At the recent Strata conference in New York we received a lot of interest in the informal benchmarking we have been carrying out that compares Kognitio on Hadoop to some other SQL on Hadoop technologies. We have decided to formalise the benchmarking process by producing a paper detailing our testing and results. In the meantime, we will be releasing intermediate results in this blog. Preliminary results show that Kognitio comes out top on SQL support and that its single-query performance is significantly faster than Impala's. Read on for more details.

It is clear from recent conversations that many organisations have issues using the tools in the standard Hadoop distributions to support enterprise level SQL on data in Hadoop. These difficulties stem from a number of factors, including:

  • SQL maturity – some products cannot handle all the SQL generated by developers and/or third party tools. They either do not support the SQL, or produce very poor query plans
  • Query performance – queries that are supported perform poorly even under single user workload
  • Concurrency – products cannot handle concurrent mixed workload well in terms of performance and give errors when under load

Bearing in mind the types of workload we have been discussing (primarily BI and complex analytics) we decided to initially concentrate on the TPC-DS benchmark. This is a well-respected, widely used query set that is representative of the type of query that seems to be most problematic. The TPC framework is also designed for benchmarking concurrent workloads.

Currently we are testing against Hive, Impala and SparkSQL as delivered in Cloudera 5.7.1 using a 12 node cluster. We will shortly be upgrading our test cluster to the most recent release of Cloudera before running the main benchmarks for the paper. We have also done some initial testing of SparkSQL 2.0 on a small HortonWorks cluster and plan to be including the Cloudera beta of SparkSQL 2.0 in the performance tests.

SQL Maturity

A common theme we’ve heard is that one of the major pain points in Hadoop adoption is the need to migrate existing SQL workloads to work on data in Hadoop. With this in mind we initially looked at the breadth of SQL that each product will execute before moving on to performance. We have categorised each of the 99 TPC-DS queries as follows:

  • Runs “out of the box” (no changes needed)
  • Minor syntax changes – such as removing reserved words or “grammatical” changes
  • Long running – SQL compiles but query doesn’t come back within 1 hour
  • Syntax not currently supported

If a query requires major changes to run, it is considered not supported (see the TPC-DS documentation).
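The categorisation rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described in this post, not code from our toolkit: the function name, its parameters and the order in which the rules are checked are all assumptions made for the example.

```python
from enum import Enum


class Category(Enum):
    OUT_OF_THE_BOX = "Out of the Box"
    MINOR_CHANGES = "Minor Changes"
    LONG_RUNNING = "Long Running"
    NOT_SUPPORTED = "Not Supported"


TIMEOUT_SECONDS = 3600  # the 1 hour cut-off described above


def categorise(compiles, needed_minor_changes, needed_major_changes, elapsed_seconds):
    """Map one query outcome to a category.

    elapsed_seconds is None when the query never came back.
    The precedence of the checks is an assumption for this sketch.
    """
    # Major rewrites, or SQL that does not compile, count as not supported.
    if needed_major_changes or not compiles:
        return Category.NOT_SUPPORTED
    # Compiles, but does not return within the hour.
    if elapsed_seconds is None or elapsed_seconds > TIMEOUT_SECONDS:
        return Category.LONG_RUNNING
    if needed_minor_changes:
        return Category.MINOR_CHANGES
    return Category.OUT_OF_THE_BOX
```

For example, `categorise(True, True, False, 42.0)` gives `Category.MINOR_CHANGES`, while a query that compiles but is still running after an hour falls into `Category.LONG_RUNNING`.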

Technology         | Out of the Box | Minor Changes | Long Running | Not Supported
Kognitio on Hadoop | 76             | 23            | 0            | 0
Hive (note 1)      | 30             | 8             | 6            | 55
Impala             | 55             | 18            | 2            | 24
Spark 1.6          | 39             | 12            | 3            | 43
Spark 2.0 (note 2) | 72             | 25            | 1            | 1

The above table shows that many products have a long way to go and the step change in SQL supported in Spark 2.0 (from 1.6) shows the developers have recognised this. Kognitio and other technologies that are making the move from the analytical DWH space are at a distinct advantage here as they already possess the mature SQL capability required for enterprise level support.
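To make the comparison concrete, the short Python sketch below works out the share of queries each product can run (out of the box plus minor changes) from the figures in the table. Two things are assumptions: the blank cells for Kognitio are taken to be zero, and each share is divided by the row's own total rather than by 99, because the Spark 1.6 row as published adds up to 97.

```python
# Counts per technology, in the table's column order:
# (out of the box, minor changes, long running, not supported).
# Kognitio's last two values are assumed to be zero (blank in the table).
RESULTS = {
    "Kognitio on Hadoop": (76, 23, 0, 0),
    "Hive": (30, 8, 6, 55),
    "Impala": (55, 18, 2, 24),
    "Spark 1.6": (39, 12, 3, 43),
    "Spark 2.0": (72, 25, 1, 1),
}


def runnable_share(counts):
    """Fraction of categorised queries that run as-is or with minor changes."""
    out_of_box, minor, long_running, unsupported = counts
    total = out_of_box + minor + long_running + unsupported
    return (out_of_box + minor) / total


for technology, counts in RESULTS.items():
    print(f"{technology}: {runnable_share(counts):.0%} runnable")
```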

Query Performance

The results shown right are for a single stream executing over 1TB of data, but our goal is to look at the concurrent mixed workloads typically found in enterprise applications.

As well as supporting all 99 queries (23 with small syntax changes), Kognitio performs very well against Impala in the initial results for a single query stream. Kognitio runs 89 of the 99 queries in under a minute, whereas only 58 queries run in under a minute on Impala. However, we recognise the real test comes in increasing the number of streams, so watch this space as we increase concurrency and add Hive and Spark timings too.

[Figure: SQL on Hadoop benchmark tests]

A bit about how we run the tests

We’ve developed a benchmarking toolkit based around the TPC framework which can be used to easily test concurrent query sets across technologies on Hadoop platforms. We designed this modular toolkit to allow testers to develop their own benchmark tests and are planning to make the toolkit available on GitHub in the coming weeks once we have finished some “How to Use” documentation.
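Until the toolkit is published, the general idea of running concurrent query streams can be sketched as below. This is a minimal illustration in Python and not the toolkit itself: `run_stream`, `run_benchmark` and the `execute` callable (standing in for whatever submits SQL to the product under test) are hypothetical names chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_stream(stream_id, queries, execute):
    """Run one stream's queries in order.

    queries is a list of (name, sql) pairs; execute is a caller-supplied
    function that submits one SQL string and blocks until it finishes.
    Returns a list of (stream_id, query_name, elapsed_seconds) tuples.
    """
    timings = []
    for name, sql in queries:
        start = time.perf_counter()
        execute(sql)
        timings.append((stream_id, name, time.perf_counter() - start))
    return timings


def run_benchmark(streams, execute):
    """Run every stream concurrently, one worker thread per stream.

    streams must contain at least one stream.
    """
    with ThreadPoolExecutor(max_workers=len(streams)) as pool:
        futures = [
            pool.submit(run_stream, stream_id, queries, execute)
            for stream_id, queries in enumerate(streams)
        ]
        # Collect results stream by stream, in submission order.
        return [row for future in futures for row in future.result()]
```

A single-stream test is simply `run_benchmark([queries], execute)`; adding more query lists to `streams` increases the concurrency.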

In progress and to come

As I write this, we are still looking at a few of the interim results presented here:

1. We still need to complete the syntax changes for Hive, so these figures may change in the final paper.

2. The single query that is not supported by Spark 2.0 did execute, but a Cartesian join was used, leading to incorrect results.

We are planning to move on to full concurrent workloads in the next week and will publish these and the toolkit soon.

Pushing past pilot stage ‘key challenge’ for big data projects


111016 - Image credit: iStockphoto/ipopba
Categories :#AnalyticsNews

Many companies that are embarking on big data analytics projects will find themselves struggling to move beyond the pilot stage and roll out their initiatives on a wider scale, a new survey has found.

Research by Gartner identified this as one of the key challenges facing businesses when they undertake big data projects. It revealed that although nearly three-quarters of organisations are planning to invest in this technology or have already done so, just 15 per cent say they have deployed their big data projects at full production scale.

This is almost unchanged from when the same question was asked last year, when 14 per cent of firms stated they had achieved this.

Nick Heudecker, research director at Gartner, suggested one reason for this may be that big data initiatives are having to compete with other IT investments and are often treated as a lower priority.

Indeed, just 11 per cent of IT professionals at organisations that have invested in big data believed these initiatives were as important, or more important, than other projects, while nearly half (46 per cent) said they were less important.

"This could be due to the fact that many big data projects don't have a tangible return on investment that can be determined upfront," Mr Heudecker said. He added: "Another reason could be that the big data initiative is a part of a larger funded initiative. This will become more common as the term 'big data' fades away, and dealing with larger datasets and multiple data types continues to be the norm."

Another issue that can make it difficult for companies to move beyond the pilot stage is that they do not have effective business leadership and involvement in these projects. What's more, many trial initiatives are developed using ad-hoc technologies and infrastructure that do not have the scalability or reliability needed for production-level deployment.

Overall, Gartner's survey revealed investments in big data continue to rise, with 48 per cent of organisations investing in the technology in 2016. This marked an increase of three percentage points from a year earlier.

However, the number of companies planning new investments dropped from 31 per cent in 2015 to 26 per cent in this year's survey.

Mr Heudecker said these signs of slowing growth may be an indicator that companies are rethinking how they look at big data analytics and integrate it into their operations.

"The big issue is not so much big data itself, but rather how it is used," he said. "While organisations have understood that big data is not just about a specific technology, they need to avoid thinking about big data as a separate effort."

One trend is that businesses are no longer thinking about the technology in vague terms, but are looking at specific problems and opportunities that it can address. As such, the success of such initiatives will depend on a "holistic strategy around business outcomes, skilled personnel, data and infrastructure", Mr Heudecker continued.

Logistics sector sees high demand for big data


300916 - Image credit: iStockphoto/Maxiphoto
Categories :#AnalyticsNews

Demand for big data analytics solutions in the logistics and supply chain sector is growing rapidly, with almost all firms now recognising the need for these solutions.

This is according to a new study conducted by Capgemini Consulting, Penn State University and Penske Logistics, which revealed 98 per cent of third-party logistics companies agreed that data-driven decision-making will be essential to the future success of supply chain processes. This view was shared by 93 per cent of shippers.

Both of these groups also stated that being able to use big data analytics effectively will become a core competency for their supply chain organisations in the coming years. Some 81 per cent of shipping firms and 86 per cent of third-party logistics and outsourcing companies agreed with this.

Tom McKenna, senior vice-president of engineering and technology at Penske Logistics, said: "Data-driven decision-making is certainly an increasing trend in the supply chain."

However, he added: "Among the biggest challenges that come with increased visibility and more data is determining how to best use that information to drive improvements that benefit the customer."

Shipping firms that can successfully turn this data into useful insight stand to gain a significant competitive advantage over their less well-equipped peers, the study said.

Six out of ten shipping companies (60 per cent) said improving integration in their supply chain was a key area where big data is expected to boost performance. Meanwhile, 55 per cent said the technology would help them improve the quality of their data, and 53 per cent added it could improve the performance and quality of their processes.

For third-party logistics firms, the benefits were slightly different. More than seven out of ten of these firms (71 per cent) said the greatest value data provides comes from improving process quality and performance, while 70 per cent cited improving logistics optimisation as among its most important benefits, and 53 per cent named improving integration across the supply chain.

Big data is also expected to be highly useful in tackling the challenges created by issues such as a tightening of trucking capacity. Nearly three-quarters of shippers (71 per cent) said big data analytics from third-party firms helps them to better understand alternative shipping possibilities, while 61 per cent said they valued data about trade routes and costs that their partners could provide.

"Fluctuating capacity, increased shipper demands and disruptions within the industry are creating a volatile decision-making environment for shippers and logistics providers trying to optimise the supply chain," the study noted. "Both parties are increasingly using information and analytics to drive their decisions."

However, the report did highlight a difference in opinion between shipping companies and third-party providers when it comes to the benefit of big data. While 79 per cent of shippers said their supply chain organisation sees significant value in the use of big data, this compares with 65 per cent of outsourcers who reported that their customers see such value.

IoT projects moving beyond pilot stage, study finds


280916 - Image credit: iStockphoto/kentoh
Categories :#AnalyticsNews

More businesses are moving beyond small-scale pilot schemes when it comes to the Internet of Things (IoT), towards full-scale deployments that incorporate big data analytics, cloud computing and security capabilities.

This is according to new research from International Data Corporation (IDC), which revealed that almost a third of firms (31.4 per cent) have already launched solutions that take advantage of this technology, while a further 43 per cent expect to deploy these tools in the next 12 months.

More than half of respondents (55 per cent) also agreed that the technology will be an essential strategic solution that helps their business to compete more effectively. Key benefits of IoT solutions include better productivity, lower costs and the automation of internal processes.

Carrie MacGillivray, vice-president for Mobility and Internet of Things at IDC, noted that vendors that can offer an integrated cloud and data analytics solution will be seen as vital partners when organisations are investing in IoT.

Given the huge volume and variety of data that IoT deployments are expected to create in the coming years, being able to effectively analyse this and derive insight in a timely manner is essential. Therefore, having strong analytics tools is an essential part of a good IoT project.

However, this means having people with the right skills and knowledge to make the most of this – and this is something that many businesses are currently lacking.

IDC's research found that a lack of internal skills in this area is a challenge that is hindering many initiatives. This was named as one of the top worries facing decision makers along with privacy/security issues and the costs of implementing IoT solutions.

The company also found that as the benefits of IoT become clearer, the technology is more likely to be embraced by both IT departments and business units.

Vernon Turner, senior vice-president of Enterprise Systems and IDC Fellow for the Internet of Things, commented: "Setting strategies, finding budgets, and supporting IoT solutions have contributed to an ongoing tussle between line of business executives (LOBs) and CIOs. However, that race may be over, because in many cases LOBs are now both leading the discussions and either paying in full or sharing the costs of IoT initiatives with the CIOs."

Customers ‘unaware’ of how firms use their data


280916 - Image credit: iStockphoto/BernardaSv
Categories :#AnalyticsNews

The vast majority of customers have no idea how businesses are using the personal data they possess, while many also do not trust companies to be responsible in their use of this information.

This is according to new research from the Chartered Institute of Marketing (CIM), which revealed nine out of ten customers are in the dark about how their personal data is used. Meanwhile, 57 per cent did not trust companies to take good care of their data, while 51 per cent complained they had been contacted by organisations that misuse their data.

For many businesses, gathering more detailed and personal information on their customers is a vital part of their big data analytics strategy, as it allows them to tailor their offerings and marketing messages more precisely. This provides a better experience for customers and helps boost businesses' revenue.

Overall, people are happy for certain data to be used by companies, provided they understand what this will involve. More than two-thirds of consumers (67 per cent) stated they would share more personal information if organisations were more transparent about their plan for it.

Chief executive of the CIM Chris Daly commented: "The solution is clear, marketers need to brush up on the rules, demonstrate clearly the value-add personal data offers in delivering a more personalised experience and ultimately reduce the fear by being open throughout the process."

However, businesses will have to think carefully about how they achieve this transparency. Many may believe that simply updating terms and conditions to explain what they do with data will be adequate, but the CIM's research suggests this may not be good enough to satisfy many consumers.

Only 16 per cent of respondents stated they always read terms and conditions, with many put off by the lengthy and often confusing documents. Therefore, businesses will need to find simpler, clearer ways of communicating to their customers about how they use data.

The research also found there is a mismatch between consumers and businesses when it comes to what data they view as acceptable to share. For instance, more than seven out of ten consumers (71 per cent) stated they were not comfortable with businesses tracking their location via their smartphones. However, one in five businesses are already using geolocation tools to do this.

More than two-thirds (68 per cent) of consumers also expressed reservations about providing information from their social media platforms – something 44 per cent of firms use in their marketing analytics.

One of the biggest concerns that users have is that their information may fall into the wrong hands – either as a result of being sold on to third parties or compromised in a data breach.

"People are nervous about sharing personal data – fears of data breaches and misuse has them on high alert," Mr Daly said.

There has been a series of high-profile incidents in recent months and years that have compromised the personal details of consumers, the most recent of which came when Yahoo! admitted the information of up to 500 million users was stolen in 2014.

Therefore, businesses will have to address these fears if they are to get the necessary buy-in from customers to make personalised, big data-based marketing an effective solution.

FCA approves use of big data for insurers


230916 - Image credit: iStockphoto/tonefotografia
Categories :#AnalyticsNews

The UK's Financial Conduct Authority (FCA) has signalled its approval of the use of big data in the insurance industry, after it revealed it was dropping plans to launch a full inquiry into how the technology is used in the sector.

The regulator announced a review into the sector last year, stating it was aiming to better understand how the use of big data analytics in areas such as calculating premiums impacted consumers, before deciding on the next steps.

It has now stated that in light of the "broadly positive consumer outcomes" that big data can deliver to the sector, it will not be proceeding with a full market study at the present time, which effectively gives the green light to insurers to implement big data into their decision-making.

However, the regulator did add a note of caution, observing that there are some areas where big data has the potential to harm consumers. Specifically, it noted that as big data changes the extent of risk segmentation, this may lead to categories of customers finding it harder to obtain insurance. 

The FCA also raised concerns about the potential that big data might make it easier for firms to identify opportunities to charge certain customers more.

Director of strategy and competition at the FCA Christopher Woolard noted that as the general insurance sector is a vital part of the economy, affecting millions of consumers, it is essential that it works well.

"There is potential for big data to transform practices across general insurance markets, and some consumers are already seeing benefits but there are also some risks to consumer outcomes," he said.

"While we have decided not to launch a full market study, we are undertaking further work in this area and with the Information Commissioner's Office (ICO) to ensure our rules and policies keep pace with developments in the market, but also do not prevent positive innovations."

The FCA's Call for Input found that although big data is able to improve general consumer outcomes, it can also affect how companies determine their pricing. It suggested that as insurers gather increasing amounts of data from a wider range of sources, and apply sophisticated analytical tools to this, it may lead to the use of reasons other than risk and cost in pricing becoming more common throughout the industry.

While it recognised the potential that consumers who are deemed to be at higher risk may be denied coverage, the review has not shown any signs that this is occurring. "However, the FCA will remain alert to the potential exclusion of higher risk customers and will engage with government if concerns begin to develop because of how firms are using big data," the regulator stated.

The FCA also reminded insurers of their responsibilities to consumers when it comes to ensuring their use of data is in line with security regulations and legislation such as the Data Protection Act. The FCA will be co-hosting a roundtable with the ICO later this year on how data should be used in the general insurance sector.

New programming language ‘boosts big data speeds fourfold’


230916 - Image credit: iStockphoto/cifotart
Categories :#AnalyticsNews

A new programming language developed by researchers at the Massachusetts Institute of Technology (MIT) is claiming to be able to increase the speed of big data processing by up to four times.

Called Milk, the language allows application developers to manage memory more efficiently in programs that deal with scattered data points in large data sets. In tests conducted using several common algorithms, programs written in the new language were shown to be four times as fast as those written in existing languages, but the researchers behind the language believe that further work will result in even larger gains.

Milk is intended to solve one of the biggest barriers to successful implementation of big data analytics processes – how efficiently programs gather the relevant data.

MIT explained that traditional memory management is based on the 'principle of locality' – that is, if a program requires a certain piece of data stored in a specific location, it is also likely to need the neighbouring data, so it will fetch this at the same time.

However, this assumption no longer applies in the era of big data, where programs frequently require scattered chunks of data that are stored arbitrarily across huge data sets. Since fetching data from their main memory banks is the major performance bottleneck in today’s computer chips, having to do this more frequently can lead to major performance issues.

Vladimir Kiriansky, a PhD student in electrical engineering and computer science and first author on the new paper, explained that returning to the main memory bank for each piece of information is highly inefficient. He said: "It's as if every time you want a spoonful of cereal, you open the fridge, open the milk carton, pour a spoonful of milk, close the carton, and put it back in the fridge."

The new programming language aims to overcome this limitation through the use of batch processing, by adding a few commands to OpenMP, an extension of languages such as C and Fortran that makes it easier to write code for multicore processors. 

When using the language, a programmer then adds a few lines of code around any instruction that iterates through a large data collection looking for a comparatively small number of items. Milk’s compiler then figures out how to manage memory accordingly.

Using Milk, when a core needs a piece of data, instead of requesting it – and any adjacent data – from main memory, it adds the data item's address to a locally stored list of addresses. When this list is long enough, all the chip's cores pool their lists, group together those addresses that are near each other and redistribute them to the cores. This means that each core requests only data items that it knows it needs and that can be retrieved efficiently.
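The pooling-and-grouping step described above can be illustrated with a toy model. The sketch below is written in Python, not in Milk or OpenMP, and it is not the researchers' algorithm: the `BLOCK_SIZE` used to decide which addresses count as "near each other", the function name and the round-robin redistribution are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical notion of locality: addresses that fall in the same
# 64-byte block are treated as "near each other".
BLOCK_SIZE = 64


def pool_and_redistribute(per_core_addresses, num_cores):
    """Pool each core's deferred addresses, group them by block,
    then hand whole blocks back out to the cores round-robin."""
    blocks = defaultdict(list)
    for addresses in per_core_addresses:
        for address in addresses:
            blocks[address // BLOCK_SIZE].append(address)

    assignments = [[] for _ in range(num_cores)]
    for index, block in enumerate(sorted(blocks)):
        # Every address in one block goes to the same core, so that
        # core can fetch neighbouring items together.
        assignments[index % num_cores].extend(sorted(blocks[block]))
    return assignments
```

With two cores holding the lists `[0, 130, 5]` and `[64, 3, 200]`, the addresses 0, 3 and 5 share a block and end up on the same core, rather than being fetched separately by the cores that originally asked for them.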

Matei Zaharia, an assistant professor of computer science at Stanford University, noted that although many of today's applications are highly data-intensive, the gap in performance between memory and CPU means they are not able to fully utilise current hardware.

"Milk helps to address this gap by optimising memory access in common programming constructs. The work combines detailed knowledge about the design of memory controllers with knowledge about compilers to implement good optimisations for current hardware," he added.

Hadoop and NoSQL drive big data boom


200916 - Image credit: iStockphoto/bakhtiar_zein
Categories :#AnalyticsNews

Investments in technologies such as Hadoop and NoSQL will underpin much of the growth in the big data analytics market in the coming years, with non-relational solutions set to increase at around twice the rate of the sector as a whole.

This is according to a new report from Forrester Research, which found that over the next five years, big data will grow at a rate of around 13 per cent a year. However, NoSQL is set for a compound annual growth rate of 25 per cent over the period between 2016 and 2021, while the projected figure for Hadoop is 32.9 per cent.
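To see what these annual rates mean over the whole period, the small Python sketch below compounds each one. It assumes five full years of compounding at a constant rate, which is a simplification for illustration and not a figure taken from the Forrester report.

```python
def compound_growth(annual_rate, years):
    """Total growth multiple after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years


# Annual rates quoted above; the five-year horizon is an assumption.
for label, rate in [("Big data overall", 0.13), ("NoSQL", 0.25), ("Hadoop", 0.329)]:
    print(f"{label}: x{compound_growth(rate, 5):.2f} over five years")
```

On these assumptions the overall market would be roughly 1.8 times its starting size after five years, NoSQL roughly three times and Hadoop a little over four times.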

It also noted that this year, some 30 per cent of organisations have implemented Hadoop, up from 26 per cent in 2015. Meanwhile, 41 per cent of professionals stated they had already implemented NoSQL or were expanding its use.

A further 20 per cent expect to undertake a NoSQL deployment in the next year. The report observed this is proving to be particularly useful for applications such as ecommerce and graph data databases, while open source options can help companies reduce the cost of their big data initiatives.

Analyst at Forrester and author of the report Jennifer Adams said: "Five years ago, big data was still a buzzword, but today, it's a standard business practice."

She added that one of the key reasons for Hadoop's robust forecast is its ability to run data-intensive applications that legacy solutions would be unable to handle. For example, the report highlighted Arizona State University's Hortonworks cluster, which is used to store four petabytes of cancer genome data, as one scenario where Hadoop is breaking down barriers to research.

The need to manage huge volumes of data is also a key driver for NoSQL implementations, Ms Adams stated. For instance, eBay uses a MongoDB deployment that is able to store up to one billion live listings, while PayPal uses Couchbase to handle databases of one billion documents.

Cloud computing is another area set for growth, and is a major factor driving interest in Hadoop. The report noted: "Hadoop in the cloud allows the analysis of more data using cheaper infrastructure and enables faster advanced analytics."

As a result, the number of organisations using a cloud-based service to store unstructured data has increased from 29 per cent in 2015 to 35 per cent this year.

Elsewhere, a separate report from Forrester has identified 15 emerging technologies that are set to have a huge impact on the world over the next five years, and it is clear that being able to effectively analyse data will be critical if businesses are to take full advantage of many of them.

For example, the Internet of Things was named by Forrester as one of the top five innovations that will change the world by 2021. It noted this will drive "new levels of customer insight and engagement" by the end of the forecast period, but some firms will need to undergo organisational change in order to make the most of this.

Augmented and virtual reality, intelligent agents, artificial intelligence, and hybrid wireless were the other technologies named in the top five by the research firm.

How big data helps keep trains running on time


150916 - Image credit: iStockphoto/ipopba
Categories :#AnalyticsNews

Siemens – the largest engineering company in Europe – is trying to improve punctuality on Germany’s rail network by utilising big data analysis and predictive maintenance.

The most recent data from the European Commission reveals that just over three-quarters (78.3 per cent) of the country’s long distance trains arrived on time. The report, published in 2014, found that just two other countries racked up worse averages, with Portugal and Lithuania coming in behind.

But it is hoped that the introduction of big data analytics will change this, the Financial Times reports. A group of Siemens employees have fitted hundreds of sensors to the trains, which relay data back to the engineers about the condition of the locomotive’s parts.

By combining two industrial disciplines – big data and predictive maintenance – the firm is able to find out what needs to be repaired or replaced before any delays are caused. This means that punctuality could be pumped back into Germany’s rail network, helping it climb the ranks and keep commuters happy.

So far, this experiment has seen all but one of the 2,300 journeys monitored by Siemens in Spain arrive less than five minutes late. This has pushed the punctuality rate up from 89.9 per cent to 99.98 per cent, beating leader Finland’s score of 95.4 per cent.

This is not a new idea for Siemens, as it realised back in 2014 that the Internet of Things could help it provide customers with more than just hardware, as it paired together sensors and connected devices.

From there, the company decided to turn its train manufacturing site in Allach, outside Munich, into a digital hub. Here a group of specialists analyse the data that is generated by the sensors on the trains being monitored by the firm.

The group is looking out for patterns or anomalies that could point to an issue onboard one of its fleet. If something does need to be replaced, the specialists make sure it is done during regular maintenance checks, rather than cause a disruption to regular services.  

Renfe – the Spanish rail network that has partnered with Siemens – is so confident that the system works that it is offering commuters a refund if the service between Madrid and Barcelona is late by more than 15 minutes.

Gerhard Kress, head of Siemens' Mobility Data Services Centre, believes that the most important thing for rail networks is to avoid breakdowns, as just one can have a ripple effect and cause several services to then be delayed.

Mr Kress believes that his team has got the know-how to make big data work hard to keep the trains running on time. He added: “We are essentially building on the know-how that Siemens has developed over the years for other types of applications, namely in healthcare and gas turbine operations.

“We have also massively invested in building our team. All our scientists not only have PhDs in data science, machine learning or mathematics, but also a background in mechanical engineering.”

Shop Direct highlights big data’s impact as profits rise


13/09/16 - Image credit: iStockphoto/emyerson
Categories :#AnalyticsNews

Online retailer Shop Direct has highlighted its investments in big data and machine learning as among the reasons for its recent success, as it unveiled revenue of £1.86 billion for the first full year since the implementation of these technologies.

Computing magazine reports that the company, which runs brands including Littlewoods.com and Very.co.uk, reported profits of £150.4 million for the period, an increase of 43 per cent year-on-year.

Chief executive of the group Alex Baldock said this success has been down to a greater focus on new technology, and this is something that the firm will continue with in the coming years.

"This was the year our investments in technology really started to pay off," he said. "We've made big strides in m-commerce, big data and personalisation. But there's a lot more to play for in these areas."

For instance, he highlighted artificial intelligence as a solution that can "change the game" for the company when it comes to how it uses data. Mr Baldock said that Shop Direct has already begun to deploy this technology and is serious about taking it much further.

In its report, Shop Direct highlighted how it uses machine learning technology to improve its offerings, such as delivering personalised recommendations to its customers based on their buying habits.

"Driven by machine learning, the group is now personalising more customer touchpoints than ever, from customer emails and off-site advertising to homepage content, on-site sort orders, top navigation menus and product recommendations deeper within the shopping journey," the company stated.

It has also begun trialling more personalised services using this data, in order to build "deeper relationships" with its customers.

Shop Direct has made big data analytics a key part of its business since it abandoned its traditional print catalogue-based business in January 2015 in favour of a completely digital offering.

Earlier this year, chief executive of financial services at the company Neil Chandler explained to Computing how it has spent six years transforming its offering from a catalogue firm into a "world-class" leader in digital retail.

Its efforts include a personalised sort order tool, which compiles a list of suggested products based on a user's history and ensures these appear at the top of the user's search results page.

This is something that's particularly important as more of the firm's customers switch to mobile devices, where space is at a premium. Mr Chandler explained: "On mobile, people aren't going to keep swiping down if they are looking for a black dress and there are 100 to choose from – they'll probably see nine at best.

"So the aim is to work out how analytics can help to curate and show the best nine black dresses for the customer that are in stock, that match the fashion preferences and are in the right price range."

Tools such as this will be hugely valuable in the coming years, as Shop Direct's results indicate there has been a 46 per cent increase in the number of orders placed via smartphones in the last year, while the company's apps have been downloaded over a million times across Android and iOS.