Apache Hadoop is an open source software offering that supports the storage and processing of large data sets across a number of computers. My DZone article describes how a number of Hadoop distributions sprang up to simplify the process of deploying and managing Hadoop, but how do you decide which distribution is right for you?
First of all, you need to decide on your requirements.
You can expect all distributions to have support and training options available, and to have tools that assist with deployment and management of your clusters.
The differences tend to lie in the areas covered below.
Historically, Hadoop deployments were large in-house clusters of computers, but increasingly people are looking to leverage cloud options in conjunction with, or as a replacement for, those in-house clusters.
With cloud deployment, you have the option to bring up clusters on demand, or flex the size of an existing cluster if your software can cope with that. You also have the potential for significant cost savings if you don’t need to have your largest possible cluster up and running 24×7. However, you might have some business restrictions on being able to put sensitive data into the cloud.
You need to determine whether you want to be on-premise, in the cloud, or in a hybrid of the two.
All distributions build on top of Apache Hadoop to a greater or lesser extent, but some are committed to only distributing open source software, whereas others will include some closed source software as well.
This can be relevant if, for example, your company has a policy of only using open source software for this sort of infrastructure, or if your internal teams strongly prefer open source.
Clusters originally used MapReduce to process data. However, a number of alternative processing engines are now in use, such as Tez and Spark.
You must determine which data processing approach makes sense for you. Bear in mind that Spark, for example, is an in-memory processing engine, so it won't scale to the same data volumes as MapReduce, but you can expect it to be significantly faster.
Of course, you can combine different approaches for data processing within one cluster, so this is not a one-size-fits-all issue.
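To make the contrast concrete, here is a minimal PySpark sketch of the classic word count, the job most often used to illustrate MapReduce, executed in memory by Spark. The input and output paths are illustrative assumptions, and a working Spark installation is assumed.

```python
from pyspark.sql import SparkSession

# A minimal sketch: the classic word count, run in memory by Spark.
# Input and output paths are illustrative assumptions.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/input.txt")
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
counts.saveAsTextFile("hdfs:///data/word-counts")
spark.stop()
```

The same logic could be expressed as a classic MapReduce job in Java; the Spark version simply keeps intermediate results in memory rather than writing them to disk between stages.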
How well does the distribution interact with other components of your IT infrastructure?
Will you still need to move data between your Hadoop cluster and existing database systems, or can you use SQL-on-Hadoop solutions such as Hive, Impala or Kognitio to process the data without migrating it?
If you adopt an SQL-on-Hadoop solution, can it provide the functionality and performance your end users expect when used with front-end tools like Tableau, MicroStrategy and Power BI? You probably want to look at SQL-on-Hadoop benchmarks, and ideally do your own benchmarking.
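As a starting point for your own benchmarking, the sketch below times a single query against a HiveServer2 endpoint from Python. It assumes the PyHive package is installed; the host, username, and table name are illustrative assumptions.

```python
import time
from pyhive import hive  # assumes the PyHive package is installed

# Illustrative connection details; substitute your own HiveServer2 endpoint.
conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

start = time.time()
cursor.execute("SELECT COUNT(*) FROM sales")  # 'sales' is an illustrative table
(count,) = cursor.fetchone()
print(f"Row count: {count}, elapsed: {time.time() - start:.2f}s")

cursor.close()
conn.close()
```

A real benchmark would of course run a representative mix of queries, ideally the ones your BI tools actually generate, rather than a single count.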
Hadoop clusters can use HDFS to store data across a set of nodes, which gives good resilience against node and disk failure. However, companies often now hold their data elsewhere, for example in Amazon S3 buckets. You need to decide which data location makes sense for you.
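If your data already lives in S3, it may be easier to process it in place than to copy it into HDFS first. As a minimal sketch, the snippet below lists a bucket prefix with boto3 to see what is there; the bucket and prefix names are illustrative assumptions, and AWS credentials are assumed to be configured.

```python
import boto3

# Illustrative bucket and prefix; list what is already sitting in S3.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/events/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```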
The Hortonworks Data Platform (HDP) is completely open source, and Hortonworks does not develop any proprietary components, although it does partner with companies that have non-open-source products.
IBM, Intel, and EMC/Pivotal all abandoned their own Hadoop distributions and backed Hortonworks, as reported by Gartner here.
There are a number of enterprise support options available for Hortonworks products, which you can find here.
Cloudera Distribution including Apache Hadoop (CDH) is Cloudera's distribution. It is predominantly based on open source software, and Cloudera has even migrated some of its proprietary software (e.g. Impala) to become established Apache open source projects. However, CDH still includes some proprietary software, such as Cloudera Manager, which helps with installation and management.
Cloudera provides a free Cloudera Express edition, although some of the components it contains have limited functionality (e.g. Cloudera Manager lacks features related to security, backup and disaster recovery). Other Cloudera editions are sold as a per-year, per-node subscription, with details of what each edition provides available here.
Like Hortonworks and Cloudera, MapR provides a distribution that you can use on-premise.
It has its own closed source file system, MapR-FS, which is compatible with HDFS but does not have a NameNode architecture, relying instead on a more distributed approach.
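Because MapR-FS presents an HDFS-compatible API, existing Hadoop tooling should work against it largely unchanged. As a minimal illustration, the sketch below shells out to the standard hadoop fs command with a maprfs:// path; the path itself is an illustrative assumption.

```python
import subprocess

# Because MapR-FS speaks the HDFS API, the ordinary Hadoop shell works
# against it. The maprfs:// path below is an illustrative assumption.
subprocess.run(["hadoop", "fs", "-ls", "maprfs:///user/analyst/"], check=True)
```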
It also offers MapR-DB as an alternative NoSQL database to the open source HBase shipped with other distributions.
MapR provides a free distribution, as well as a paid-for enterprise version, with more details on what is available here.
MapR has a good reputation for performance relative to other distributions.
Amazon Elastic MapReduce (EMR) is a cloud-only solution that runs in the Amazon cloud (AWS).
As you’d expect, it has good integration with Amazon S3 storage.
You pay for what you use, again as you would expect. Spot pricing is also available, which can be very cost-effective if it suits your use case.
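As a rough sketch of what on-demand, pay-as-you-go clusters look like in practice, the following boto3 snippet launches a transient EMR cluster whose core nodes use spot instances. The cluster name, release label, region, instance types, bid price, and log bucket are all illustrative assumptions, and the default EMR IAM roles are assumed to already exist in the account.

```python
import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # illustrative region

response = emr.run_job_flow(
    Name="transient-analytics-cluster",   # illustrative name
    ReleaseLabel="emr-5.30.0",            # illustrative EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://my-emr-logs/",           # illustrative log bucket
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "Market": "ON_DEMAND",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "core-spot",
                "Market": "SPOT",          # spot pricing for worker nodes
                "BidPrice": "0.10",        # illustrative bid, in USD
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 4,
            },
        ],
        # Terminate the cluster automatically once its steps finish,
        # so you only pay while work is running.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])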
More information is available on EMR in general here, and current pricing is here.
Be aware that EMR differs in some ways you might not expect compared to an on-premise solution: for example, Hive is available, but Hive LLAP is not supported, as highlighted here.