A Beginner’s Guide to Hadoop®

What is Hadoop?

Apache Hadoop® is an open source, Java-based data analytics framework for storing and processing extremely large data sets.

It runs across hundreds or thousands of computing nodes (servers) working in parallel, using a distributed file system known as HDFS (Hadoop Distributed File System). Because HDFS keeps copies of data on more than one node, the failure of a single disk or node does not lead to a loss of data.

Hadoop technology enables businesses to store massive data sets and provides the processing power to run many concurrent jobs that sort through the data quickly. It emerged as a solution to the problem of processing big data.

Hadoop is an Apache project sponsored by the Apache Software Foundation.

What is Big Data?

Big Data is data characterized by the ‘Three Vs’:

  • Volume – There’s a large amount of data, more than traditional methods can manage.
  • Velocity – The data is updated and grows very quickly.
  • Variety – There are many different types of data, which need to be processed and sorted.

In more detail, big data is a massive collection of data, generated in quantities of multiple terabytes or more. It can’t be processed by traditional methods because it is so vast and because much of it is generated in unstructured or semi-structured form.

Hadoop technology allows you to store and process large data volumes that otherwise would be cost prohibitive.

What are the best books for Hadoop?

There are many Hadoop books available, ranging from guides for beginners to titles for more experienced Hadoop users.

  • Hadoop: The Definitive Guide: A step by step guide to building an Apache Hadoop distributed file system. This explains how to set up Hadoop clusters and analyze data.
  • Hadoop in Action: This book explains Hadoop’s functionality and the theory behind it. It teaches Hadoop and MapReduce to help analyze big data.
  • Hadoop in Practice: This is a big book with over a hundred different techniques, tutorials, and best practices for Hadoop and big data analysis. This one is not for beginners but rather those with some experience and knowledge of Hadoop and MapReduce.
  • Hadoop in 24 Hours: This is a great book to start with and includes practical exercises that become increasingly difficult as you progress through the book. It teaches the Hadoop environment, HDFS and MapReduce.

What are the best places to learn Hadoop online?

The best way to learn Hadoop is through the many Hadoop courses online. Some are paid-for and some are free.

How do I get Hadoop?

The Hadoop framework is open-source, meaning it can be downloaded for free from Apache. However, installation is not simple and can be time-intensive, and even seasoned computer engineers may struggle to install it correctly. It is simpler to obtain it from one of the main Hadoop distributors, which are:

  • Cloudera – Cloudera builds its data platform on the Apache Hadoop framework. They have a free option, but their paid version gives users access to a variety of proprietary Hadoop tools (which are not open source) and support with managing Hadoop.
  • Hortonworks – Hortonworks uses the Apache Hadoop framework and has added some tools, but it doesn’t charge anything for access to the platform and its supporting tools. It does charge for access to consultants and for support in managing Hadoop.
  • MapR – MapR differs in that it offers a slightly different version of Hadoop from the Apache framework. MapR’s version replaces HDFS with its own file system and adds many enterprise-level features that aren’t available elsewhere.

What are the benefits of Hadoop?

  • Hadoop is scalable – Hadoop can store and process extremely large sets (many thousands of terabytes) of data across many servers that operate in parallel.
  • Hadoop is cost-effective – Hadoop is a cost-effective storage solution. Traditional relational database management systems (RDBMS) are extremely expensive to scale up enough to process such massive volumes of data. In the past, companies would reduce this cost by deleting raw data. With Hadoop, companies don’t have to do that. They can access legacy data at any time, which is extremely valuable for data analytics.
  • Hadoop is flexible – Hadoop handles structured and unstructured data. This means that companies can gain valuable insights from unstructured or raw data which would not be possible without Hadoop.
  • Hadoop is fast – Hadoop’s unique storage method is based on a distributed file system that maps data wherever it is located on a cluster. The tools for data processing are often on the same server as the data, meaning the data doesn’t need to be transferred to the processing tools first. This results in much faster data processing.
  • Hadoop is resilient – One key advantage of Hadoop is its resilience. If a node breaks down, no data is lost and processing can continue. This is because when data is written to one node, it is replicated to others too, so if one fails, there is always a copy that can be used.
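
The ‘fast’ point above relies on sending work to a server that already holds the data. The following is a minimal Python sketch of that idea, not Hadoop’s actual scheduler. The node names, the block_locations map and the pick_node function are all hypothetical and exist only for illustration.

```python
# Illustrative only: a toy locality-aware task placement, not Hadoop's scheduler.
# All names (block_locations, pick_node, node names) are hypothetical.

def pick_node(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; otherwise fall back
    to any free node (the block would then be read over the network)."""
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], True                 # data-local: no transfer needed
    if free_nodes:
        return sorted(free_nodes)[0], False   # remote read required
    return None, False                        # no capacity right now


block_locations = {
    "blk_1": ["node-a", "node-b", "node-c"],
    "blk_2": ["node-b", "node-d", "node-e"],
}
free_nodes = {"node-c", "node-f"}

print(pick_node("blk_1", block_locations, free_nodes))  # ('node-c', True)
print(pick_node("blk_2", block_locations, free_nodes))  # ('node-c', False)
```

In the first call the task lands on a node that already holds the block, so nothing has to be moved. In the second call no free node holds the block, so the data would have to travel to the task.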

Who uses Hadoop?

Any business with very large data sets has the potential to make use of Hadoop. At present, it’s estimated that 330 of the Fortune 500 companies use Hadoop.

Here are some examples:

  • AOL – AOL uses Apache Hadoop for many things ranging from generating data to running advanced algorithms for behavioral analysis and targeting.
  • eBay – eBay uses Apache Hadoop for search optimization and research.
  • Facebook – Facebook uses Apache Hadoop to store and analyze user data, as well as for machine learning.
  • Spotify – Spotify uses Apache Hadoop for content generation, data aggregation, reporting and analysis. This enables Spotify to work out royalties, recommend tracks to users based on listening habits and measure audience responses to any new features it introduces.
  • Yahoo – Yahoo has incorporated Hadoop into its web crawler for superior indexing and searching of webpages. Yahoo is also one of the largest users of and contributors to Hadoop. Doug Cutting, who created Hadoop with Mike Cafarella, joined Yahoo in 2006, and much of Hadoop’s early development took place there.

For a comprehensive list of companies using Apache Hadoop, visit https://wiki.apache.org/hadoop/PoweredBy


Is Hadoop a database?

No, Hadoop is not a database. Hadoop is a repository for data. It operates on a distributed file system (HDFS). You can use the file system to store information and to extract it, similar to a database. The difference is that you have to format data to put it into a database. With Hadoop, you can put raw data in no matter what the format is.

Is Hadoop a language?

No, Hadoop is not a language. Programming languages including Java and SQL are used to query Hadoop data.

Is Hadoop dead?

No, Hadoop is not dead. Hadoop is open source and constantly evolving. It is software that runs on a cluster of distributed computers, and its various components change all the time.

Is Hadoop easy to learn?

No, Hadoop is not easy to learn, mainly because of the extensive engineering resources required to implement Hadoop in a business. Knowing Java is no longer required, although it helps, and there are resources that can help you set up Hadoop.

Is Hadoop free?

Yes, Apache Hadoop is open source and free. But there are several supported distributions, some of which have associated fees.

Is Hadoop Java-based?

Yes. The Hadoop framework and ecosystem are mostly written in the Java programming language.

Is Hadoop NoSQL?

No, Hadoop is an ecosystem of software to enable the storage and processing of Big Data. NoSQL, also known as “Not Only SQL”, is a term used to describe databases with some particular features for querying Big Data quickly. So you can use a NoSQL database on top of Hadoop, but Hadoop itself is not NoSQL, in the same way that it is neither a relational database nor an object database.

Is Hadoop open source?

Yes, Hadoop is an open source technology. This means it can be accessed for free via Apache, but the process is not easy and often requires an experienced computer engineer with the help of online resources to download and implement it.

What is Hadoop Federation?

Hadoop Distributed File System (HDFS) Federation improves the existing HDFS architecture. It separates the namespace, which is the directory of data, from the storage, which is the data itself. In a traditional HDFS structure, there is only one namespace for the entire cluster. Hadoop Federation allows multiple namespaces in the cluster, which improves scalability and isolation. It also opens up the architecture, allowing for new implementations and use cases.
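
One way to picture multiple namespaces is as a table that maps parts of the directory tree to whichever namespace owns them. The Python sketch below is a toy model under that assumption and does not use any Hadoop API. The mount table, the namespace names and the resolve function are hypothetical.

```python
# Illustrative only: a toy mount table, not Hadoop's API.
# The namespace names and paths below are hypothetical.

MOUNT_TABLE = {
    "/user": "namenode-1",
    "/logs": "namenode-2",
    "/logs/archive": "namenode-3",
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Return the namespace that owns `path`, using the longest matching prefix."""
    matches = [
        prefix for prefix in mount_table
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        raise KeyError(f"no namespace is mounted for {path}")
    return mount_table[max(matches, key=len)]


print(resolve("/user/alice/report.csv"))   # namenode-1
print(resolve("/logs/2017/12/app.log"))    # namenode-2
print(resolve("/logs/archive/2016.tar"))   # namenode-3
```

Each namespace only has to track its own part of the tree, which is where the scalability and isolation benefits come from.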

What is the best laptop for Hadoop?

You are unlikely to run Hadoop in production on a single laptop, but a laptop can host a very small cluster for testing functionality rather than scale or performance. You could also access cloud-based Hadoop services from your laptop. In either case, look for a laptop with the following attributes:

  • A good processor with a lot of memory
  • Compatible with Linux

What are the features of Hadoop?

Here are a few key features of Hadoop:

  • Hadoop is flexible – Hadoop manages data whether structured or unstructured, encoded or formatted, or any other type of data. One of the biggest challenges of the past was handling unstructured data. Hadoop can do this and therefore is hugely valuable as unstructured data that was previously ignored can now be used in the decision making process.
  • Hadoop is scalable – This is a major feature of Hadoop, meaning that it can store and process thousands of terabytes in parallel across many nodes. If your business requires more data to be stored, then additional nodes or servers can be added easily.
  • Hadoop is resilient – Technology is not infallible. That is why this feature is particularly attractive. If a server or node breaks down, that doesn’t mean that the whole Hadoop process fails. This is because when data is sent to an individual node, a copy is made and sent to another. So if one node fails, there is always a copy and the process can continue unhindered.
  • Hadoop is fast – Hadoop is extremely good at high-volume batch processing because of its ability to do parallel processing.
  • Hadoop is cost-effective – Hadoop generates cost benefits by bringing massively parallel computing to low-cost servers, resulting in a substantial reduction in the cost per terabyte of storage.

How does querying work in Hadoop?

Hadoop was originally conceived as a platform that required software engineers to write code to access or query data. Initially this was done via a framework called MapReduce, and more recently via a number of alternative engines. For business users to query or access data in Hadoop through client tools such as Tableau, the Hadoop implementation would normally need to support SQL.

How does streaming work in Hadoop?

In Hadoop, streaming means that data is being processed as it enters the data lake. As data is being queued, Hadoop is analyzing it and constantly recalculating. It is extremely quick but not quite real time yet.
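
The phrase ‘constantly recalculating’ can be illustrated with an aggregate that is updated as each record arrives, rather than recomputed from scratch. This Python sketch is not a Hadoop API. The RunningAverage class and the sample readings are hypothetical.

```python
# Illustrative only: a running average updated per record, not a Hadoop API.
# RunningAverage and the sample readings are hypothetical.

class RunningAverage:
    """Keeps an aggregate current without re-reading earlier records."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new record into the aggregate and return the new average."""
        self.count += 1
        self.total += value
        return self.total / self.count


avg = RunningAverage()
for reading in [10.0, 20.0, 60.0]:
    print(avg.update(reading))   # 10.0, then 15.0, then 30.0
```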

How does Hadoop work?

Hadoop allows you to store and process large data volumes that otherwise would be cost prohibitive. There are two aspects to Hadoop: the way it processes data and the way it stores data.

  • It uses massively parallel processing (MPP), splitting the problem into components. Initially, Hadoop processed data using a programming engine called MapReduce. MapReduce allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. It is now less relevant because it is complicated to program and slow compared with the many other engines now available for Hadoop.
  • Hadoop has a distributed file system (HDFS) which enables the storage of very large data files with multiple copies for redundancy. The Hadoop Distributed File System is designed so that node failures are automatically handled. It stores multiple copies of the data by default, so if one copy is lost due to a node/disk failure the remaining copies can be used to generate a replacement copy.
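
The replacement-copy behaviour described above can be sketched as a toy model in Python. This is not HDFS itself. The functions place_replicas and fail_node, the node names and the round-robin placement are hypothetical, and the replica count of 3 is an assumption made for the example.

```python
# Illustrative only: a toy model of block replication, not HDFS itself.
# place_replicas, fail_node and the node names are hypothetical.

REPLICATION = 3  # assumed for this example; the replica count is configurable


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.
    Assumes replication <= len(nodes)."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def fail_node(placement, dead, nodes, replication=REPLICATION):
    """Drop the dead node, then copy under-replicated blocks to other live nodes."""
    live = [n for n in nodes if n != dead]
    for block, holders in placement.items():
        holders[:] = [n for n in holders if n != dead]
        if not holders:
            continue  # every copy was lost, so nothing is left to copy from
        for candidate in live:
            if len(holders) >= replication:
                break
            if candidate not in holders:
                holders.append(candidate)  # re-created from a surviving copy
    return placement


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(["blk_1", "blk_2"], nodes)
fail_node(placement, "node-b", nodes)
for block, holders in placement.items():
    print(block, holders)
```

After node-b fails, both blocks are back to three copies on the remaining nodes, which is why a single node or disk failure does not lose data.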

What is the Hadoop block size?

Hadoop Distributed File System (HDFS) stores files as data blocks and distributes these blocks across the entire cluster. As HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times to ensure data is not lost if there is a failure. Hadoop allows businesses to make as many replicas of one block as is deemed necessary.

The default size of the Hadoop block is 128 MB.
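
As a worked example of what the block size means in practice, the Python sketch below computes how many blocks a file occupies and how much raw disk space it consumes once replicated. The function names are hypothetical, and the replica count of 3 is an assumption for the example rather than something stated above.

```python
import math

BLOCK_SIZE_MB = 128   # default block size stated above
REPLICATION = 3       # assumed replica count for this example; it is configurable


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed; the last block may be only partly filled."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=REPLICATION):
    """Disk space consumed across the cluster once every block is replicated."""
    return file_size_mb * replication


print(block_count(1024))      # 8
print(block_count(300))       # 3 (128 + 128 + 44)
print(raw_storage_mb(300))    # 900
```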

What are the main Hadoop components?

The Hadoop ecosystem comprises four core components:

  • Hadoop Common – Hadoop Common is the Hadoop Core. It is a collection of common utilities and libraries that support other Hadoop modules. It is central to the Apache Hadoop Framework.
  • Hadoop Distributed File System (HDFS) – HDFS is a distributed file system that enables access to data across Hadoop clusters.
  • MapReduce – MapReduce was the original data processing engine that came with Hadoop. It allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. MapReduce refers to two different processes: mapping and reducing the data. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from the map job as input and combines those data tuples into a smaller set of tuples. It essentially reduces the data in order to extract the required data. It’s no longer as relevant because it is too complicated, difficult and slow. There are now lots of other engines available for Hadoop.
  • YARN – YARN stands for “Yet Another Resource Negotiator”. It allows multiple Hadoop applications to share a common set of resources.
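
The map and reduce steps described in the MapReduce entry above can be sketched in plain Python with a word count. This runs on a single machine and does not use the Hadoop API. The function names and the sample input are hypothetical, and the grouping step stands in for the work the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) key/value pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single, smaller result."""
    return key, sum(values)


lines = ["big data big cluster", "data node"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

On a real cluster the map calls would run in parallel on different servers, each over its own portion of the input, which is where the scalability comes from.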

What is a Hadoop data lake?

The nature of Hadoop means that stored data does not need to be formatted in any particular way. It is stored in a variety of formats and processed later. A repository that holds raw data in this way is often called a ‘Data Lake’. If the data is left unorganized and ungoverned, so that it becomes hard to find or trust, the repository is often called a ‘Data Swamp’.

What is Hadoop's ecosystem?

The Hadoop ecosystem is a set of technologies and solutions designed to run on a common platform that allow a Hadoop cluster to perform a whole range of data-related activities.

The Hadoop ‘ecosystem’ refers to the variety of products which have been developed to interact with and improve upon Hadoop’s processing capabilities. These include many open source tools like Spark, Hive, Pig, Oozie and Sqoop. There are also commercial Hadoop offerings from vendors such as Cloudera, Hortonworks and MapR, as well as other Hadoop ecosystem solutions such as Kognitio, Presto and Jethro Data.

What is a Hadoop cluster?

A Hadoop ‘cluster’ is the set of individual physical computers or servers (nodes) that have been networked together in order to store and analyze huge amounts of unstructured and structured data. Hadoop runs across the cluster, which can consist of hundreds or thousands of low-cost commodity computers or servers.

In the original (Hadoop 1.x) design, one server in the cluster is designated as the NameNode and another as the JobTracker; these are the masters. The rest of the servers in the cluster act as both DataNode and TaskTracker; these are the slaves. Additional servers can be added at any time as the amount of data grows. Most importantly, data is copied onto other cluster nodes so that if one of the servers fails, no data is lost.

What is the architecture of Hadoop?

Hadoop combines many individual low cost industry-standard servers to create a powerful distributed data storage and processing platform.

A basic Hadoop cluster will usually use HDFS as the data storage layer combined with some sort of MPP data processing engine. Initially Hadoop used MapReduce as its data processing and query engine although this has now largely been superseded by more performant solutions.

Hadoop follows a master-slave architecture. The master node for data storage in HDFS is the NameNode, and the master node for parallel processing of data in Hadoop MapReduce is the JobTracker. The slave nodes are the other servers in the Hadoop cluster, which store data and perform complex computations. Every slave node runs a TaskTracker daemon and a DataNode daemon, which synchronize with the JobTracker and NameNode respectively. The master and slave systems can be set up in the cloud or on-premises.

Hadoop 2.0 introduced YARN (Yet Another Resource Negotiator), which allows multiple Hadoop applications to share a common infrastructure.

What is the Hadoop stack?

The Hadoop stack is the underlying HDFS technology plus any other technology from the Hadoop ecosystem that can be added on to it in order to enhance its performance depending on the business needs and uses for the data.

[Image omitted. Image credit: edureka]

Will Hadoop replace data warehousing?

Traditionally the view has been that Hadoop cannot replace data warehousing because Hadoop is too slow to query for business users.

However, there are now high-performance SQL query engines available for Hadoop that can allow it to provide better interactive query performance than a traditional data warehouse solution.

What is the latest version of Hadoop?

Hadoop 3.0 was released in December 2017.

When did Hadoop come out?

The first version of Hadoop was released by Doug Cutting and Mike Cafarella in April 2006.

What are the past versions of Hadoop?

  • Apache Hadoop 3.0 Released 13th December 2017
  • Apache Hadoop 2.9 Released 17th November 2017
  • Apache Hadoop 2.8 Released 22nd March 2017
  • Apache Hadoop 2.7 Released 21st April 2015
  • Apache Hadoop 2.6 Released 18th November 2014
  • Apache Hadoop 2.5 Released 11th August 2014
  • Apache Hadoop 2.4 Released 7th April 2014
  • Apache Hadoop 2.3 Released 20th February 2014
  • Apache Hadoop 2.2 Released 15th October 2013
  • Apache Hadoop 1.0 Released 27th December 2011
  • Apache Hadoop 0.1.0 Released April 2006

Where did the name Hadoop come from?

Doug Cutting, the co-creator of Hadoop, named it after his son’s toy elephant.

Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, Nifi Registry, HAWQ, Zeppelin, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries. All other trademarks and registered trademarks are the property of their respective owners.