Apache Hadoop® is an open-source, Java-based data analytics framework for storing and processing extremely large data sets.
It operates across hundreds or thousands of computing nodes or servers working in parallel on top of a distributed file system known as HDFS (the Hadoop Distributed File System). Because HDFS replicates data across nodes, a single disk or node failure does not lead to a loss of data.
Hadoop enables businesses to store massive data sets and apply large amounts of processing power to them, running numerous concurrent jobs and sorting through data at phenomenal speed. Hadoop emerged as a solution to the problem of processing big data.
Hadoop is a project sponsored by the Apache Software Foundation.
Big Data is a set of data commonly characterized by the ‘Three Vs’: volume, velocity and variety.
In more detail, Big Data is a massive collection of data, generated in quantities of many terabytes or more. It cannot be processed by traditional methods because of its sheer volume and because much of it is generated in unstructured or semi-structured form.
Hadoop technology allows you to store and process large data volumes that would otherwise be cost-prohibitive.
There are plenty of Hadoop books available, whatever your level of proficiency; they range from guides for complete beginners to references for experienced Hadoop users.
One of the best ways to learn Hadoop is through the many online Hadoop courses, some paid-for and some free. Here are a few of the best online Hadoop training courses:
The Hadoop framework is open source, meaning it can be accessed for free via Apache. However, installing it from scratch is not simple, and even seasoned computer engineers may struggle to install it correctly (and it can be a time-intensive process). It is simpler to download it from one of the main Hadoop distributors, which are:
Any business with very large data sets has the potential to make use of Hadoop. At present, it’s estimated that 330 of the Fortune 500 companies use Hadoop.
Here are some examples:
For a comprehensive list of companies using Apache Hadoop, visit https://wiki.apache.org/hadoop/PoweredBy
No, Hadoop is not a database. Hadoop is a repository for data that operates on a distributed file system (HDFS). You can use the file system to store information and to extract it, much as you would with a database. The difference is that a database requires you to format data before loading it, whereas Hadoop lets you store raw data in whatever format it arrives.
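As a minimal sketch of what “putting raw data in” looks like in practice, the snippet below uses Hadoop’s Java FileSystem API to write an arbitrary, unformatted record into HDFS. The file path and record contents are hypothetical, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration files on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS must point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write an arbitrary, unformatted record straight into HDFS --
        // no schema or table definition is needed up front.
        Path target = new Path("/landing/raw/events-0001.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(target)) {
            out.write("2024-01-01T00:00:00Z,clicked,some free-form payload"
                    .getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

The same effect can be achieved from the command line with `hdfs dfs -put <localfile> <hdfs path>`, which copies a local file into HDFS as-is.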
No, Hadoop is not a language. Programming languages such as Java, and query languages such as SQL, are used to work with Hadoop data.
No, Hadoop is not dead. Hadoop is open source and constantly evolving. It is a software stack that runs across a cluster of distributed computers, and its various components continue to change all the time.
No, Hadoop is not easy to learn, mainly because of the extensive engineering resource required to implement it in a business. It is no longer essential to know Java, although it helps, and there are plenty of resources available to help you set up Hadoop.
Recommended resources for learning are:
Yes, Apache Hadoop is open source and free. But there are several supported distributions, some of which have associated fees.
Yes. The Hadoop framework and its ecosystem are mostly written in the Java programming language.
No, Hadoop is an ecosystem of software for storing and processing Big Data. NoSQL, also known as “Not Only SQL”, is a term used to describe databases with particular features for querying Big Data quickly. You can use a NoSQL database (for example, HBase) on top of Hadoop, but Hadoop itself is not NoSQL, in the same way that it is not a relational database or an object database.
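To make the distinction concrete, the sketch below uses the HBase Java client API (HBase being one NoSQL store that runs on top of HDFS) to write and read a single cell. It assumes an HBase cluster is reachable via an hbase-site.xml on the classpath and that a table named “events” with a column family “cf” already exists; the table, column and row names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOnHadoop {
    public static void main(String[] args) throws Exception {
        // Reads ZooKeeper/cluster settings from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write one cell: row key "user-42", column cf:last_action (hypothetical names).
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("last_action"), Bytes.toBytes("login"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("last_action"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```

The data itself still lives in HDFS files managed by HBase; the NoSQL layer simply provides fast key-based reads and writes on top of Hadoop’s storage.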
Yes, Hadoop is an open-source technology. This means it can be downloaded for free from Apache, but the process is not easy and often requires an experienced computer engineer, aided by online resources, to download and implement it.
Hadoop Distributed File System (HDFS) Federation improves on the original HDFS architecture by separating the namespace (the directory of the data) from the storage (the data itself). In a traditional HDFS deployment there is only one namespace for the entire cluster; Federation allows multiple namespaces in the cluster, which improves scalability and isolation. Federation also opens up the architecture, allowing for new implementations and use cases.
It is unlikely that you would run a full Hadoop cluster on a single laptop, but you can install a small, single-node cluster to test functionality (rather than scale or performance). You could also access cloud-based Hadoop services from your laptop. In that case, the recommendation is to look for laptops with the following attributes:
Here are a few key features of Hadoop:
Hadoop was originally conceived as a platform that required software engineers to write code to access or query data, initially via a framework called MapReduce and more recently via a number of alternative engines. For business users to query or access data in Hadoop through client tools such as Tableau, the Hadoop implementation normally needs to support SQL.
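To illustrate what writing code against the original MapReduce framework involves, here is the classic word-count job in Java. It is a minimal sketch rather than production code: the input and output paths are supplied as command-line arguments and are assumed to be HDFS directories.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in every input line.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this would typically be packaged as a jar and submitted to the cluster with `hadoop jar`; it is exactly this kind of hand-written code that SQL-on-Hadoop engines aim to spare business users from.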
In Hadoop, streaming means that data is processed as it enters the data lake. As data is queued, Hadoop analyzes it and constantly recalculates. It is extremely quick, but not quite real time.
Hadoop allows you to store and process large data volumes that would otherwise be cost-prohibitive. There are two aspects to Hadoop: the way it stores data and the way it processes data.
The Hadoop Distributed File System (HDFS) stores files as data blocks and distributes these blocks across the entire cluster. Because HDFS was designed to be fault-tolerant and to run on commodity hardware, blocks are replicated a number of times (three by default) to ensure data is not lost if there is a failure. Hadoop allows businesses to keep as many replicas of each block as they deem necessary.
The default size of the Hadoop block is 128 MB.
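As a small illustration, the sketch below uses the Java FileSystem API to read the cluster’s default block size and replication factor, and the values for an individual file. The file path is hypothetical, and the cluster settings are assumed to come from the configuration files on the classpath (the controlling properties are dfs.blocksize and dfs.replication).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Cluster-wide defaults: dfs.blocksize is normally 128 MB, dfs.replication is normally 3.
        System.out.println("Default block size:  " + fs.getDefaultBlockSize(new Path("/")));
        System.out.println("Default replication: " + fs.getDefaultReplication(new Path("/")));

        // Per-file values for an existing file (hypothetical path).
        Path file = new Path("/landing/raw/events-0001.log");
        FileStatus status = fs.getFileStatus(file);
        System.out.println(file + " block size:  " + status.getBlockSize());
        System.out.println(file + " replication: " + status.getReplication());

        // Replication can be raised or lowered per file if the cluster default is not appropriate.
        fs.setReplication(file, (short) 2);
    }
}
```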
The Hadoop ecosystem comprises four core components:
The nature of Hadoop means that stored data does not need to be formatted in any particular way: it can be stored raw, in a variety of formats, and processed later. A repository of such raw data is commonly called a ‘Data Lake’; when that data is left unmanaged and uncurated, it is often called a ‘Data Swamp’.
The Hadoop ecosystem is a set of technologies and solutions designed to run on a common platform that allow a Hadoop cluster to perform a whole range of data-related activities.
The Hadoop ‘ecosystem’ refers to the variety of products that have been developed to interact with and improve upon Hadoop’s processing capabilities. These include many open-source tools such as Spark, Hive, Pig, Oozie and Sqoop. There are also commercial Hadoop offerings from vendors such as Cloudera, Hortonworks and MapR, as well as other Hadoop ecosystem solutions such as Kognitio, Presto and Jethro Data.
A Hadoop ‘cluster’ describes all the individual physical computers or servers (nodes) that have been networked together in order to store and analyze huge amounts of structured and unstructured data. Hadoop runs across the cluster, which can consist of hundreds or thousands of low-cost commodity computers or servers.
One server in the cluster is designated as the NameNode and another as the JobTracker; these are the masters. The rest of the servers in the cluster act as both DataNode and TaskTracker; these are the slaves. Additional servers can be added at any time as the amount of data grows. Most importantly, data is replicated onto other cluster nodes so that if one of the servers fails, no data is lost.
Hadoop combines many individual low-cost, industry-standard servers to create a powerful distributed data storage and processing platform.
A basic Hadoop cluster will usually use HDFS as the data storage layer combined with some sort of massively parallel processing (MPP) data engine. Initially Hadoop used MapReduce as its data processing and query engine, although this has now largely been superseded by more performant solutions.
Hadoop follows a master-slave architecture. The master node for data storage in Hadoop HDFS is the NameNode, and the master node for parallel processing of data in Hadoop MapReduce is the JobTracker. The slave nodes are the other servers in the Hadoop cluster; they store data and perform the computations. Every slave node runs a TaskTracker daemon and a DataNode, which synchronize their work with the JobTracker and NameNode respectively. The master and slave systems can be set up in the cloud or on-premises.
Hadoop 2.0 introduced YARN (Yet Another Resource Negotiator), which allows multiple Hadoop applications to share a common infrastructure.
The Hadoop stack is the underlying HDFS technology plus whatever other technologies from the Hadoop ecosystem are added on top of it to enhance its performance, depending on the business’s needs and uses for the data.
Traditionally, the view has been that Hadoop cannot replace a data warehouse because Hadoop is too slow for business users to query.
However, there are now high-performance SQL query engines available for Hadoop that can give it better interactive query performance than a traditional data warehouse solution.
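As a hedged illustration of how a client application might use one such engine, the snippet below issues a SQL query over JDBC to a HiveServer2 endpoint; the host name, credentials and the “web_events” table are assumptions for illustration only, and the Hive JDBC driver would need to be on the classpath. Other SQL-on-Hadoop engines expose similar JDBC interfaces.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoop {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; requires the Hive JDBC driver on the classpath.
        String url = "jdbc:hive2://hive-server.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, COUNT(*) FROM web_events GROUP BY region")) {
            // Print each region and its row count, tab-separated.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

From the business user’s point of view, a BI tool such as Tableau connects through exactly this kind of JDBC/ODBC interface, so no Hadoop-specific code has to be written.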
Hadoop 3.0 was released in December 2017.
The first version of Hadoop was released by Doug Cutting and Mike Cafarella in April 2006.
Hadoop co-founder Doug Cutting named the project after his son’s toy elephant.