Hadoop Processes and Functions

What is combiner in Hadoop?

A Combiner is an optional step that can be added to a MapReduce job in Hadoop to reduce bandwidth usage. The Combiner sits between the Map tasks and the Reducers: it receives as input the data emitted by a Mapper, and its output is then sent on to the Reducers. In effect it is a “mini-reduce” which operates only on the data generated by a single node.
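As a sketch of how a combiner is wired in (the mapper.py and reducer.py scripts and the HDFS paths here are hypothetical), a Hadoop Streaming word-count job can reuse its reducer script as the combiner, because summing counts is associative:

```shell
# Hypothetical streaming word count: mapper.py emits (word, 1) pairs and
# reducer.py sums counts per word. The same reducer script is supplied as
# the combiner, so each node pre-aggregates its own mapper output before
# it is sent across the network to the real Reducers.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /data/text \
  -output /data/wordcount \
  -mapper mapper.py \
  -combiner reducer.py \
  -reducer reducer.py \
  -file mapper.py \
  -file reducer.py
```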

What is federation in Hadoop?

Hadoop Distributed File System (HDFS) Federation improves the existing HDFS architecture. It separates the namespace, which is the directory of data, from the storage, the data itself. In a traditional HDFS structure, there was only one namespace for the entire cluster. Hadoop Federation allows multiple namespaces in the cluster which improves scalability and isolation. Hadoop Federation also opens up the architecture, allowing for new implementations and use cases.

What is format NameNode in Hadoop?

The NameNode in Hadoop is the process which controls HDFS, the distributed file storage system in Hadoop. Formatting the NameNode initializes a brand-new file system: it creates an empty namespace (the metadata catalogue), discarding any record of previously stored files, and makes HDFS ready for new files to be added.

You would normally format a NameNode after creating a brand new Hadoop cluster, but this is not normally necessary when using a Hadoop distribution like MapR, Hortonworks or Cloudera.
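On a hand-built Apache Hadoop cluster the formatting step is a single command, run once before the cluster is started (run it as the HDFS superuser; it destroys any existing metadata):

```shell
# Initialize a brand-new HDFS namespace. Run exactly once, before
# starting the NameNode for the first time -- reformatting an existing
# cluster discards all knowledge of the data already stored in it.
hdfs namenode -format
```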

What are file formats in Hadoop?

A file format describes how information is stored in a file; PNG, JPG and GIF are common formats for images, for example. Hadoop’s file system can store data in any of these traditional formats, but the ecosystem also has its own file formats designed for structured and unstructured big data. Choosing an appropriate file format means that data can be stored and processed much more efficiently. Which format you use depends on the purpose of your data set and what you are trying to achieve.

Some common storage formats for Hadoop include:

  • Plain text storage (e.g. CSV, TSV files)
  • Sequence Files
  • Avro
  • Parquet

What is fsck in Hadoop?

The fsck command runs a health check on the Hadoop Distributed File System (HDFS), similar to the Linux fsck command for checking a local file system. It will identify missing or corrupt blocks of data.
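For example, the check can be run over the whole file system, or scoped to report just the corrupt files:

```shell
# Check the whole file system, reporting files, their blocks and
# where each block replica lives:
hdfs fsck / -files -blocks -locations

# List only files with corrupt blocks:
hdfs fsck / -list-corruptfileblocks
```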

Hadoop Basics

What is a Hadoop cluster?

In Hadoop, ‘Cluster’ is used to describe all the individual physical computers or servers (nodes) that have been networked together. A Hadoop cluster is designed specifically to store and analyze huge amounts of structured and unstructured data.

What is Apache Hive?

Apache Hive™ is the default SQL-like interface for Hadoop, providing data summarization, querying and analysis.

What is Hadoop Hue?

Hue stands for Hadoop User Experience. It is an analytics workbench that supports a whole suite of applications for analyzing data with Apache Hadoop such as:

  • FileBrowser
  • Beeswax
  • Impala
  • Oozie
  • Pig
  • HBase Browser
  • Table Browser
  • Search
  • Job Browser
  • Job Designer

What is the Hadoop kill application?

Hadoop typically runs applications under YARN. YARN applications can be “killed” using the YARN ResourceManager GUI (using the kill button on the application’s page) or via the “yarn” command line (yarn application -kill $ApplicationId). Older versions of Hadoop, which don’t have YARN, used the “hadoop” command to kill MapReduce jobs (hadoop job -kill $jobId).
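A typical command-line sequence is to list the running applications to find the id, then kill the one you want (the application id below is only an example):

```shell
# Find the application id of the running job:
yarn application -list

# Kill it by id (example id shown):
yarn application -kill application_1700000000000_0001
```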

How do I list files on Hadoop?

Hadoop stores files using the HDFS sub-system. Files can be listed using the “hadoop” command e.g. hadoop fs -ls /path

You can also browse Hadoop files using the NameNode GUI page using the browse files option under the utilities menu.
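A few common variations of the listing command (the path is hypothetical):

```shell
hadoop fs -ls /user/alice       # list one directory
hadoop fs -ls -R /user/alice    # recurse into subdirectories
hadoop fs -ls -h /user/alice    # show human-readable file sizes
```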

What is a Hadoop multi node cluster?

A Hadoop multi node cluster simply means many machines or servers connected to each other. Hadoop was created to work across a multi node cluster: several machines or servers working together to store and analyze big data efficiently, cost-effectively and resiliently.

What are nodes in Hadoop?

In Hadoop, nodes are servers. There can be hundreds of nodes in a cluster. There are two main node types. These are the master nodes and the slave (worker) nodes.

For example, a small Hadoop cluster will include a single master and multiple slave nodes. The master node runs the JobTracker, TaskTracker, NameNode and DataNode processes, while a slave node acts as both a DataNode and a TaskTracker. It is possible to have data-only and compute-only worker nodes, but this is not a standard configuration.

In a larger cluster, you can have more than one master node with primary and secondary NameNodes. HDFS nodes are managed through a dedicated primary NameNode server to host the file system index, and a secondary NameNode that can replicate the NameNode’s memory structures, thereby preventing file-system corruption and loss of data.

Understanding the different functions of the nodes and how they work together is important in order to configure the cluster correctly for your big data needs.

What are ports in Hadoop?

Hadoop uses many ports for different functions. Some of these ports are used by Hadoop’s daemons to communicate amongst themselves (to schedule jobs, replicate blocks, etc.). Other ports listen directly for user connections, either via an interposed Java client, which communicates via internal protocols, or via plain old HTTP.

The reference at https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_ports_cdh5.html is a useful quick guide both for users, who need to remember the correct port numbers, and for systems administrators, who need to configure firewalls accordingly.

What are Hadoop projects?

Apache Hadoop is the framework. The Apache projects that make up the Hadoop ecosystem deliver different solutions to big data problems, mostly with regard to migration, integration, scalability, data analytics and streaming analysis.

The Hadoop ‘ecosystem’ refers to the variety of projects which have been developed to interact with and improve upon Hadoop’s processing capabilities. These include many open source tools like Spark, Hive, Pig, Oozie and Sqoop. There are also commercial Hadoop offerings from vendors such as Cloudera, Hortonworks and MapR.

What is a query in Hadoop?

Hadoop enables you to store and process data volumes that otherwise would be cost prohibitive.

A query is the process of interrogating the data that has been stored in Hadoop, generally to help provide business insight.

A query can be coded by an engineer / data scientist or can be a SQL query generated by a tool or application.

How to remove directory in Hadoop?

Removing a directory or file from the Hadoop Distributed File System is easy, and the removal commands work similarly to their Linux counterparts. When a directory is deleted, it is moved to a trash directory, which acts as a built-in safety mechanism against accidental file and directory removal. Note, however, that the trash must be enabled (via the fs.trash.interval setting) for this to work.

The command to remove an empty directory:
hadoop fs -rmdir directory_name

To remove a directory containing files (and all the files in it):
hadoop fs -rm -r directory_name
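Two related commands are worth knowing when working with the trash:

```shell
# Bypass the trash and delete immediately (irreversible):
hadoop fs -rm -r -skipTrash directory_name

# Permanently empty the current user's trash directory:
hadoop fs -expunge
```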

What is Hadoop unstructured data?

Unstructured data is data that has not been organised into any structure. Structured data has been organized into tables of rows and columns, where relationships exist between the tables. Data which doesn’t have this format, such as email text, video and social media data, is classed as unstructured. Unstructured data is generally processed to give it structure before it is analysed. The beauty of Hadoop is that, unlike traditional databases, it can store and manage unstructured data as well as structured data.

What is Hadoop xfs?

XFS is a Linux file system that can be used under Hadoop to store structured and unstructured data.

The Hadoop Distributed File System sits on top of an underlying local file system on each node, and XFS is one of the possible choices. XFS offers better disk space utilization than ext3, another common Linux file system, and has much quicker disk-formatting times. This means it is quicker to bring a new DataNode into service using XFS.

What is xml configuration file in Hadoop?

Hadoop stores its configuration as a set of XML files in a configuration directory whose location depends on the Hadoop distribution being used.

Useful configuration files include core-site.xml (core Hadoop configuration) and yarn-site.xml (YARN configuration).

In some Hadoop distributions you can edit these directly but in others you should go through the Hadoop distribution’s tools in order to edit them.
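As an illustration, a minimal core-site.xml usually points clients at the NameNode; the host name and port below are assumptions and will differ between clusters:

```xml
<configuration>
  <!-- Default file system URI; clients use this to reach the NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```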

What are Hadoop zip files?

Hadoop deals with huge data files, and ZIP files are just one way of storing data in Hadoop. Because of the large size of the data, files may be zipped before being loaded into Hadoop. For data generated within Hadoop, however, users are more likely to use one of its native compressed formats.

What is DataNode in Hadoop?

A DataNode is part of the Hadoop cluster and connects to the master server, the NameNode. The DataNode is where data is stored and processed in Hadoop, and there are usually several DataNodes (servers) in a cluster.

What is NameNode in Hadoop?

A NameNode is a central part of the Hadoop Distributed File System (HDFS). It is a master node, and in a very large cluster there can be multiple NameNodes. The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster each file’s data is kept; it holds all the metadata for the data stored on the DataNodes. It doesn’t store the data itself but rather acts as a catalogue or index for all the data in Hadoop.

What are racks in Hadoop?

Hadoop is built from clusters of individual industry-standard servers. The individual servers are housed in physical racks. A typical rack would hold between 10 and 40 individual servers depending on the server type. Typically servers in one rack would connect to a “rack network switch” which would then be connected to another central network switch. Racks make it possible to contain a lot of equipment in a small physical footprint without requiring shelving.

What is reducer in Hadoop?

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of separate tasks. It is made up of two phases: mapping and reducing.

The first phase is mapping. A list of data elements is provided, one at a time, to the Mapper, which transforms each element separately into an output data element.

The second phase is called reducing. Reducing lets you aggregate values together. A reducer function receives input values from an input list. It then combines these values together, returning a single output value. Reducing is often used to reduce a large amount of data into a summary.
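The two phases can be sketched with an ordinary Unix pipeline running a word count, the canonical MapReduce example: tokenizing plays the part of the map, sort stands in for the shuffle, and uniq -c performs the reduce.

```shell
# map: split the input into one word per line
# shuffle: sort so that identical words become adjacent
# reduce: count each run of identical words (e.g. "2 the")
printf 'the cat sat on the mat\n' | tr ' ' '\n' | sort | uniq -c
```

In a real MapReduce job the same three stages run in parallel across many nodes, with the framework handling the sort-and-group step between the mappers and the reducers.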

What is replication factor in Hadoop?

The Hadoop Distributed File System (HDFS) replicates blocks of data across several different nodes to ensure that if one node fails, there is always a copy of the data on another one so nothing is lost. The number of copies made is called the replication factor.
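The replication factor can also be changed per file from the command line (the path below is hypothetical):

```shell
# Set the replication factor of an existing file to 3 and wait (-w)
# until the extra copies have actually been made:
hdfs dfs -setrep -w 3 /data/important.csv
```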

What is Hive in Hadoop?

Apache Hive is the default SQL-like interface for Hadoop providing data, querying and analysis.

What is MapReduce in Hadoop?

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It is less relevant today because it is complicated to program and comparatively slow; there are now many other processing options available on a Hadoop cluster.

What is Pig in Hadoop?

Apache Pig is an application in the Hadoop ecosystem. It is an open-source technology that can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Ultimately, Pig optimizes and quickens the process of extracting data from Hadoop.

Hadoop Basic Commands

What are Hadoop HDFS Commands?

Running commands in the Hadoop shell is essential. Whether you are moving data in HDFS or changing the configuration of the cluster, these tasks can all be done by entering commands in the shell.

The File System (FS) shell includes many shell-like commands that interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, WebHDFS, S3 FS, and others. FS commands are invoked via the bin/hadoop script (e.g. hadoop fs -ls /) or, for HDFS specifically, via bin/hdfs dfs.

What are Hadoop FS Commands?

Hadoop FS commands are File System commands. The File System (FS) shell includes various commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. FS commands are invoked via the bin/hadoop script (e.g. hadoop fs -ls /) or, for HDFS specifically, via bin/hdfs dfs.

What is Hadoop Replication?

The Hadoop Distributed File System (HDFS) replicates blocks of data across several different nodes to ensure that if one node fails, there is always a copy of the data on another one so nothing is lost. The number of copies made is called the replication factor.

What are the Hadoop Shell Commands?

Common Hadoop Shell commands are:

  • hadoop (mostly for file operations)
  • hdfs (for file and file-system wide operations)
  • yarn (for controlling applications)

Each of these has sub-commands which are given as the first argument (e.g. “hadoop fs”), with additional subcommand-specific arguments supplied after that. Running any of them without arguments shows a list of its subcommands.
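One representative invocation of each command:

```shell
hadoop fs -ls /          # file operations via the hadoop command
hdfs dfsadmin -report    # file-system-wide status via the hdfs command
yarn application -list   # running applications via the yarn command
```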

What is Hadoop High Availability?

In a traditional Hadoop cluster, there is only one master server, the NameNode, which acts as a directory of all the data held on the DataNodes. If a DataNode fails, Hadoop can still function because data is always replicated to other nodes. However, if the NameNode fails, the whole cluster comes to a halt. To ensure high availability, an additional standby NameNode can be added to the cluster, so that if the active NameNode fails, the standby can step in.
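On a cluster with HDFS high availability configured, you can check which NameNode is currently active; the service ids “nn1” and “nn2” below are assumptions and correspond to whatever names are defined in hdfs-site.xml:

```shell
# Report whether each configured NameNode is "active" or "standby":
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
```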

What is Hadoop Version Command?

The “hadoop version” command will show you what version of Hadoop you are running.

Can I install Hadoop on my PC?

Hadoop is normally installed on Linux and can be installed on any PC running Linux. If you’re running another operating system, you could install Hadoop in a Linux virtual machine.

How do I install Hadoop on Linux?

Hadoop is packaged up by many different vendors in many different ways and each of these Hadoop distributions has its own installation procedure. You should go to that Hadoop distributor’s website to find installation guides.

To install Apache Hadoop, go to the Apache website and follow their instructions.

Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, Nifi Registry, HAWQ, Zeppelin, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries. All other trademarks and registered trademarks are the property of their respective owners.