Hadoop Components

What is HDFS?

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop: a distributed file system that spreads data across the machines of a cluster.

In the Hadoop architecture, a daemon called the NameNode runs on the master node for HDFS, and a daemon called the DataNode runs on every slave node. For this reason the slave nodes are themselves often called DataNodes. The NameNode stores metadata and manages the DataNodes, while the DataNodes store the data and do the actual work of reading and writing it.

HDFS is a highly fault-tolerant, distributed, reliable and scalable file system for data storage.

HDFS is designed to handle huge volumes of data, with expected file sizes ranging from gigabytes to terabytes. A file is split into blocks (128 MB by default) that are distributed across multiple machines. Each block is replicated according to the replication factor (3 by default), and the replicas are stored on different nodes, so the data survives the failure of a node in the cluster. For example, with the default block size, a 640 MB file is broken into 5 blocks of 128 MB each.
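The block arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation, not HDFS code: the function names are made up for illustration, and the replication factor of 3 is an assumption (it is the usual HDFS default).

```python
BLOCK_SIZE_MB = 128      # HDFS default block size, as stated above
REPLICATION_FACTOR = 3   # assumption: the usual HDFS default


def block_sizes(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block that a file of the given size splits into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


def raw_storage_mb(file_size_mb, replication=REPLICATION_FACTOR):
    """Total cluster storage consumed once every block has been replicated."""
    return file_size_mb * replication


print(block_sizes(640))     # five full blocks of 128 MB
print(block_sizes(700))     # five full blocks plus one 60 MB block
print(raw_storage_mb(640))  # storage used across the cluster, in MB
```

Note that a file whose size is not a multiple of the block size ends with a smaller final block; the last block is not padded to 128 MB.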

What is MapReduce?

In this Hadoop Basics Tutorial, it is now time to understand one of the most important pillars of Hadoop: Hadoop MapReduce. MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is the heart of Hadoop: it moves computation close to the data, because moving a huge volume of data across the network would be very costly. This allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.

Hence, Hadoop MapReduce is a framework for distributed processing of huge data sets over a cluster of nodes. Because the data is stored in a distributed manner in HDFS, MapReduce can run map and reduce tasks on it in parallel.
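To make the programming model concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It does not use the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample input are invented for illustration, and everything runs in one Python process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for one input split handled by an independent map task.
splits = ["big data on hadoop", "hadoop stores big data"]

mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` depends only on its own split, the map tasks could run on different nodes at the same time, which is the property the framework exploits.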

What is YARN Hadoop?

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. In a multi-node cluster it becomes very complex to manage, allocate and release resources (CPU, memory, disk). YARN manages these resources and allocates them on request to any application.

On the master node, the ResourceManager daemon runs for YARN, and on every slave node a NodeManager daemon runs.
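The allocate-on-request idea can be illustrated with a toy model. This is not the YARN API and not YARN's actual scheduling policy: the classes `Node` and `ToyResourceManager`, the first-fit strategy, and the node names and capacities are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Free capacity of one worker node (hypothetical model of a NodeManager report)."""
    name: str
    free_memory_mb: int
    free_vcores: int


class ToyResourceManager:
    """Grants a container on the first node with enough free capacity (first fit)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, memory_mb, vcores):
        """Return the name of the node that hosts the container, or None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return node.name
        return None  # no node can satisfy the request right now

    def release(self, node_name, memory_mb, vcores):
        """Give a finished container's resources back to its node."""
        for node in self.nodes:
            if node.name == node_name:
                node.free_memory_mb += memory_mb
                node.free_vcores += vcores
                return


rm = ToyResourceManager([Node("worker-1", 4096, 2), Node("worker-2", 8192, 4)])
first = rm.allocate(memory_mb=3072, vcores=2)   # fits on worker-1
second = rm.allocate(memory_mb=3072, vcores=1)  # worker-1 has no room left
third = rm.allocate(memory_mb=16384, vcores=1)  # larger than any single node
print(first, second, third)
```

The point of the sketch is the division of labor: one central component tracks free capacity and answers requests, while the per-node state mirrors what each worker has available.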

Hadoop Daemons

Daemons are the processes that run in the background. There are mainly 4 daemons which run for Hadoop.

NameNode – Runs on the master node for HDFS.
DataNode – Runs on the slave nodes for HDFS.
ResourceManager – Runs on the master node for YARN.
NodeManager – Runs on the slave nodes for YARN.

These 4 daemons must run for Hadoop to be functional. Apart from these, there can be others, such as the Secondary NameNode, Standby NameNode and JobHistoryServer.

How does Hadoop work?

So far in this Hadoop training we have studied the Hadoop introduction and the Hadoop architecture in detail. Let us now summarize how Apache Hadoop works, step by step:

i) The input data is broken into blocks of 128 MB (by default), and the blocks are moved to different nodes.

ii) Once all the blocks of the file are stored on DataNodes, a user can process the data.

iii) The master then schedules the program (submitted by the user) on individual nodes.

iv) Once all the nodes have processed the data, the output is written back to HDFS.
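The four steps above can be traced in a small simulation. This is a sketch under heavy simplifications, not how Hadoop is implemented: the helper names are hypothetical, blocks are 10 bytes instead of 128 MB, placement is plain round-robin, replication is ignored, and the "output" is just a Python value instead of a file in HDFS.

```python
def split_into_blocks(data, block_size):
    """Step i: cut the input into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, node_names):
    """Steps i-ii: store the blocks on nodes (round-robin is a simplification)."""
    placement = {name: [] for name in node_names}
    for index, block in enumerate(blocks):
        placement[node_names[index % len(node_names)]].append(block)
    return placement


def run_job(placement, task):
    """Step iii: run the user's task on each node against its local blocks."""
    return {name: [task(block) for block in blocks]
            for name, blocks in placement.items()}


data = b"a" * 10 + b"b" * 10 + b"a" * 5  # 25 bytes of stand-in input
blocks = split_into_blocks(data, block_size=10)
placement = place_blocks(blocks, ["node-1", "node-2"])

# The user's program: count how often the byte "a" occurs in one block.
partial = run_job(placement, lambda block: block.count(b"a"))

# Step iv: combine the per-node results into the final output.
total = sum(sum(results) for results in partial.values())
print(len(blocks), partial, total)
```

Each node only ever touches the blocks stored on it, which is the data-locality idea behind scheduling the program where the data already lives.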

Hadoop Flavors

This section of Hadoop Tutorial talks about the various flavors of Hadoop.

Apache – The vanilla flavor; the actual code resides in the Apache repositories.
Hortonworks – A popular distribution in the industry.
Cloudera – Widely regarded as the most popular distribution in the industry.
MapR – Replaces HDFS with its own file system implementation, which it presents as faster than the stock HDFS.
IBM – A proprietary distribution known as BigInsights.

Most major database vendors provide native connectivity with Hadoop for fast data transfer. For example, to transfer data from Oracle to Hadoop, you need a connector.

All flavors are almost the same, so if you know one, you can easily work with the others as well.

Hadoop Ecosystem Components

In this section of the Hadoop tutorial, we cover the Hadoop ecosystem components. Let us see which components form the Hadoop ecosystem:

Hadoop HDFS – Distributed storage layer for Hadoop.
Yarn Hadoop – Resource management layer introduced in Hadoop 2.x.
Hadoop Map-Reduce – Parallel processing layer for Hadoop.
HBase – A column-oriented NoSQL database that runs on top of HDFS. It does not support structured (SQL) queries natively and is well suited to sparse data sets.
Hive – Apache Hive is a data warehousing infrastructure based on Hadoop that enables easy data summarization using SQL-like queries.
Pig – A high-level scripting language used with Hadoop. Pig enables writing complex data processing without Java programming.
Flume – It is a reliable system for efficiently collecting large amounts of log data from many different sources in real-time.
Sqoop – A tool designed to transfer huge volumes of data between Hadoop and RDBMS.
Oozie – A Java web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.
Zookeeper – A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Mahout – A library of scalable machine-learning algorithms, implemented on top of Apache Hadoop and using the MapReduce paradigm.

Conclusion

In conclusion to this Big Data tutorial, we can say that Apache Hadoop is one of the most popular and powerful big data tools. It stores huge amounts of data in a distributed manner and processes the data in parallel on a cluster of nodes. It provides a reliable storage layer (HDFS), a batch processing engine (MapReduce) and a resource management layer (YARN). Four daemons (NameNode, DataNode, NodeManager and ResourceManager) run in Hadoop to keep it functional.